TRANSCRIPT
Fast Kernel-Density-Based Classification and Clustering
Using P-Trees
Anne Denton
Major Advisor: William Perrizo
Outline
- Introduction
- P-Trees: concepts, implementation
- Kernel methods
- Paper 1: Rule-Based Classification
- Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
- Paper 3: Hierarchical Clustering
- Outlook
Introduction
- Data mining: extracting information from data; considers storage issues
- P-tree approach: bit-column-based storage
  - Compression
  - Hardware optimization
  - Simple index construction
  - Flexibility in storage organization
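As an illustration of bit-column-based storage, the following sketch (hypothetical Python, not the thesis's Java/C++ implementation) decomposes an integer attribute into one bit vector per bit position, so that a count query over high-order bits becomes bitwise AND plus a population count:

```python
def bit_columns(values, bits=8):
    """Decompose integer values into one column (list of 0/1) per bit
    position, most significant bit first."""
    return [[(v >> b) & 1 for v in values] for b in reversed(range(bits))]

def count_matching(columns, pattern):
    """Count rows whose high-order bits equal `pattern` (a list of 0/1),
    by ANDing the corresponding bit columns."""
    mask = [1] * len(columns[0])
    for col, want in zip(columns, pattern):
        mask = [m & (c if want else 1 - c) for m, c in zip(mask, col)]
    return sum(mask)

values = [5, 7, 4, 13]              # 4-bit values: 0101, 0111, 0100, 1101
cols = bit_columns(values, bits=4)
print(count_matching(cols, [0, 1]))  # values whose high bits are '01' -> 3
```

In the real P-tree representation these bit columns are additionally compressed into quadrant trees; the sketch only shows why vertical bit decomposition turns counting into cheap bitwise operations.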
P-Tree Concepts
- Ordering (details); new: generalized Peano order sorting
- Compression
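Peano order sorting is related to bit interleaving (Morton/Z-order keys), which keeps records that agree in their high-order bits adjacent and thereby improves compression. The sketch below shows a plain interleaved sort key as a simplified illustration, not the generalized variant developed in the thesis:

```python
def peano_key(values, bits=8):
    """Morton-style key: interleave the bits of all attribute values,
    taking the most significant bit of each attribute first."""
    key = 0
    for b in reversed(range(bits)):   # high-order bit positions first
        for v in values:
            key = (key << 1) | ((v >> b) & 1)
    return key

rows = [(3, 0), (0, 3), (2, 2), (1, 1)]
print(sorted(rows, key=lambda r: peano_key(r, bits=2)))
# [(1, 1), (0, 3), (3, 0), (2, 2)]
```

Sorting by such a key groups rows whose attributes share bit prefixes, which is what makes the pure quadrants in a P-tree larger.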
[Figure: Impact of Peano order sorting. Storage requirements in P-tree nodes (0 to 60,000) for the adult, spam, mushroom, function, and crop data sets; unsorted vs. simple sorting vs. generalized Peano sorting.]
[Figure: Impact of sorting on execution speed. Time in seconds (0 to 120) for the adult, spam, mushroom, function, and crop data sets; unsorted vs. simple sorting vs. generalized Peano sorting.]
P-Tree Implementation
- Implementation in Java
- Ported to C/C++ (Amal Perera, Masum Serazi): fastest compressing P-tree implementation so far
- Array indices as pointers (details)
- Grandchild purity
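"Array indices as pointers" can be illustrated with a toy binary P-tree whose nodes live in one flat list and whose child links are integer indices rather than object references (a hypothetical sketch, not the actual implementation):

```python
def build(bits, nodes):
    """Append a node describing the bit string `bits`; return its index.
    Pure nodes store only their length; mixed nodes store child indices."""
    if all(b == 1 for b in bits):
        nodes.append(("pure1", len(bits)))
        return len(nodes) - 1
    if not any(bits):
        nodes.append(("pure0", len(bits)))
        return len(nodes) - 1
    idx = len(nodes)
    nodes.append(None)                   # reserve this node's slot
    half = len(bits) // 2
    left = build(bits[:half], nodes)
    right = build(bits[half:], nodes)
    nodes[idx] = ("mixed", left, right)  # integer indices, not references
    return idx

def root_count(nodes, idx=0):
    """Count of 1-bits under node `idx`, traversing by index."""
    node = nodes[idx]
    if node[0] == "pure1":
        return node[1]
    if node[0] == "pure0":
        return 0
    return root_count(nodes, node[1]) + root_count(nodes, node[2])

nodes = []
build([1, 1, 0, 1], nodes)
print(root_count(nodes))  # 3
```

Keeping nodes in a flat array avoids per-node object overhead and pointer chasing, which matters for a compressed structure with many small nodes.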
Kernel-Density-Based Classification
- Probability of an attribute vector x, conditional on class label value c_i
- Depends on the N training points x_t; [condition] is 1 if the condition is true, 0 otherwise
- Kernel function K(x, x_t) can be, e.g., a Gaussian function or a step function

P(\mathbf{x} \mid C = c_i) = \frac{1}{N_i} \sum_{t=1}^{N} K(\mathbf{x}, \mathbf{x}_t) \, [c_t = c_i]
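A minimal sketch of kernel-density-based classification with a Gaussian product kernel, written over raw points for clarity (the thesis evaluates the corresponding counts efficiently with P-trees rather than by looping over training data):

```python
import math

def kernel(x, xt, sigma=1.0):
    """Gaussian product kernel over all attributes of x and x_t."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xt)) / (2 * sigma ** 2))

def class_density(x, training, label):
    """Estimate P(x | C = label) as the kernel average over that class."""
    same = [xt for xt, c in training if c == label]
    return sum(kernel(x, xt) for xt in same) / len(same)

def classify(x, training):
    """Pick the class whose kernel density estimate at x is largest."""
    labels = {c for _, c in training}
    return max(labels, key=lambda c: class_density(x, training, c))

train = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
         ((3.0, 3.0), "b"), ((3.1, 2.9), "b")]
print(classify((0.1, 0.0), train))  # a
```

Replacing the Gaussian with a step function turns the sum into a neighborhood count, which is the form the interval-based P-tree evaluation exploits.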
Higher Order Basic Bit (HOBbit) Distance
- P-trees make count evaluation efficient for intervals defined by the HOBbit distance around the test sample
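The HOBbit distance between two integers is determined by the most significant bit position in which they differ, so values sharing a long high-order bit prefix are close. The sketch below follows one common formulation (the number of low-order bits that must be masked before the values agree); the thesis's exact variant, including the exponential form d_EH, may scale this differently:

```python
def hobbit_distance(x, y):
    """Number of low-order bits that must be shifted away before the
    non-negative integers x and y become equal."""
    d = 0
    while x != y:
        x, y = x >> 1, y >> 1
        d += 1
    return d

print(hobbit_distance(12, 13))  # 1: 1100 vs 1101 differ only in the lowest bit
print(hobbit_distance(12, 4))   # 4: 1100 vs 0100 differ in bit position 3
```

Because an interval of points within a given HOBbit distance of a test sample is exactly a bit-prefix match, its count can be read off the P-tree directly.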
Paper 1: Rule-Based Classification
- Goal: high accuracy on large data sets, including standard ones (UCI ML Repository)
- Neighborhood evaluated through:
  - Equality of categorical attributes
  - HOBbit interval for continuous attributes
- Curse of dimensionality: volume empty with high likelihood
- Information gain to select attributes
  - Attributes considered binary, based on the test sample ("Lazy decision trees", Friedman '96 [4])
  - Continuous data: interval around the test sample
  - Exact information gain (details)
- Pursuing multiple paths
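The attribute-selection step can be sketched as the information gain of a binary, test-sample-centered split (a generic illustration of the criterion, not the exact P-tree computation):

```python
import math

def entropy(labels):
    """Shannon entropy of a non-empty list of class labels, in bits."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(rows, labels, test):
    """Gain from splitting on whether a row satisfies `test` (a predicate),
    mirroring attributes binarized around the test sample."""
    inside = [c for r, c in zip(rows, labels) if test(r)]
    outside = [c for r, c in zip(rows, labels) if not test(r)]
    n = len(labels)
    gain = entropy(labels)
    if inside:
        gain -= (len(inside) / n) * entropy(inside)
    if outside:
        gain -= (len(outside) / n) * entropy(outside)
    return gain

rows = [1, 2, 8, 9]
labels = ["a", "a", "b", "b"]
print(info_gain(rows, labels, lambda v: v < 5))  # 1.0 (a perfect split)
```

Here the predicate plays the role of "value falls inside the HOBbit interval around the test sample"; pursuing multiple paths corresponds to keeping several high-gain predicates instead of only the best one.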
Results: Accuracy
- Comparable to C4.5, after much less development time
- 5 data sets from the UCI Machine Learning Repository (details)
- 2 additional data sets: crop, gene-function
- Improvement through multiple paths (20)

              C4.5    20 paths
adult         15.54   14.93
kr-vs-kp      0.8     0.84
mushroom      0       0
[Figure: 20 paths vs. 1 path. Relative accuracy (-2 to 6) for the adult, spam, sick-euthyroid, kr-vs-kp, gene-function, and crop data sets.]
Results: Speed
- Used on the largest UCI data sets
- Scaling of execution time as a function of training set size
[Figure: Time per test sample in milliseconds (0 to 80) as a function of the number of training points (0 to 30,000); measured execution time vs. linear interpolation.]
Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
- Goal: handling many attributes
- Naïve Bayes, where x^{(k)} is the value of the kth attribute:

  P(\mathbf{x} \mid C = c_i) = \prod_{k=1}^{M} P(x^{(k)} \mid C = c_i)

- Semi-naïve Bayes: correlated attributes are joined
  - Has been done for categorical data (Kononenko '91 [5], Pazzani '96 [6])
  - Previously: continuous data discretized
Kernel-Based Naïve Bayes
- Alternatives for continuous data:
  - Discretization
  - Distribution function: Gaussian with mean and standard deviation from the data; no alternative for the semi-naïve approach
  - Kernel density estimate (Hastie [7])
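The difference between the two continuous alternatives can be sketched in one dimension: a single Gaussian fitted by mean and standard deviation, versus a kernel density estimate with one small Gaussian per training point (illustrative toy data, bandwidth h = 0.3 chosen arbitrarily):

```python
import math

data = [1.0, 1.2, 1.1, 3.0]   # a continuous attribute within one class

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and stddev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Alternative 1: distribution function -- one Gaussian fit to the data
mu = sum(data) / len(data)
sigma = math.sqrt(sum((v - mu) ** 2 for v in data) / len(data))

# Alternative 2: kernel density estimate -- one kernel per training point
def kde(x, points, h=0.3):
    return sum(gaussian_pdf(x, p, h) for p in points) / len(points)

print(gaussian_pdf(1.1, mu, sigma))  # parametric estimate at x = 1.1
print(kde(1.1, data))                # KDE better captures the cluster near 1.1
```

On data like this, where most points cluster near 1.1 with one outlier, the kernel estimate assigns a much higher density to the cluster than the single fitted Gaussian does.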
[Figure: Kernel density estimate vs. distribution function (density 0 to 0.1) over the data points.]
Correlations
- Correlation between attributes a and b; N: number of training points t
- Kernel function for continuous data; d_EH: exponential HOBbit distance

Corr(a, b) = \frac{\frac{1}{N} \sum_{t=1}^{N} \prod_{k \in \{a, b\}} K_k(x^{(k)}, x_t^{(k)})}{\prod_{k \in \{a, b\}} \frac{1}{N} \sum_{t=1}^{N} K_k(x^{(k)}, x_t^{(k)})} - 1

K_{\mathrm{Gauss},k}(x^{(k)}, x_t^{(k)}) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{d_{EH}(x^{(k)}, x_t^{(k)})^2}{2\sigma_k^2}\right)
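The correlation measure compares the joint kernel density of two attributes with the product of their marginal densities; independence gives a value near zero. A sketch (using ordinary Euclidean distance inside the Gaussian kernel in place of the exponential HOBbit distance, for simplicity):

```python
import math

def gauss_kernel(x, xt, sigma=1.0):
    """1-D Gaussian kernel between a test value and a training value."""
    return math.exp(-(x - xt) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def corr(points_a, points_b, xa, xb):
    """Joint kernel density at (xa, xb) divided by the product of the
    marginal kernel densities, minus 1.  Zero means independence."""
    n = len(points_a)
    joint = sum(gauss_kernel(xa, ta) * gauss_kernel(xb, tb)
                for ta, tb in zip(points_a, points_b)) / n
    marg_a = sum(gauss_kernel(xa, ta) for ta in points_a) / n
    marg_b = sum(gauss_kernel(xb, tb) for tb in points_b) / n
    return joint / (marg_a * marg_b) - 1

# Two identical attributes are perfectly correlated, so the measure is
# positive at a point on their diagonal.
a = [0.0, 1.0, 2.0, 3.0]
print(corr(a, a, 1.0, 1.0) > 0)  # True
```

When the measure exceeds a threshold t, the two attributes are joined into one compound attribute before applying the naïve Bayes product.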
Results
[Figure: Decrease in error rate (-10 to 30) for the spam, crop, adult, sick-euthyroid, mushroom, gene-function, splice, waveform, and kr-vs-kp data sets.]
- P-tree Naïve Bayes: difference only for continuous data
- Semi-Naïve Bayes: 3 parameter combinations (t: threshold)
  - Blue: t = 1, 3 iterations
  - Red: t = 0.3, incl. anti-correlations
  - White: t = 0.05
[Figure: Decrease in error rate (-4 to 12) for the spam, crop, adult, sick-euthyroid, and waveform data sets.]
Paper 3: Hierarchical Clustering [10]
- Goal: understand the relationship between standard algorithms and combine the "best" aspects of three major ones
- Partition-based: relationship to k-medoids [8] demonstrated; same cluster boundary definition
- Density-based (kernel-based, DENCLUE [9]): similar cluster center definition
- Hierarchical: follows naturally from the above definitions
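The combination of the first two aspects can be illustrated in one dimension: cluster boundaries come from nearest-center assignment (as in k-medoids/k-means), while centers are updated density-style as kernel-weighted means of their members. This toy sketch only illustrates that combination, not the thesis's actual algorithm or its hierarchical step:

```python
import math

def kernel_weight(p, c, sigma=1.0):
    """Gaussian kernel weight of point p relative to center c."""
    return math.exp(-(p - c) ** 2 / (2 * sigma ** 2))

def cluster(points, centers, rounds=10):
    for _ in range(rounds):
        # Partition-based boundary: each point joins its nearest center.
        members = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            members[nearest].append(p)
        # Density-based center: kernel-weighted mean of the members.
        new_centers = []
        for c, ms in zip(centers, members):
            if ms:
                weights = [kernel_weight(p, c) for p in ms]
                new_centers.append(sum(w * p for w, p in zip(weights, ms)) / sum(weights))
        centers = new_centers
    return sorted(centers)

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
print(cluster(pts, [1.0, 4.0]))  # centers settle near 0.1 and 5.1
```

The kernel weighting pulls each center toward the local density maximum of its cluster, which is the DENCLUE-style center definition, while the assignment step keeps the k-medoids-style boundaries.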
Results: Speed
- Comparison with k-means
[Figure: Execution time (1 to 10,000, log scale) as a function of data set size (1,000 to 10,000,000) for k-means, a leaf node in the hierarchy, and an internal node.]
Results: Clustering Effectiveness
[Figure: Histogram over attribute 1 and attribute 2 (count bins 0-5 through 20-25), showing the cluster centers found by k-means and by our algorithm.]
Summary
- P-tree representation for non-spatial data; fast implementation
- Paper 1: rule-based algorithm with test-sample-centered intervals and multiple paths; competitive on "standard" (UCI) data
- Paper 2: kernel-based semi-naïve Bayes; new algorithm to handle large attribute numbers; attribute joining shown to be beneficial
- Paper 3: hierarchical clustering [10]; competitive for speed and effectiveness; hierarchical structure
Outlook
- Software engineering aspects: column-oriented design; relationship with the P-tree API
- "Non-standard" data: data with graph structure; hierarchical data, concept slices [11]
- Visualization: visualization of data on a graph
Software Engineering
- Business problems: row-based; match between database tables and objects
- Scientific / engineering problems: column-based; collective properties of interest
- Standard OO unsuitable; instead: Fortran, array-based languages (ZPL)
- Solution? Design pattern? Library?
- P-tree API
"Non-standard" Data
- Types of data:
  - Biological (KDD-Cup '02: our team received an honorable mention!)
  - Sensor networks
- Types of problems:
  - Small probability of the minority class label: A-ROC evaluation
  - Multi-valued attributes: bit-vector representation ideal for P-trees
  - Graphs: rich supply of new problems / techniques (work with Chris Besemann)
  - Hierarchical categorical attributes [11]
Visualization
- Idea: use a graph visualization tool, e.g. http://www.touchgraph.com/
- Visualize node data through glyphs
- Visualize edge data