Handling Numeric Attributes in
Hoeffding Trees
Bernhard Pfahringer, Geoff Holmes and Richard Kirkby
Handling Numeric Attributes in Hoeffding trees Pfahringer, Holmes and Kirkby - Machine Learning Group
Overview
• Hoeffding trees are excellent for classification tasks on data streams.
• Handling numeric attributes well is crucial to performance in conventional decision trees (for example, C4.5 -> C4.8)
• Does handling numeric attributes matter for streamed data?
• We implement a range of methods and empirically evaluate their accuracy and costs.
Data Streams - reminder
• Idea is that data is being provided from a continuous source:
– Examples processed one at a time (inspected once)
– Memory is limited (!)
– Model construction must scale (N log N in num examples)
– Be ready to predict at any time
• As memory is limited this will have implications for any numeric handling method you might construct
• Only consider methods that work as the tree is built
Main assumptions/limitations
• Assume a stationary concept, i.e. no concept drift or change
– may seem very limiting, but …
• Three-way trade-off:
– memory
– speed
– accuracy
• Used only artificial data sources
Hoeffding Trees
• Introduced by Domingos and Hulten (VFDT)
• “Extension” of decision trees to streams
• HT Algorithm:
– Init tree T to root node
– For each example from stream
  • Find leaf L for this example
  • Update counts in L with attr values of example and compute split function (e.g. Info Gain, IG) for each attribute
  • If IG(best attr) - IG(next best attr) > ε then split L on best attr
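The split test in the last step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the standard Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)), takes the range R of information gain to be log2 of the number of classes, and uses hypothetical names (`hoeffding_bound`, `should_split`) and an illustrative confidence parameter `delta`.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(gains, n_classes: int, delta: float, n: int) -> bool:
    """Decide whether a leaf that has seen n examples should split.

    gains: information gain of each candidate attribute at this leaf.
    Assumption: the range of information gain is log2(n_classes).
    """
    if len(gains) < 2:
        return False
    ranked = sorted(gains, reverse=True)
    epsilon = hoeffding_bound(math.log2(n_classes), delta, n)
    # Split only when the best attribute beats the runner-up by more than epsilon.
    return ranked[0] - ranked[1] > epsilon


# Usage: after 1000 examples, a gap of 0.20 is decisive, a gap of 0.05 is not.
print(should_split([0.30, 0.10], n_classes=2, delta=1e-7, n=1000))
print(should_split([0.30, 0.25], n_classes=2, delta=1e-7, n=1000))
```

Because ε shrinks as n grows, a leaf with two close attributes simply waits for more examples before committing to a split.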
Active leaf data structure
• For each class value:
– for each nominal attribute:
  • for each possible value:
    – keep sum of counts/weights
– for each numeric attribute:
  • keep sufficient stats to approximate the distribution
  • various possibilities: here assume normal distribution so estimate/record: n, mean, variance, + min/max
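As a rough sketch of the per-class, per-numeric-attribute statistics listed above, the following keeps n, mean, variance, min and max with a single pass over the data. The class name `GaussianStats` and the use of a Welford-style weighted update are assumptions made for illustration; the slides do not say how the statistics are maintained.

```python
import math
from dataclasses import dataclass


@dataclass
class GaussianStats:
    """Sufficient statistics for one numeric attribute and one class."""
    n: float = 0.0                 # total weight seen
    mean: float = 0.0
    m2: float = 0.0                # weighted sum of squared deviations
    min_value: float = math.inf
    max_value: float = -math.inf

    def update(self, x: float, weight: float = 1.0) -> None:
        # Welford-style incremental update (assumed, not from the slides).
        self.n += weight
        delta = x - self.mean
        self.mean += weight * delta / self.n
        self.m2 += weight * delta * (x - self.mean)
        self.min_value = min(self.min_value, x)
        self.max_value = max(self.max_value, x)

    @property
    def variance(self) -> float:
        # Sample variance; 0.0 until at least two examples have been seen.
        return self.m2 / (self.n - 1.0) if self.n > 1.0 else 0.0


# Usage: one estimator per (class, numeric attribute) pair in an active leaf.
stats = GaussianStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
print(stats.n, stats.mean, stats.variance, stats.min_value, stats.max_value)
```

The memory cost is constant per class and attribute, regardless of how many examples the leaf has seen.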
Numeric Handling Methods
• VFDT (VFML – Hulten & Domingos, 2003)
– Summarize the numeric distribution with a histogram made up of a maximum number of bins N (default 1000)
– Bin boundaries determined by first N unique values seen in the stream.
– Issues: method sensitive to data order and choosing a good N for a particular problem
• Exhaustive Binary Tree (BINTREE – Gama et al, 2003)
– Closest implementation of a batch method
– Incrementally update a binary tree as data is observed
– Issues: high memory cost, high cost of split search, data order
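To make the data-order issue of the VFML-style histogram concrete, here is a simplified sketch. It is not the VFML implementation: the class name `FirstNBinsHistogram` is hypothetical, and the rule for assigning later values (to the bin with the largest boundary not exceeding the value, or the first bin otherwise) is an assumption chosen for illustration.

```python
import bisect
from collections import defaultdict


class FirstNBinsHistogram:
    """Histogram whose bins are fixed by the first N unique values seen."""

    def __init__(self, max_bins: int = 10):
        self.max_bins = max_bins
        self.boundaries = []   # sorted unique values that opened a bin
        self.counts = []       # parallel list: class label -> weight

    def update(self, x: float, label, weight: float = 1.0) -> None:
        i = bisect.bisect_left(self.boundaries, x)
        is_new = i == len(self.boundaries) or self.boundaries[i] != x
        if is_new and len(self.boundaries) < self.max_bins:
            # Still room: this unseen value opens a new bin.
            self.boundaries.insert(i, x)
            self.counts.insert(i, defaultdict(float))
        elif is_new:
            # Bins are frozen: fall into the bin just below x (assumed rule).
            i = max(i - 1, 0)
        self.counts[i][label] += weight


# Usage: with N = 3 the boundaries depend entirely on which values arrive first.
hist = FirstNBinsHistogram(max_bins=3)
for value, label in [(5, "a"), (1, "b"), (9, "a"), (6, "b"), (0, "a"), (9, "b")]:
    hist.update(value, label)
print(hist.boundaries)
```

Feeding the same values in a different order would freeze a different set of boundaries, which is the sensitivity the slide points out.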
Numeric Handling Methods
• Quantile Summaries (GK – Greenwald and Khanna, 2001)
– Motivation comes from VLDB
– Maintain sample of values (quantiles) plus range of possible ranks that the samples can take (tuples)
– Extremely space efficient
– Issues: use max number of tuples per summary
Numeric Handling Methods
• Gaussian Approximation (GAUSS)
– Assume values conform to Normal Distribution
– Maintain five numbers (e.g. mean, variance, weight, max, min)
– Note: not sensitive to data order
– Incrementally updateable
– Using the max, min information per class, split the range into N equal parts
– For each part use the 5 numbers per class to compute the approx class distribution
  • Use the above to compute the IG of that split
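The GAUSS split search can be sketched roughly as below. This is an illustrative reading of the slide, not the authors' implementation: it assumes the candidate thresholds are N evenly spaced points between the overall min and max, uses the normal CDF to estimate how much of each class falls below a threshold, and uses hypothetical names (`normal_cdf`, `entropy`, `best_gaussian_split`).

```python
import math


def normal_cdf(x: float, mean: float, variance: float) -> float:
    """P(X <= x) for a normal distribution; degenerate if variance is 0."""
    if variance <= 0.0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


def entropy(weights) -> float:
    total = sum(weights)
    if total <= 0.0:
        return 0.0
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0.0)


def best_gaussian_split(class_stats, n_points: int = 10):
    """class_stats: label -> (weight, mean, variance, min, max).

    Returns (information gain, threshold) of the best candidate split,
    or (0.0, None) if no candidate improves on the unsplit leaf.
    """
    stats = list(class_stats.values())
    lo = min(s[3] for s in stats)
    hi = max(s[4] for s in stats)
    totals = [s[0] for s in stats]
    n = sum(totals)
    base = entropy(totals)
    best = (0.0, None)
    for i in range(1, n_points + 1):
        threshold = lo + (hi - lo) * i / (n_points + 1)
        # Estimated class weights on each side of the threshold.
        left = [s[0] * normal_cdf(threshold, s[1], s[2]) for s in stats]
        right = [w - l for w, l in zip(totals, left)]
        split_entropy = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
        gain = base - split_entropy
        if gain > best[0]:
            best = (gain, threshold)
    return best


# Usage: two well-separated classes should yield a split between their means.
example = {"a": (100.0, 0.0, 1.0, -3.0, 3.0), "b": (100.0, 10.0, 1.0, 7.0, 13.0)}
print(best_gaussian_split(example, n_points=10))
```

Only the five numbers per class are consulted, so the cost of the search depends on N and the number of classes, not on the number of examples seen.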
[Figures: Gaussian approximation for the 2 class, 3 class and 4 class problems]
Empirical Evaluation
• Use each numeric handling method (8 in total) to build a Hoeffding Tree (HTMC)
• Vary parameters of some methods (VFML10,100,1000; BT; GK100,1000; GAUSS10,100)
• Train models for 10 hours – then test on one million (holdout) examples
• Define three application scenarios
– Sensor network (100K memory limit)
– Handheld (32MB)
– Server (400MB)
Data generators
• Random tree (Domingos & Hulten):
– (RTS) 10 num, 10 nom 5 values, 2 classes, leaves start at level 3, max level 5, plus version with 10% noise added (RTSN)
– (RTC) 50 num, 50 nom 5 values, 2 classes, leaves start at level 5, max level 10, plus version with 10% noise added (RTCN)
• Random RBF (Kirkby):
– (RRBFS) 10 num, 100 centers, 2 classes
– (RRBFC) 50 num, 1000 centers, 2 classes
• Waveform (Aha):
– (Wave21): 21 noisy num, (Wave40): +19 irrelevant num; 3 classes
• (GenF1-GenF10) (Agrawal et al.):
– hypothetical loan applications, 10 different rule(s) over 6 num + 3 nom attrs, 5% noise, 2 classes
Tree Measurements
• Accuracy (% correct)
• Number of training examples processed in 10 hours (in millions)
• Number of active leaves (in hundreds)
• Number of inactive leaves (in hundreds)
• Total nodes (in hundreds)
• Tree depth
• Training speed (% of generation speed)
• Prediction speed (% of generation speed)
Sensor Network (100K memory limit)
Method     % correct   Train (million)   Active leaves (hdrds)   Inactive leaves (hdrds)   Total nodes (hdrds)   Avg tree depth   Train spd %   Pred spd %
VFML10     87.7        21                0                       8.13                      10.6                  11               70            82
VFML100    79.47       13                0                       3.65                      4.5                   7                76            85
VFML1000   76.06       1                 0                       0.09                      0.14                  3                81            88
BT         74.45       1                 0                       0.07                      0.11                  3                75            89
GK100      82.92       12                0                       4.03                      5.03                  8                71            84
GK1000     74.65       1                 0                       0.08                      0.13                  3                60            88
GAUSS10    86.16       20                0                       8.87                      12.1                  12               68            81
GAUSS100   85.33       16                0.01                    8.08                      11.7                  20               64            79
Handheld Environment (32MB memory limit)
Method     % correct   Train (million)   Active leaves (hdrds)   Inactive leaves (hdrds)   Total nodes (hdrds)   Avg tree depth   Train spd %   Pred spd %
VFML10     91.53       909               31.8                    675                       1009                  22               16            72
VFML100    90.97       973               5.99                    481                       704                   24               17            73
VFML1000   90.97       951               4.22                    412                       604                   27               17            73
BT         90.48       808               3.68                    373                       540                   22               15            73
GK100      89.96       961               6.89                    530                       777                   34               17            73
GK1000     90.94       937               2.66                    403                       581                   27               16            75
GAUSS10    91.35       874               93.7                    683                       1166                  24               15            69
GAUSS100   90.91       853               92.6                    639                       1167                  50               14            69
Server Environment (400MB memory limit)
Method     % correct   Train (million)   Active leaves (hdrds)   Inactive leaves (hdrds)   Total nodes (hdrds)   Avg tree depth   Train spd %   Pred spd %
VFML10     91.41       293               320                     80.4                      591                   24               4             74
VFML100    91.19       142               73.9                    143                       316                   23               4             75
VFML1000   91.12       108               19                      127                       206                   22               3             79
BT         90.50       60                13.7                    92.9                      147                   19               2             81
GK100      89.88       158               84                      145                       346                   32               4             75
GK1000     91.03       91                17.6                    122                       197                   21               3             80
GAUSS10    91.21       518               540                     26.8                      891                   28               6             73
GAUSS100   90.75       538               566                     38.7                      998                   63               6             66
Overall results - comments
• VFML10 is superior on average in all environments, followed closely by GAUSS10
• GK methods are generally competitive
• BINTREE is only competitive in a server setting
• Default setting of 1000 for VFML is a poor choice
• Crude binning provides more space which leads to faster growth and better trees (more room to grow)
• Higher values for GAUSS lead to very deep trees (in excess of the # of attributes), suggesting repeated splitting (too fine grained)
Remarks – sensor network environment
• Number of training examples low because learning stops when last active leaf is deactivated (mem mgmt freezes nodes – low # examples, low probability of splitting)
• Most accurate methods VFML10, GAUSS10
Remarks – Handheld Environment
• Generates smaller trees (than server) and can therefore process more examples
Remarks – Server Environment
VFML10 vs GAUSS10 – Closer Analysis
• Recall VFML10 is superior on average
• Sensor (avg 87.7 vs 86.2)
– GAUSS10 superior on 10
– VFML10 superior on 6 (2 no difference)
• Handheld (avg 91.5 vs 91.4)
– GAUSS10 superior on 4
– VFML10 superior on 8 (6 no difference)
• Server (avg 91.4 vs 91.2)
– GAUSS10 superior on 6
– VFML10 superior on 6 (6 no difference)
Data order
Conclusion
• We have presented a method for handling numeric attributes in data streams that performs well in empirical studies
• The methods employing the most approximation were superior – they allow greater growth when memory is limited.
• On a dataset by dataset analysis there is not much to choose between VFML10 and GAUSS10
• Gains made in handling numeric variables come at a cost in terms of training and prediction speed – the cost is high in some environments
All algorithms available
• https://sourceforge.net/projects/moa-datastream
• All methods and an environment for experimental evaluation of data streams are available from the above URL; the system is called Massive Online Analysis (MOA)