Learning with Trees
Rob Nowak
University of Wisconsin-Madison
Collaborators: Rui Castro, Clay Scott, Rebecca Willett
Artwork: Piet Mondrian
www.ece.wisc.edu/~nowak
Basic Problem: Partitioning
Many problems in statistical learning theory boil down to finding a good partition
[Figure: a function and the partition it induces]
Classification
Learning and Classification: build a decision rule based on labeled training data
Labeled training features
Classification rule: partition of feature space
MRI data: brain aneurysm
Signal and Image Processing
Recover complex geometrical structure from noisy data
Extracted vascular network
Partitioning Schemes
Support Vector Machine
image partitions
Why Trees?
CART: Breiman, Friedman, Olshen, and Stone, 1984, Classification and Regression Trees
C4.5: Quinlan 1993, C4.5: Programs for Machine Learning
• Simplicity of design
• Interpretability
• Ease of implementation
• Good performance in practice
Trees are one of the most popular and widely used machine learning / data analysis tools
JPEG 2000: Image compression standard, 2000 http://www.jpeg.org/jpeg2000/
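Not from the slides: a minimal scikit-learn sketch of a CART-style classification tree, to make the bullets above concrete. The dataset, depth, and pruning level (ccp_alpha, scikit-learn's cost-complexity pruning) are illustrative choices only.

# Illustrative CART-style decision tree (scikit-learn); parameters are made up.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ccp_alpha applies cost-complexity pruning, the CART analogue of the
# complexity regularization discussed later in these slides.
clf = DecisionTreeClassifier(max_depth=8, ccp_alpha=0.01, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("number of leaves:", clf.get_n_leaves())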
Example: Gamma-Ray Burst Analysis
One burst (tens of seconds) emits as much energy as our entire Milky Way does in one hundred years!
Compton Gamma-Ray Observatory, Burst and Transient Source Experiment (BATSE)
[Figure: photon counts vs. time, showing the burst and the x-ray “afterglow”]
Trees and Partitions
coarse partition, fine partition
Estimation using Pruned Tree
Each leaf corresponds to a sample f(t_i), i = 0, …, N-1. A piecewise constant fit to the data on each piece of the partition provides a good estimate.
piecewise linear fit on each cell
piecewise polynomial fit on each cell
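A minimal sketch (not the authors' code) of the idea on this slide: piecewise constant fits obtained by averaging noisy samples over the cells of a coarse versus a fine dyadic partition. The signal and noise level are made up.

# Piecewise constant estimation on dyadic partitions of a noisy 1-D signal.
import numpy as np

rng = np.random.default_rng(0)
N = 1024                                   # N = 2^10 samples
t = np.linspace(0, 1, N)
f = np.where(t < 0.3, 1.0, np.where(t < 0.6, 4.0, 0.5))   # piecewise constant truth
y = f + rng.normal(scale=0.8, size=N)      # noisy observations

def piecewise_constant_fit(y, n_cells):
    """Average y over n_cells equal-width cells (one constant per leaf)."""
    cells = y.reshape(n_cells, -1)
    return np.repeat(cells.mean(axis=1), y.size // n_cells)

coarse = piecewise_constant_fit(y, 8)      # coarse partition: low variance, high bias
fine = piecewise_constant_fit(y, 256)      # fine partition: low bias, high variance
for name, est in [("coarse (8 cells)", coarse), ("fine (256 cells)", fine)]:
    print(name, "MSE vs. truth:", np.mean((est - f) ** 2))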
Gamma-Ray Burst 845
Recursive Partitions
Adapted Partition
Image Denoising
Decision (Classification) Trees
Panels: labeled training data, Bayes decision boundary, complete partition, pruned partition
decision tree: majority vote at each leaf
Classification
Panels: ideal classifier, adapted partition, histogram
256 cells in each partition
Image Partitions
1024 cells in each partition
Image Coding
JPEG at 0.125 bpp (non-adaptive partitioning) vs. JPEG 2000 at 0.125 bpp (adaptive partitioning)
Probabilistic Framework
Prediction Problem
Challenge
Empirical Risk
Empirical Risk Minimization
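The formulas for these slides did not survive the transcript; the standard setup they presumably follow is sketched below (notation assumed, not taken from the slides).

R(f) = \mathbb{E}\left[\ell(f(X), Y)\right]  (true risk; \ell is the 0/1 loss for classification, squared error for regression)
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i)  (empirical risk on training data (X_1, Y_1), \dots, (X_n, Y_n), i.i.d. from an unknown distribution)
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)  (empirical risk minimization over a model class \mathcal{F})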
Classification and Regression Trees
[Figure: tree-structured partition with 0/1 class labels in each cell]
Empirical Risk Minimization on Trees
Overfitting Problem
[Figure labels: stable / variable, crude / accurate]
Bias/Variance Trade-off
fine partition: small bias, large variance
coarse partition: large bias, small variance
Estimation and Approximation Error
Estimation Error in Regression
Estimation Error in Classification
Partition Complexity and Overfitting
[Figure: risk, empirical risk, and variance vs. number of leaves; beyond a certain tree size the empirical risk can no longer be trusted (overfitting to data)]
Controlling Overfitting
Complexity Regularization
Per-Cell Variance Bounds: Regression
Per-Cell Variance Bounds: Classification
Variance Bounds
A Slightly Weaker Variance Bound
Complexity Regularization
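The selection criterion itself is missing from the transcript; complexity regularization generically takes the penalized form below (a hedged sketch; the per-leaf penalty comes from the variance bounds above).

\hat{T}_n = \arg\min_{T} \left\{ \hat{R}_n(T) + \mathrm{pen}(T) \right\},
\qquad
\mathrm{pen}(T) = \sum_{A \in \mathrm{leaves}(T)} \mathrm{pen}(A).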
“small” leaves contribute very little to the penalty
Example: Image Denoising
This is a special case of “wavelet denoising” using the Haar wavelet basis
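A minimal numpy sketch of that special case, assuming a signal of dyadic length: orthonormal Haar transform followed by hard thresholding of the detail coefficients. The universal threshold used here is one common, illustrative choice.

# Haar wavelet denoising by hard thresholding (signal length must be 2^J).
import numpy as np

def haar_forward(x):
    """Full orthonormal Haar decomposition of a length-2^J signal."""
    coeffs, approx = [], x.astype(float)
    while approx.size > 1:
        pairs = approx.reshape(-1, 2)
        coeffs.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))   # detail
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)         # approximation
    return approx, coeffs

def haar_inverse(approx, coeffs):
    x = approx
    for detail in reversed(coeffs):
        pairs = np.empty((x.size, 2))
        pairs[:, 0] = (x + detail) / np.sqrt(2)
        pairs[:, 1] = (x - detail) / np.sqrt(2)
        x = pairs.reshape(-1)
    return x

def haar_denoise(y, sigma):
    approx, coeffs = haar_forward(y)
    thresh = sigma * np.sqrt(2 * np.log(y.size))            # universal threshold
    coeffs = [d * (np.abs(d) > thresh) for d in coeffs]     # hard thresholding
    return haar_inverse(approx, coeffs)

# Tiny demo on a made-up piecewise constant signal.
rng = np.random.default_rng(0)
f = np.repeat([1.0, 4.0, 0.5, 2.0], 256)                    # truth, N = 1024
y = f + rng.normal(scale=0.8, size=f.size)
print("MSE after denoising:", np.mean((haar_denoise(y, 0.8) - f) ** 2))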
Theory of Complexity Regularization
Classification
[Figure labels: eyes, mustache]
Probabilistic Framework
Learning from Data
Approximation and Estimation
Approximation
Model selection
BIAS
VARIANCE
Classifier Approximations
Approximation Error
Symmetric difference set
Error
Approximation Error
boundary smoothness
risk functional (transition) smoothness
Boundary Smoothness
Transition Smoothness
Fundamental Limit to Learning
Mammen & Tsybakov (1999)
Related Work
Box-Counting Class
Box-Counting Sub-Classes
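The definition on the slide is not in the transcript; a standard formulation of the box-counting condition (roughly as used in this line of work) is:

G \subseteq [0,1]^d \text{ is box-counting if, for some constant } C \text{ and every integer } m, \text{ the boundary } \partial G \text{ intersects at most } C\, m^{d-1} \text{ of the } m^d \text{ cells of the regular partition of } [0,1]^d \text{ into cubes of side } 1/m.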
Dyadic Decision Trees
Panels: labeled training data, Bayes decision boundary, complete RDP, pruned RDP
Dyadic decision tree: majority vote at each leaf
Joint work with Clay Scott, 2004
Dyadic Decision Trees
The Classifier Learning Problem
Problem:
Training Data:
Model Class:
Empirical Risk
Chernoff’s Bound
actual risk is probably not much larger than empirical risk
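The displayed inequality is missing from the transcript; for a single fixed tree T with 0/1 loss it is presumably the standard Chernoff/Hoeffding bound:

\Pr\left( R(T) \ge \hat{R}_n(T) + \epsilon \right) \le e^{-2 n \epsilon^2},
\qquad\text{equivalently, with probability at least } 1 - \delta,\qquad
R(T) \le \hat{R}_n(T) + \sqrt{\frac{\log(1/\delta)}{2n}}.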
Error Deviation Bounds
Uniform Deviation Bound
Setting Penalties
prefix codes for trees:
[Figure: tree with nodes labeled by bits 0 and 1]
code: 0001001111 + 6 bits for leaf labels
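The slide's exact bit convention is only partly recoverable, so the following sketch assumes one standard convention: a preorder traversal emitting 0 for an internal node and 1 for a leaf, followed by one class bit per leaf. The code length is what matters here, since prefix code lengths can be turned into valid penalties via Kraft's inequality.

# Illustrative prefix code for a binary classification tree (convention assumed).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: Optional[int] = None        # 0/1 class label (leaves only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def encode(tree: Node) -> str:
    """Structure bits (0 = internal, 1 = leaf) followed by one label bit per leaf."""
    structure, labels = [], []
    def visit(node: Node) -> None:
        if node.is_leaf:
            structure.append("1")
            labels.append(str(node.label))
        else:
            structure.append("0")
            visit(node.left)
            visit(node.right)
    visit(tree)
    return "".join(structure) + "".join(labels)

# Example: a tree with 3 leaves -> 5 structure bits + 3 label bits.
t = Node(left=Node(label=0),
         right=Node(left=Node(label=1), right=Node(label=0)))
print(encode(t))   # "01011" + "010" = "01011010"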
Uniform Deviation Bound
Decision Tree Selection
Compare with:
Oracle Bound:
Approximation Error (Bias)
Estimation Error (Variance)
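The bound did not survive extraction; penalized selection of this kind typically satisfies an oracle inequality roughly of the following form (a hedged sketch, constants and lower-order terms omitted):

\mathbb{E}\left[ R(\hat{T}_n) \right] - R^* \;\lesssim\; \min_{T} \left\{ \left( R(T) - R^* \right) + \mathrm{pen}_n(T) \right\},

i.e. the selected tree trades off approximation error (bias) against the estimation-error penalty (variance) nearly as well as an oracle with knowledge of the true distribution.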
Rate of Convergence
BUT…
Why too slow?
same number of leaves
Balanced vs. Unbalanced Trees
all |T|-leaf trees are equally favored
Spatial Adaptation
local empirical error
local error
Relative Chernoff Bound
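The formula is again missing; the relative (multiplicative) Chernoff bound for the empirical frequency \hat{p} of n i.i.d. Bernoulli(p) trials reads:

\Pr\left( \hat{p} \le (1 - \gamma)\, p \right) \le e^{-n p \gamma^2 / 2},
\qquad\text{so with probability at least } 1 - \delta,\qquad
p \le \hat{p} + \sqrt{\frac{2\, p \log(1/\delta)}{n}}.

The deviation scales with \sqrt{p}: in cells where the error probability is small the deviation is proportionally small, which is what allows the spatially adaptive leaf penalties that follow.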
Designing Leaf Penalties
prefix code construction:
01 = “right branch”, 11 = “terminate”, 0/1 = “label”
[Figure: tree branches labeled 00 and 01]
code: 010001110
Uniform Deviation Bound
Compare with:
Spatial Adaptivity
Key: local complexity is offset by small volumes!
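The penalty formula is missing from the transcript; as a hedged sketch (constants omitted), spatially adaptive penalties in this setting are roughly of the form

\mathrm{pen}(T) \;\asymp\; \sum_{A \in \mathrm{leaves}(T)} \sqrt{\frac{p(A)\left( d_A \log 2 + \log(1/\delta) \right)}{n}},

where p(A) is the probability (or volume) of cell A and d_A its depth: deep leaves carry more code bits, but this local complexity is offset by their small volumes.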
Bound Comparison for Unbalanced Tree
J leaves, depth J-1
Non-adaptive bound:
Adaptive bound:
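A hedged back-of-the-envelope version of the comparison (constants and log factors suppressed), for the maximally unbalanced tree with J leaves at depths 1, 2, \dots, J-1 and cell volumes p(A) \approx 2^{-d_A}:

\text{non-adaptive:}\quad \mathrm{pen}(T) \asymp \sqrt{\frac{J}{n}},
\qquad
\text{adaptive:}\quad \mathrm{pen}(T) \asymp \sum_{j=1}^{J-1} \sqrt{\frac{j\, 2^{-j}}{n}} = O\!\left( \sqrt{\frac{1}{n}} \right),

since \sum_{j \ge 1} \sqrt{j\, 2^{-j}} converges: the adaptive bound does not grow with the number of leaves, while the non-adaptive one does.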
same number of leaves
Balanced vs. Unbalanced Trees
Decision Tree Selection
Oracle Bound:
Approximation Error
Estimation Error
Rate of Convergence
Computable Penalty
achieves the same rate of convergence
Adapting to Dimension - Feature Rejection
Adapting to Dimension - Data Manifold
Cyclic DDT: force coordinate splits in cyclic order
Free-Split DDT: no order enforcement in splits
Computational Issues: additive penalty
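A minimal sketch (not the authors' implementation) of why an additive penalty matters computationally: the penalized empirical risk decouples across subtrees, so the optimal pruning is found exactly by a single bottom-up pass over the tree. The node fields and penalty values here are illustrative.

# Exact pruning under an additive (per-leaf) penalty via bottom-up dynamic programming.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    err: float                        # empirical error if this node is made a leaf
    pen: float                        # per-leaf penalty contribution
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    keep_split: bool = False          # filled in by prune()

def prune(node: Node) -> float:
    """Return min over prunings of (empirical error + additive penalty) for this subtree."""
    leaf_cost = node.err + node.pen
    if node.left is None or node.right is None:
        return leaf_cost
    split_cost = prune(node.left) + prune(node.right)
    node.keep_split = split_cost < leaf_cost
    return min(leaf_cost, split_cost)   # one pass: time linear in the number of nodes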
DDTs in Action
Comparison to State-of-the-Art
Best results: (1) AdaBoost with RBF-Network, (2) Kernel Fisher Discriminant, (3) SVM with RBF-Kernel
ODCT = DDT + cross-validation
Elevation Map: St. Louis
Noisy data
Level set
Panels: thresholded data, spatially adaptive penalty, penalty proportional to |T|
Application to Level Set Estimation
Conclusions and Future Work
Open Problem:
More Info: www.ece.wisc.edu/~nowak
www.ece.wisc.edu/~nowak/ece901