Learning with Trees
Rob Nowak
University of Wisconsin-Madison
Collaborators: Rui Castro, Clay Scott, Rebecca Willett
Artwork: Piet Mondrian
www.ece.wisc.edu/~nowak
Basic Problem: Partitioning
Many problems in statistical learning theory boil down to finding a good partition
[Figure: a function and the partition it induces]
Classification
Learning and Classification: build a decision rule based on labeled training data
Labeled training features
Classification rule: partition of feature space
MRI data: brain aneurysm
Signal and Image Processing
Recover complex geometrical structure from noisy data
Extracted vascular network
Partitioning Schemes
Support Vector Machine
image partitions
Why Trees?
CART: Breiman, Friedman, Olshen, and Stone, 1984, Classification and Regression Trees
C4.5: Quinlan 1993, C4.5: Programs for Machine Learning
• Simplicity of design
• Interpretability
• Ease of implementation
• Good performance in practice
Trees are one of the most popular and widely used machine learning / data analysis tools
JPEG 2000: Image compression standard, 2000 http://www.jpeg.org/jpeg2000/
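Not from the slides: a minimal scikit-learn sketch of a CART-style classification tree, to make the bullets above concrete. The dataset, depth, and pruning level (ccp_alpha, scikit-learn's cost-complexity pruning) are illustrative choices only.

# Illustrative CART-style decision tree (scikit-learn); parameters are made up.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ccp_alpha applies cost-complexity pruning, the CART analogue of the
# complexity regularization discussed later in these slides.
clf = DecisionTreeClassifier(max_depth=8, ccp_alpha=0.01, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("number of leaves:", clf.get_n_leaves())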
Example: Gamma-Ray Burst Analysis
One burst (tens of seconds) emits as much energy as our entire Milky Way does in one hundred years!
Compton Gamma-Ray Observatory, Burst and Transient Source Experiment (BATSE)
[Figure: photon counts vs. time, showing the burst and the x-ray “afterglow”]
Trees and Partitions
coarse partition, fine partition
Estimation using Pruned Tree
Each leaf corresponds to a sample f(t_i), i = 0, …, N-1. A piecewise constant fit to the data on each piece of the partition provides a good estimate.
piecewise linear fit on each cell
piecewise polynomial fit on each cell
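A minimal sketch (not the authors' code) of the idea on this slide: piecewise constant fits obtained by averaging noisy samples over the cells of a coarse versus a fine dyadic partition. The signal and noise level are made up.

# Piecewise constant estimation on dyadic partitions of a noisy 1-D signal.
import numpy as np

rng = np.random.default_rng(0)
N = 1024                                   # N = 2^10 samples
t = np.linspace(0, 1, N)
f = np.where(t < 0.3, 1.0, np.where(t < 0.6, 4.0, 0.5))   # piecewise constant truth
y = f + rng.normal(scale=0.8, size=N)      # noisy observations

def piecewise_constant_fit(y, n_cells):
    """Average y over n_cells equal-width cells (one constant per leaf)."""
    cells = y.reshape(n_cells, -1)
    return np.repeat(cells.mean(axis=1), y.size // n_cells)

coarse = piecewise_constant_fit(y, 8)      # coarse partition: low variance, high bias
fine = piecewise_constant_fit(y, 256)      # fine partition: low bias, high variance
for name, est in [("coarse (8 cells)", coarse), ("fine (256 cells)", fine)]:
    print(name, "MSE vs. truth:", np.mean((est - f) ** 2))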
Gamma-Ray Burst 845
Recursive Partitions
Adapted Partition
Image Denoising
Decision (Classification) Trees
Panels: labeled training data, Bayes decision boundary, complete partition, pruned partition
decision tree: majority vote at each leaf
Classification
Panels: ideal classifier, adapted partition, histogram
256 cells in each partition
Image Partitions
1024 cells in each partition
Image Coding
JPEG at 0.125 bpp (non-adaptive partitioning) vs. JPEG 2000 at 0.125 bpp (adaptive partitioning)
Probabilistic Framework
Prediction Problem
Challenge
Empirical Risk
Empirical Risk Minimization
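The formulas for these slides did not survive the transcript; the standard setup they presumably follow is sketched below (notation assumed, not taken from the slides).

R(f) = \mathbb{E}\left[\ell(f(X), Y)\right]  (true risk; \ell is the 0/1 loss for classification, squared error for regression)
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i)  (empirical risk on training data (X_1, Y_1), \dots, (X_n, Y_n), i.i.d. from an unknown distribution)
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)  (empirical risk minimization over a model class \mathcal{F})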
Classification and Regression Trees
[Figure: tree-structured partition with 0/1 class labels in each cell]
Empirical Risk Minimization on Trees
Overfitting Problem
[Figure labels: stable / variable, crude / accurate]
Bias/Variance Trade-off
fine partition: small bias, large variance
coarse partition: large bias, small variance
Estimation and Approximation Error
Estimation Error in Regression
Estimation Error in Classification
Partition Complexity and Overfitting
[Figure: risk, empirical risk, and variance vs. number of leaves; beyond a certain tree size the empirical risk can no longer be trusted (overfitting to data)]
Controlling Overfitting
Complexity Regularization
Per-Cell Variance Bounds: Regression
Per-Cell Variance Bounds: Classification
Variance Bounds
A Slightly Weaker Variance Bound
Complexity Regularization
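The selection criterion itself is missing from the transcript; complexity regularization generically takes the penalized form below (a hedged sketch; the per-leaf penalty comes from the variance bounds above).

\hat{T}_n = \arg\min_{T} \left\{ \hat{R}_n(T) + \mathrm{pen}(T) \right\},
\qquad
\mathrm{pen}(T) = \sum_{A \in \mathrm{leaves}(T)} \mathrm{pen}(A).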
“small” leaves contribute very little to the penalty
Example: Image Denoising
This is a special case of “wavelet denoising” using the Haar wavelet basis
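A minimal numpy sketch of that special case, assuming a signal of dyadic length: orthonormal Haar transform followed by hard thresholding of the detail coefficients. The universal threshold used here is one common, illustrative choice.

# Haar wavelet denoising by hard thresholding (signal length must be 2^J).
import numpy as np

def haar_forward(x):
    """Full orthonormal Haar decomposition of a length-2^J signal."""
    coeffs, approx = [], x.astype(float)
    while approx.size > 1:
        pairs = approx.reshape(-1, 2)
        coeffs.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))   # detail
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)         # approximation
    return approx, coeffs

def haar_inverse(approx, coeffs):
    x = approx
    for detail in reversed(coeffs):
        pairs = np.empty((x.size, 2))
        pairs[:, 0] = (x + detail) / np.sqrt(2)
        pairs[:, 1] = (x - detail) / np.sqrt(2)
        x = pairs.reshape(-1)
    return x

def haar_denoise(y, sigma):
    approx, coeffs = haar_forward(y)
    thresh = sigma * np.sqrt(2 * np.log(y.size))            # universal threshold
    coeffs = [d * (np.abs(d) > thresh) for d in coeffs]     # hard thresholding
    return haar_inverse(approx, coeffs)

# Tiny demo on a made-up piecewise constant signal.
rng = np.random.default_rng(0)
f = np.repeat([1.0, 4.0, 0.5, 2.0], 256)                    # truth, N = 1024
y = f + rng.normal(scale=0.8, size=f.size)
print("MSE after denoising:", np.mean((haar_denoise(y, 0.8) - f) ** 2))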
Theory of Complexity Regularization
Classification
[Figure labels: eyes, mustache]
Probabilistic Framework
Learning from Data
Approximation and Estimation
Approximation
Model selection
BIAS
VARIANCE
Classifier Approximations
Approximation Error
Symmetric difference set
Error
Approximation Error
boundary smoothness
risk functional (transition) smoothness
Boundary Smoothness
Transition Smoothness
Fundamental Limit to Learning
Mammen & Tsybakov (1999)
Related Work
Box-Counting Class
Box-Counting Sub-Classes
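The definition on the slide is not in the transcript; a standard formulation of the box-counting condition (roughly as used in this line of work) is:

G \subseteq [0,1]^d \text{ is box-counting if, for some constant } C \text{ and every integer } m, \text{ the boundary } \partial G \text{ intersects at most } C\, m^{d-1} \text{ of the } m^d \text{ cells of the regular partition of } [0,1]^d \text{ into cubes of side } 1/m.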
Dyadic Decision Trees
Panels: labeled training data, Bayes decision boundary, complete RDP, pruned RDP
Dyadic decision tree: majority vote at each leaf
Joint work with Clay Scott, 2004
Dyadic Decision Trees
The Classifier Learning Problem
Problem:
Training Data:
Model Class:
Empirical Risk
Chernoff’s Bound
actual risk is probably not much larger than empirical risk
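The displayed inequality is missing from the transcript; for a single fixed tree T with 0/1 loss it is presumably the standard Chernoff/Hoeffding bound:

\Pr\left( R(T) \ge \hat{R}_n(T) + \epsilon \right) \le e^{-2 n \epsilon^2},
\qquad\text{equivalently, with probability at least } 1 - \delta,\qquad
R(T) \le \hat{R}_n(T) + \sqrt{\frac{\log(1/\delta)}{2n}}.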
Error Deviation Bounds
Uniform Deviation Bound
Setting Penalties
prefix codes for trees:
[Figure: tree with nodes labeled by bits 0 and 1]
code: 0001001111 + 6 bits for leaf labels
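The slide's exact bit convention is only partly recoverable, so the following sketch assumes one standard convention: a preorder traversal emitting 0 for an internal node and 1 for a leaf, followed by one class bit per leaf. The code length is what matters here, since prefix code lengths can be turned into valid penalties via Kraft's inequality.

# Illustrative prefix code for a binary classification tree (convention assumed).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: Optional[int] = None        # 0/1 class label (leaves only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def encode(tree: Node) -> str:
    """Structure bits (0 = internal, 1 = leaf) followed by one label bit per leaf."""
    structure, labels = [], []
    def visit(node: Node) -> None:
        if node.is_leaf:
            structure.append("1")
            labels.append(str(node.label))
        else:
            structure.append("0")
            visit(node.left)
            visit(node.right)
    visit(tree)
    return "".join(structure) + "".join(labels)

# Example: a tree with 3 leaves -> 5 structure bits + 3 label bits.
t = Node(left=Node(label=0),
         right=Node(left=Node(label=1), right=Node(label=0)))
print(encode(t))   # "01011" + "010" = "01011010"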
Uniform Deviation Bound
Decision Tree Selection
Compare with:
Oracle Bound:
Approximation Error (Bias)
Estimation Error (Variance)
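The bound did not survive extraction; penalized selection of this kind typically satisfies an oracle inequality roughly of the following form (a hedged sketch, constants and lower-order terms omitted):

\mathbb{E}\left[ R(\hat{T}_n) \right] - R^* \;\lesssim\; \min_{T} \left\{ \left( R(T) - R^* \right) + \mathrm{pen}_n(T) \right\},

i.e. the selected tree trades off approximation error (bias) against the estimation-error penalty (variance) nearly as well as an oracle with knowledge of the true distribution.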
Rate of Convergence
BUT…
Why too slow?
same number of leaves
Balanced vs. Unbalanced Trees
all |T|-leaf trees are equally favored
Spatial Adaptation
local empirical error
local error
Relative Chernoff Bound
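The formula is again missing; the relative (multiplicative) Chernoff bound for the empirical frequency \hat{p} of n i.i.d. Bernoulli(p) trials reads:

\Pr\left( \hat{p} \le (1 - \gamma)\, p \right) \le e^{-n p \gamma^2 / 2},
\qquad\text{so with probability at least } 1 - \delta,\qquad
p \le \hat{p} + \sqrt{\frac{2\, p \log(1/\delta)}{n}}.

The deviation scales with \sqrt{p}: in cells where the error probability is small the deviation is proportionally small, which is what allows the spatially adaptive leaf penalties that follow.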
Designing Leaf Penalties
prefix code construction:
01 = “right branch”, 11 = “terminate”, 0/1 = “label”
[Figure: tree branches labeled 00 and 01]
code: 010001110
Uniform Deviation Bound
Compare with:
Spatial Adaptivity
Key: local complexity is offset by small volumes!
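The penalty formula is missing from the transcript; as a hedged sketch (constants omitted), spatially adaptive penalties in this setting are roughly of the form

\mathrm{pen}(T) \;\asymp\; \sum_{A \in \mathrm{leaves}(T)} \sqrt{\frac{p(A)\left( d_A \log 2 + \log(1/\delta) \right)}{n}},

where p(A) is the probability (or volume) of cell A and d_A its depth: deep leaves carry more code bits, but this local complexity is offset by their small volumes.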
Bound Comparison for Unbalanced Tree
J leaves, depth J-1
Non-adaptive bound:
Adaptive bound:
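A hedged back-of-the-envelope version of the comparison (constants and log factors suppressed), for the maximally unbalanced tree with J leaves at depths 1, 2, \dots, J-1 and cell volumes p(A) \approx 2^{-d_A}:

\text{non-adaptive:}\quad \mathrm{pen}(T) \asymp \sqrt{\frac{J}{n}},
\qquad
\text{adaptive:}\quad \mathrm{pen}(T) \asymp \sum_{j=1}^{J-1} \sqrt{\frac{j\, 2^{-j}}{n}} = O\!\left( \sqrt{\frac{1}{n}} \right),

since \sum_{j \ge 1} \sqrt{j\, 2^{-j}} converges: the adaptive bound does not grow with the number of leaves, while the non-adaptive one does.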
same number of leaves
Balanced vs. Unbalanced Trees
Decision Tree Selection
Oracle Bound:
Approximation Error
Estimation Error
Rate of Convergence
Computable Penalty
achieves the same rate of convergence
Adapting to Dimension - Feature Rejection
Adapting to Dimension - Data Manifold
Cyclic DDT: force coordinate splits in cyclic order
Free-Split DDT: no order enforcement in splits
Computational Issues: additive penalty
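A minimal sketch (not the authors' implementation) of why an additive penalty matters computationally: the penalized empirical risk decouples across subtrees, so the optimal pruning is found exactly by a single bottom-up pass over the tree. The node fields and penalty values here are illustrative.

# Exact pruning under an additive (per-leaf) penalty via bottom-up dynamic programming.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    err: float                        # empirical error if this node is made a leaf
    pen: float                        # per-leaf penalty contribution
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    keep_split: bool = False          # filled in by prune()

def prune(node: Node) -> float:
    """Return min over prunings of (empirical error + additive penalty) for this subtree."""
    leaf_cost = node.err + node.pen
    if node.left is None or node.right is None:
        return leaf_cost
    split_cost = prune(node.left) + prune(node.right)
    node.keep_split = split_cost < leaf_cost
    return min(leaf_cost, split_cost)   # one pass: time linear in the number of nodes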
DDTs in Action
Comparison to State-of-the-Art
Best results: (1) AdaBoost with RBF-Network, (2) Kernel Fisher Discriminant, (3) SVM with RBF-Kernel
ODCT = DDT + cross-validation
Elevation Map: St. Louis
Noisy data
Level set
Panels: thresholded data, spatially adaptive penalty, penalty proportional to |T|
Application to Level Set Estimation
Conclusions and Future Work
Open Problem:
More Info: www.ece.wisc.edu/~nowak
www.ece.wisc.edu/~nowak/ece901