http://www.isis.ecs.soton.ac.uk
Identifying Feature Relevance Using a Random Forest
Jeremy Rogers & Steve Gunn
Overview
What is a Random Forest?
Why do Relevance Identification?
Estimating Feature Importance with a Random Forest
Node Complexity Compensation
Employing Feature Relevance
Extension to Feature Selection
Random Forest
Combination of base learners using Bagging
Uses CART-based decision trees
Random Forest (cont...)
Optimises split using Information Gain
Selects feature randomly to perform each split
Implicit Feature Selection of CART is removed
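As a rough illustration of the split rule just described (binary labels, numeric features), the sketch below picks a feature at random and then optimises the threshold by information gain; the helper names and the exhaustive threshold search are our own, not the authors' implementation:

import numpy as np

def entropy(y):
    # Binary entropy of a label vector (0 for an empty or pure node).
    p = np.mean(y) if len(y) else 0.0
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def random_feature_split(X, y, rng):
    # Select the splitting feature at random (no implicit feature selection),
    # then optimise the threshold on that feature by information gain.
    f = rng.integers(X.shape[1])
    parent = entropy(y)
    best_gain, best_t = 0.0, None
    for t in np.unique(X[:, f])[:-1]:
        left, right = y[X[:, f] <= t], y[X[:, f] > t]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if parent - child > best_gain:
            best_gain, best_t = parent - child, t
    return f, best_t, best_gain

# usage: f, t, gain = random_feature_split(X, y, np.random.default_rng(0))

Bagging then combines many trees grown this way, each on a bootstrap sample of the data.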
Feature Relevance: Ranking
Analyse Features individually
Measures of Correlation to the target
Feature is relevant if:
$P(Y = y \mid X_i = x_i) \neq P(Y = y)$
Assumes no feature interaction
Fails to identify relevant features in parity problem
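For instance, in the two-bit parity (XOR) problem each feature looks irrelevant when examined on its own; a small illustrative check (the toy dataset and variable names are ours):

# XOR / parity: Y = X1 xor X2, all four combinations equally likely.
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # (x1, x2, y)

p_y1 = sum(y for _, _, y in data) / len(data)                 # P(Y=1) = 0.5
p_y1_given_x1_0 = sum(y for x1, _, y in data if x1 == 0) / 2  # P(Y=1 | X1=0) = 0.5
p_y1_given_x1_1 = sum(y for x1, _, y in data if x1 == 1) / 2  # P(Y=1 | X1=1) = 0.5
# Marginally X1 looks irrelevant: P(Y=1 | X1) equals P(Y=1) for both values.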
Feature Relevance: Subset Methods
Use implicit feature selection of decision tree induction
Wrapper methods
Subset search methods
Identifying Markov Blankets
Feature is relevant if:
$P(Y = y \mid X_i = x_i, S_i = s_i) \neq P(Y = y \mid S_i = s_i)$
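Continuing the same parity example, conditioning on the remaining feature exposes the relevance that the marginal test misses (again only an illustrative check):

# Same XOR data as above: (x1, x2, y).
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

# Condition on the subset S = {X2}: fix X2 = 0.
p_y1_given_x2_0 = sum(y for _, x2, y in data if x2 == 0) / 2           # P(Y=1 | X2=0) = 0.5
p_y1_given_x1_1_x2_0 = sum(y for x1, x2, y in data
                           if x1 == 1 and x2 == 0) / 1                 # P(Y=1 | X1=1, X2=0) = 1.0
# The probabilities differ, so X1 is relevant given the subset S.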
Relevance Identification using Average Information Gain
Can identify feature interaction
Reliability dependent upon node composition
Irrelevant features give non-zero relevance
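Concretely, relevance can be scored by averaging the information gain each feature achieves across all splits in the forest; a minimal sketch, assuming split records of the form (feature index, achieved IG) are collected while the forest is grown (the record format is our assumption):

from collections import defaultdict

def average_information_gain(split_records, n_features):
    # split_records: list of (feature_index, information_gain) pairs,
    # one per node split made anywhere in the forest.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for f, ig in split_records:
        totals[f] += ig
        counts[f] += 1
    # Average IG per feature; features never selected score 0.
    return [totals[f] / counts[f] if counts[f] else 0.0 for f in range(n_features)]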
Node Complexity Compensation
Some nodes are easier to split
Requires each sample to be weighted by some measure of node complexity
Data projected on to one-dimensional space
For Binary Classification:
$C^n_i$ - No. possible arrangements
$i$ - No. positive examples
$n$ - No. examples in node
Unique & Non-Unique Arrangements
Some arrangements are reflections (non-unique)
Some arrangements are symmetrical about their centre (unique)
Node Complexity Compensation (cont…)
Total number of arrangements is $C^n_i$. The number of unique (symmetric) arrangements $A_u$ depends on the parity of $n$ and $i$:

n      i      $A_u$
Odd    Odd    $C^{(n-1)/2}_{(i-1)/2}$
Odd    Even   $C^{(n-1)/2}_{i/2}$
Even   Odd    $0$
Even   Even   $C^{n/2}_{i/2}$

Since reflected arrangements are indistinguishable, the number of distinguishable arrangements is

$A^n_i = \dfrac{C^n_i + A_u}{2}$

The node complexity used to weight each information gain sample is computed from $\log_2(A^n_i + 1)$.

$A_u$ - No. unique arrangements
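A small helper following the table above, counting symmetric and distinguishable arrangements for a node with n examples and i positives (the exact weighting applied to the IG sample is only hinted at on the slide, so it is not reproduced here):

from math import comb

def unique_arrangements(n, i):
    # Arrangements symmetric about their centre (A_u),
    # split by the parity of n and i as in the table above.
    if n % 2 == 1 and i % 2 == 1:
        return comb((n - 1) // 2, (i - 1) // 2)
    if n % 2 == 1 and i % 2 == 0:
        return comb((n - 1) // 2, i // 2)
    if n % 2 == 0 and i % 2 == 1:
        return 0
    return comb(n // 2, i // 2)

def distinguishable_arrangements(n, i):
    # Reflections are indistinguishable, so non-symmetric arrangements
    # collapse in pairs: A = (C(n, i) + A_u) / 2.
    return (comb(n, i) + unique_arrangements(n, i)) // 2

# e.g. n=4, i=2: 6 raw arrangements, 2 symmetric, 4 distinguishable.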
Information Gain Density Functions
Node complexity compensation improves the measure of average IG
The effect is visible when examining the IG density functions for each feature
These are constructed by building a forest and recording the frequencies of IG values achieved by each feature
Information Gain Density Functions
[Figure: information gain density functions (density vs. information gain) for each feature of the artificial dataset]
• RF used to construct 500 trees on an artificial dataset
• IG density functions recorded for each feature
Employing Feature Relevance
Feature Selection
Feature Weighting
Random Forest uses a Feature Sampling distribution to select each feature.
Distribution can be altered in two ways:
Parallel: update during forest construction
Two-stage: fixed prior to forest construction
Parallel
Control update rate using confidence intervals.
Assume Information Gain values have normal distribution.
The statistic $\dfrac{(\bar{X} - \mu)\sqrt{n}}{S}$ has a Student's t distribution with $n-1$ degrees of freedom, giving the confidence interval

$P\left(\bar{X} - \dfrac{qS}{\sqrt{n}} \le \mu \le \bar{X} + \dfrac{qS}{\sqrt{n}}\right) = 1 - 2\alpha$

where $q$ is the upper $\alpha$ quantile of the t distribution, and $\bar{X}$ and $S$ are the sample mean and standard deviation of the $n$ information gain values.
Maintain most uniform distribution within confidence bounds
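A sketch of the confidence computation behind the parallel update, assuming the IG samples gathered so far are available per feature; the projection of the sampling distribution onto "most uniform within the bounds" is left as a comment because the slide does not spell out its exact form:

import numpy as np
from scipy.stats import t

def ig_confidence_intervals(ig_samples, alpha=0.05):
    # ig_samples: list of arrays, one per feature, holding the IG values
    # observed for that feature so far during forest construction.
    intervals = []
    for x in ig_samples:
        x = np.asarray(x, dtype=float)
        n = len(x)
        if n < 2:
            intervals.append((0.0, 1.0))   # no information yet: widest bounds
            continue
        mean, s = x.mean(), x.std(ddof=1)
        q = t.ppf(1 - alpha, df=n - 1)     # upper quantile of Student's t
        half = q * s / np.sqrt(n)
        intervals.append((mean - half, mean + half))
    # The feature sampling distribution is then kept as close to uniform as
    # these interval bounds allow (projection step not specified on the slide).
    return intervals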
Convergence Rates
[Figure: average confidence interval size vs. number of trees for the WBC, Votes, Ionosphere, Friedman, Pima, Sonar and Simple datasets]
Data Set     No. Features   Av. Tree Size
WBC                9             58.3
Votes             16             40.4
Ionosphere        34             70.9
Friedman          10             77.2
Pima               8            269.3
Sonar             60             81.6
Simple             9             57.3
Results
Data Set RF CI 2S CART
WBC 0.0226 0.0259 0.0226
Sonar 0.1657 0.1462 0.1710
Votes 0.0650 0.0493 0.0432
Pima 0.2343 0.2394 0.2474
Ionosphere 0.0725 0.0681 0.0661
Friedman 0.1865 0.1690 0.1490
Simple 0.0937 0.0450 0.0270
• 90% of data used for training, 10% for testing
• Forests of 100 trees were tested and averaged over 100 trials
Irrelevant Features
Average IG is the mean of a non-negative sample.
Expected IG of an irrelevant feature is non-zero.
Performance is degraded when there is a high proportion of irrelevant features.
Expected Information Gain
The expected information gain of an irrelevant feature is obtained by summing, over all possible left-descendant compositions, the probability of each composition multiplied by the information gain it produces:

$E[IG] = \sum_{n_L, i_L} P(n_L, i_L)\, IG(n_L, i_L)$

where the information gain of a composition is computed from the $\log_2$ entropies of the parent node and its descendants.

$n_L$ - No. examples in left descendant
$i_L$ - No. positive examples in left descendant
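The non-zero expectation can also be checked empirically: the sketch below estimates the expected IG of an irrelevant feature by repeatedly permuting the class labels along one dimension and recording the best-split gain (a Monte-Carlo stand-in for the exact sum above; function names and trial count are ours):

import numpy as np

def best_split_ig(labels):
    # Best information gain over all split points of a 1-D arrangement.
    y = np.asarray(labels, dtype=float)
    n = len(y)
    def h(p):
        return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    parent = h(y.mean())
    best = 0.0
    for k in range(1, n):
        left, right = y[:k], y[k:]
        child = (k * h(left.mean()) + (n - k) * h(right.mean())) / n
        best = max(best, parent - child)
    return best

def expected_ig_irrelevant(n, i, trials=10000, seed=0):
    # Average best-split IG over random arrangements of i positives among n,
    # i.e. the expected IG when the splitting feature carries no information.
    rng = np.random.default_rng(seed)
    base = np.array([1] * i + [0] * (n - i))
    return np.mean([best_split_ig(rng.permutation(base)) for _ in range(trials)])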
Bounds on Expected Information Gain
The upper bound can be approximated as

$E[IG]_U \approx \dfrac{0.82}{n}$

The lower bound $E[IG]_L$ is given by a closed-form $\log_2$ expression in the number of examples $n$ in the node.
Irrelevant Features: Bounds
[Figure: average information gain (y-axis) for each feature (x-axis) on the artificial dataset, together with the calculated bounds]
• 100 trees built on artificial dataset
• Average IG recorded and bounds calculated
Friedman
[Figure: frequency with which each feature is selected on the Friedman dataset, one panel for FS and one for CFS]
$Y = 10\sin(\pi X_1 X_2) + 20(X_3 - 0.5)^2 + 10X_4 + 5X_5 + N(0, 1)$
Simple
[Figure: frequency with which each feature is selected on the Simple dataset, one panel for FS and one for CFS]
$Y = X_1^2 + X_2^2$
Results
Data Set     CFS      FW       FS       FW & FS
WBC          0.0235   0.0249   0.0245   0.0249
Sonar        0.2271   0.1757   0.1629   0.1643
Votes        0.0398   0.0464   0.0650   0.0439
Pima         0.2523   0.2312   0.2492   0.2486
Ionosphere   0.0650   0.0683   0.0747   0.0653
Friedman     0.1685   0.1555   0.1420   0.1370
Simple       0.1653   0.0393   0.0283   0.0303
(FW = feature weighting, FS = feature selection)
• 90% of data used for training, 10% for testing
• Forests of 100 trees were tested and averaged over 100 trials
• 100 trees constructed for feature evaluation in each trial