TRANSCRIPT
Automatic Feature Selection
Feb 2015
Update on Hadoop / R
Try HortonWorks Sandbox
Get a VM player
Download and install the OVA (VM file from HortonWorks): http://hortonworks.com/products/hortonworks-sandbox/#install
Do the tutorials here: http://hortonworks.com/tutorials/
Add R / RStudio Server to your VM
Use RHadoop to interface Hadoop and R
Issue
There are many predictive analytical models that will work. Which among the many is best?
Example Data – HVAC building log data
variable                  row 1         row 2         row 3         row 4
date                      6/1/13        6/1/13        6/1/13        6/25/13
time                      0:00:01       0:00:01       0:00:01       0:13:19
target.temp               69            66            69            70
actual.temp               55            58            60            71
system                    14            13            5             19
system.age                6             20            8             14
building.id               17            4             7             18
temp.diff                 14            8             9             -1
temp.range                COLD          COLD          COLD          NORMAL
extreme.temp              1             1             1             0
country                   Egypt         Finland       South Africa  Indonesia
hvac.product              FN39TG        GG1919        FN39TG        JDNS77
building.age              11            17            13            25
building.manager          M17           M4            M7            M18
service.center.distance   150           115           100           68
days.since.service        142           109           164           86
he.efficiency             12            22            2             36
fan.hours                 17            16            15            8
coolant.type              B12           B12           B12           B12
software.release          P10           P10           P10           P10
ave.outside.temp          91            46            77            80
software.P12              0             0             0             0
coolant.B12               1             1             1             1
neg.diff                  1             1             1             -1
abs.diff                  14            8             9             1
diff.size                 3             2             2             1
cut.off                   1             1             1             0
What to look for among models
R-squared (linear models)
Variable Significance
# of Variables that are significant
Sign of Variables
Confusion Matrix “Score” (non-linear models)
AIC number (non-linear models)
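The slides do not say how the confusion matrix is turned into a single "score". One common choice is accuracy, the share of observations on the diagonal. The Python sketch below assumes that definition; the talk itself used R, and the function names and labels here are illustrative, not taken from the slides.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) label pair."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


def accuracy_score(matrix):
    """Assumed 'score': fraction of observations where actual == predicted."""
    total = sum(matrix.values())
    if total == 0:
        raise ValueError("empty confusion matrix")
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total


# Made-up cut.off labels, for illustration only.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]

matrix = confusion_matrix(actual, predicted)
score = accuracy_score(matrix)  # 6 of 8 correct -> 0.75
```

If a different score was intended (for example one that weights false negatives more heavily), only `accuracy_score` would need to change.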
What to look for among models
Variables and Significance
AIC Score
Confusion Matrix
Confusion Matrix Score
Hand Done Model Outcome
Approach
Calculate the combinations of all independent variables
Write a function to:
  Run each model possibility
  For X (~10) samples of training / test data sets
  Collect:
    # of variables that have significance < .1
    “score” of the confusion matrix
Multiply the # of significant variables by the confusion matrix score, average over the sampling range, sort the results data frame
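The scoring rule in the approach can be sketched in a few lines. This is a Python illustration under assumptions: each train/test sample is taken to yield a pair (number of significant variables, confusion matrix score), and the product is averaged across samples. The function name and the sample values are hypothetical.

```python
from statistics import mean


def weighted_score(sample_results):
    """Summarize one model across its train/test samples.

    sample_results: list of (n_significant, matrix_score) pairs,
    one pair per sample. Returns (matrix_mean, sig_mean, weighted).
    """
    if not sample_results:
        raise ValueError("need at least one sample result")
    matrix_mean = mean(score for _, score in sample_results)
    sig_mean = mean(n for n, _ in sample_results)
    # Multiply within each sample first, then average over the samples.
    weighted = mean(n * score for n, score in sample_results)
    return matrix_mean, sig_mean, weighted


# Made-up results for three samples of a single model.
results = [(6, 0.80), (5, 0.75), (6, 0.85)]
matrix_mean, sig_mean, weighted = weighted_score(results)
```

Multiplying before averaging rewards models that are both significant and accurate on the same sample, rather than on average.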
Step 1 – set up empty data frame to hold results
Step 2 – calculate all combinations of variables
Step 3 – run function to estimate all models and save parameters
Step 4 – average all models and sort
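The four steps could look roughly like this. The code on the original slides is not in the transcript, so this is a Python sketch rather than the R that was presented. `evaluate` is a hypothetical callback that would fit one model on one train/test sample and return (number of significant variables, confusion matrix score); `dummy_evaluate` only stands in for it with made-up values.

```python
from itertools import combinations
from statistics import mean


def build_formulas(target, predictors, size):
    """Step 2: one formula string per combination of `size` predictors."""
    return [f"{target} ~ " + " + ".join(combo)
            for combo in combinations(predictors, size)]


def rank_models(formulas, evaluate, n_samples=10):
    """Steps 1, 3 and 4: collect results per model, average, sort."""
    rows = []  # Step 1: empty container for the results
    for formula in formulas:
        # Step 3: run the model once per train/test sample
        runs = [evaluate(formula, sample) for sample in range(n_samples)]
        rows.append({
            "model": formula,
            "matrix_mean": mean(score for _, score in runs),
            "sig_mean": mean(n for n, _ in runs),
            "weighted": mean(n * score for n, score in runs),
        })
    # Step 4: best weighted score first
    return sorted(rows, key=lambda row: row["weighted"], reverse=True)


def dummy_evaluate(formula, sample):
    """Placeholder only: a real version would fit and test the model."""
    n_terms = formula.count("+") + 1
    return n_terms, 0.5  # made-up (n_significant, matrix_score)


predictors = ["system", "system.age", "building.id", "building.age"]
formulas = build_formulas("cut.off", predictors, size=2)
ranked = rank_models(formulas, dummy_evaluate, n_samples=3)
```

With four predictors taken two at a time this produces six candidate formulas, the first being `cut.off ~ system + system.age`.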
Average of Top Models Are …
Model MatrixMean SigMean Weighted
cut.off ~ + system + building.id + hvac.product + building.age + building.manager + coolant.type + software.P12 0.79 5.60 4.45
cut.off ~ + system.age + building.id + hvac.product + building.age + building.manager + he.efficiency + coolant.type 0.88 5.00 4.39
cut.off ~ + building.id + hvac.product + building.age + building.manager + coolant.type + software.release + ave.outside.temp 0.85 4.90 4.17
cut.off ~ + system + building.id + hvac.product + building.manager + service.center.distance + coolant.type + ave.outside.temp 0.77 4.30 3.30
cut.off ~ + building.id + service.center.distance + days.since.service + fan.hours + coolant.type + software.release + software.P12 0.91 3.60 3.28
cut.off ~ + system + system.age + building.id + days.since.service + fan.hours + ave.outside.temp + software.P12 0.86 3.80 3.25
cut.off ~ + system + system.age + building.id + building.age + days.since.service + fan.hours + software.P12 0.84 3.80 3.18
cut.off ~ + building.id + country + building.manager + service.center.distance + days.since.service + fan.hours + coolant.type 0.88 3.60 3.17
cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.P12 0.87 3.60 3.14
cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.release 0.85 3.70 3.14
cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + software.P12 0.89 3.50 3.11
cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + ave.outside.temp 0.89 3.50 3.10
cut.off ~ + building.id + building.age + building.manager + service.center.distance + days.since.service + he.efficiency + coolant.type 0.88 3.50 3.09
cut.off ~ + building.id + country + building.manager + days.since.service + coolant.type + ave.outside.temp + software.P12 0.85 3.60 3.06
cut.off ~ + building.id + hvac.product + building.age + fan.hours + software.release + ave.outside.temp + software.P12 0.81 3.70 3.00
cut.off ~ + hvac.product + building.age + days.since.service + he.efficiency + coolant.type + ave.outside.temp + software.P12 0.91 3.30 3.00
Each of these should be tested again
More extensive use of varied train / test data sample sets
Stability of each model beyond the scoring
Chosen model “makes sense”
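One way to get more varied train / test sample sets is to redraw the split many times with different random shuffles. The Python sketch below only produces row indices; the 30% test share, the seed and the function name are assumptions for illustration, not values from the talk.

```python
import random


def repeated_splits(n_rows, n_repeats=10, test_fraction=0.3, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random shuffles."""
    if n_rows < 2:
        raise ValueError("need at least two rows to split")
    rng = random.Random(seed)  # fixed seed keeps the splits reproducible
    n_test = min(n_rows - 1, max(1, round(n_rows * test_fraction)))
    for _ in range(n_repeats):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        yield sorted(idx[n_test:]), sorted(idx[:n_test])


splits = list(repeated_splits(n_rows=20, n_repeats=5))
```

Each top model can then be refit on every split, and the spread of its scores across splits gives a rough read on stability.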
Alternative ways to do this …
Caret Package function “rfe” (recursive feature elimination)
Try all variables first
Train and test the model with cross-validation
Calculate the most important variables
Eliminate the least important variables
Train and test the model again
Calculate the most important variables
Eliminate the least important variables
Repeat …
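The elimination loop described above can be sketched generically. This is not caret's implementation, which has more options than shown here; it is a minimal Python illustration that drops one variable per round. `fit_and_rank` is a hypothetical callback that would train and test with cross-validation and return a score plus an importance value per variable; `toy_fit_and_rank` fakes that with made-up numbers.

```python
def recursive_feature_elimination(features, fit_and_rank, min_features=1):
    """Drop the least important feature each round; keep the best subset.

    fit_and_rank(features) -> (cv_score, {feature: importance})
    """
    current = list(features)
    history = []
    while True:
        score, importance = fit_and_rank(current)
        history.append((score, list(current)))
        if len(current) <= min_features:
            break
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    best_score, best_subset = max(history, key=lambda item: item[0])
    return best_subset, best_score, history


# Made-up importances: two useful variables and two noise variables.
TRUE_IMPORTANCE = {"a": 3.0, "b": 2.0, "noise1": 0.1, "noise2": 0.05}


def toy_fit_and_rank(features):
    """Stand-in for a cross-validated fit; noise variables cost 0.1 each."""
    useful = sum(TRUE_IMPORTANCE[f] for f in features if f in ("a", "b"))
    n_noise = sum(1 for f in features if f.startswith("noise"))
    score = useful / 5.0 - 0.1 * n_noise
    return score, {f: TRUE_IMPORTANCE[f] for f in features}


best_subset, best_score, history = recursive_feature_elimination(
    ["a", "b", "noise1", "noise2"], toy_fit_and_rank
)
```

In this toy run the two noise variables are removed first, and the subset with `a` and `b` scores best.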
Setting it up & running RFE
data frame of predictor variables
vector of outcome variable
max number of variables to keep
control functions
run recursive elimination model
Outcome of the RFE
Problems
Number of variable combinations can get HUGE
Might need multicore or parallel to get through it
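To see how fast the search space grows, the number of candidate models can be counted before anything is fit. The predictor counts below (15 and 30) and the subset size of 7 are illustrative assumptions, not figures from the talk.

```python
from math import comb


def count_models(n_predictors, size=None):
    """Number of candidate models.

    size=None counts every non-empty subset of the predictors;
    otherwise only subsets with exactly `size` predictors.
    """
    if size is None:
        return 2 ** n_predictors - 1
    return comb(n_predictors, size)


def total_fits(n_predictors, size=None, n_samples=10):
    """Model fits needed when each candidate is run on n_samples splits."""
    return count_models(n_predictors, size) * n_samples


seven_of_fifteen = count_models(15, 7)    # 6435
all_of_fifteen = count_models(15)         # 32767
fits_seven_of_fifteen = total_fits(15, 7) # 64350
all_of_thirty = count_models(30)          # 1073741823
```

Doubling the number of predictors from 15 to 30 takes the all-subsets count from about thirty thousand to over a billion, which is where parallel runs or a method like RFE become necessary.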
Thank You
Brooke Aker