Automatic Feature Selection Feb 2015

Upload: matteo-hakey

Post on 15-Dec-2015

Automatic Feature Selection

Feb 2015

Update on Hadoop / R

Try the HortonWorks Sandbox

Get a VM player

Download and install the OVA (VM file from HortonWorks): http://hortonworks.com/products/hortonworks-sandbox/#install

Do the tutorials here: http://hortonworks.com/tutorials/

Add R / RStudio Server to your VM

Use RHadoop to interface Hadoop and R

Issue

There are many predictive analytical models that will work. Which among the many is best?

Example Data – HVAC building log data

variable                  record 1   record 2   record 3       record 4
date                      6/1/13     6/1/13     6/1/13         6/25/13
time                      0:00:01    0:00:01    0:00:01        0:13:19
target.temp               69         66         69             70
actual.temp               55         58         60             71
system                    14         13         5              19
system.age                6          20         8              14
building.id               17         4          7              18
temp.diff                 14         8          9              -1
temp.range                COLD       COLD       COLD           NORMAL
extreme.temp              1          1          1              0
country                   Egypt      Finland    South Africa   Indonesia
hvac.product              FN39TG     GG1919     FN39TG         JDNS77
building.age              11         17         13             25
building.manager          M17        M4         M7             M18
service.center.distance   150        115        100            68
days.since.service        142        109        164            86
he.efficiency             12         22         2              36
fan.hours                 17         16         15             8
coolant.type              B12        B12        B12            B12
software.release          P10        P10        P10            P10
ave.outside.temp          91         46         77             80
software.P12              0          0          0              0
coolant.B12               1          1          1              1
neg.diff                  1          1          1              -1
abs.diff                  14         8          9              1
diff.size                 3          2          2              1
cut.off                   1          1          1              0

What to look for among models

R-squared (linear models)

Variable significance

Number of variables that are significant

Sign of variables

Confusion matrix “score” (non-linear models)

AIC value (non-linear models)
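
The slides do not say how a confusion matrix is reduced to a single “score”. One common choice is accuracy, the share of observations that fall on the diagonal. The sketch below assumes that definition; it is written in Python rather than the R used in the deck, and the function names and sample labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy_score(matrix):
    """Share of observations on the diagonal (assumed meaning of 'score')."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else 0.0


# Illustrative labels only, not taken from the HVAC data.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]

cm = confusion_matrix(actual, predicted)
score = accuracy_score(cm)
print(dict(cm), score)
```

If the classes are unbalanced, another summary (balanced accuracy, for instance) could be substituted in `accuracy_score` without changing the rest of the procedure.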

What to look for among models

Variables and Significance

AIC Score

Confusion Matrix

Confusion Matrix Score

Hand Done Model Outcome

Approach

Calculate all combinations of the independent variables

Write a function to:

  Run each candidate model

  Repeat for X (~10) samples of training / test data sets

  Collect:

    The number of variables with significance < 0.1

    The “score” of the confusion matrix

Multiply the number of significant variables by the confusion matrix score, average over the sampling range, and sort the results data frame

Step 1 – set up empty data frame to hold results
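
The code for this step is not preserved in the transcript. As a rough illustration, an empty results container could look like the Python sketch below (the deck itself uses an R data frame). The field names are assumptions, loosely modelled on the Model / MatrixMean / SigMean / Weighted columns of the results table.

```python
from dataclasses import dataclass


@dataclass
class ResultRow:
    """One model fitted on one train/test sample (hypothetical layout)."""
    model: str            # formula text, e.g. "cut.off ~ system + building.id"
    sample: int           # index of the train/test sample
    matrix_score: float   # confusion matrix score for this fit
    n_significant: int    # number of variables with significance < 0.1

    @property
    def weighted(self) -> float:
        # Score multiplied by the count of significant variables.
        return self.matrix_score * self.n_significant


# Empty container, filled later as each model is estimated.
results: list[ResultRow] = []
print(len(results))
```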

Step 2 – calculate all combinations of variables
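
The original code for this step is missing from the transcript. A minimal Python sketch of the idea, assuming fixed-size subsets of predictors are turned into formula strings (the helper name `build_formulas` is hypothetical, and only four of the predictors are used to keep the output short):

```python
from itertools import combinations


def build_formulas(outcome, predictors, size):
    """Return one formula string per subset of `size` predictors."""
    return [
        f"{outcome} ~ " + " + ".join(subset)
        for subset in combinations(predictors, size)
    ]


predictors = ["system", "system.age", "building.id", "building.age"]
formulas = build_formulas("cut.off", predictors, 2)

for formula in formulas:
    print(formula)
```

With 4 predictors taken 2 at a time this gives 6 formulas; the count grows quickly as the pool and subset size increase.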

Step 3 – run function to estimate all models and save parameters
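
The function shown on this slide did not survive extraction. The Python sketch below shows only the shape of such an evaluation loop: every formula is fitted on several random train/test splits and the two measures are recorded. The model fitting itself is left to a caller-supplied `fit_and_score` function, and `placeholder_fit` is a stand-in that fits nothing; both names, the 30% test fraction, and the row layout are assumptions.

```python
import random


def split_rows(rows, test_fraction, rng):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def evaluate_models(formulas, rows, fit_and_score, n_samples=10, seed=0):
    """Fit every formula on n_samples random splits and collect the results."""
    rng = random.Random(seed)
    results = []
    for sample in range(n_samples):
        train, test = split_rows(rows, 0.3, rng)
        for formula in formulas:
            matrix_score, n_significant = fit_and_score(formula, train, test)
            results.append({
                "model": formula,
                "sample": sample,
                "matrix_score": matrix_score,
                "n_significant": n_significant,
                "weighted": matrix_score * n_significant,
            })
    return results


def placeholder_fit(formula, train, test):
    """NOT a real model: returns a fixed score and the number of terms."""
    return 0.5, formula.count("+") + 1


rows = list(range(20))  # stand-in for the HVAC records
results = evaluate_models(["y ~ a", "y ~ a + b"], rows, placeholder_fit, n_samples=3)
print(len(results))
```

In practice `fit_and_score` would fit the model on `train`, count the coefficients with significance below 0.1, and score the confusion matrix on `test`.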

Step 4 – average all models and sort
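
The code for this step is also missing. A Python sketch of the averaging and sorting, assuming one result row per model per sample with hypothetical field names. The weighted value is computed per sample and then averaged, following the order described in the approach, so it need not equal the product of the two averages. The numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarise(results):
    """Average each model over its samples and sort by weighted score."""
    by_model = defaultdict(list)
    for row in results:
        by_model[row["model"]].append(row)

    summary = []
    for model, rows in by_model.items():
        summary.append({
            "model": model,
            "matrix_mean": mean(r["matrix_score"] for r in rows),
            "sig_mean": mean(r["n_significant"] for r in rows),
            # Multiply within each sample first, then average.
            "weighted": mean(r["matrix_score"] * r["n_significant"] for r in rows),
        })
    return sorted(summary, key=lambda s: s["weighted"], reverse=True)


# Illustrative rows only, not results from the deck.
results = [
    {"model": "A", "matrix_score": 0.8, "n_significant": 5},
    {"model": "A", "matrix_score": 0.9, "n_significant": 3},
    {"model": "B", "matrix_score": 0.9, "n_significant": 6},
    {"model": "B", "matrix_score": 0.7, "n_significant": 4},
]
ranking = summarise(results)
for row in ranking:
    print(row)
```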

Average of Top Models Are …

Model MatrixMean SigMean Weighted

cut.off ~ + system + building.id + hvac.product + building.age + building.manager + coolant.type + software.P12 0.79 5.60 4.45

cut.off ~ + system.age + building.id + hvac.product + building.age + building.manager + he.efficiency + coolant.type 0.88 5.00 4.39

cut.off ~ + building.id + hvac.product + building.age + building.manager + coolant.type + software.release + ave.outside.temp 0.85 4.90 4.17

cut.off ~ + system + building.id + hvac.product + building.manager + service.center.distance + coolant.type + ave.outside.temp 0.77 4.30 3.30

cut.off ~ + building.id + service.center.distance + days.since.service + fan.hours + coolant.type + software.release + software.P12 0.91 3.60 3.28

cut.off ~ + system + system.age + building.id + days.since.service + fan.hours + ave.outside.temp + software.P12 0.86 3.80 3.25

cut.off ~ + system + system.age + building.id + building.age + days.since.service + fan.hours + software.P12 0.84 3.80 3.18

cut.off ~ + building.id + country + building.manager + service.center.distance + days.since.service + fan.hours + coolant.type 0.88 3.60 3.17

cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.P12 0.87 3.60 3.14

cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.release 0.85 3.70 3.14

cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + software.P12 0.89 3.50 3.11

cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + ave.outside.temp 0.89 3.50 3.10

cut.off ~ + building.id + building.age + building.manager + service.center.distance + days.since.service + he.efficiency + coolant.type 0.88 3.50 3.09

cut.off ~ + building.id + country + building.manager + days.since.service + coolant.type + ave.outside.temp + software.P12 0.85 3.60 3.06

cut.off ~ + building.id + hvac.product + building.age + fan.hours + software.release + ave.outside.temp + software.P12 0.81 3.70 3.00

cut.off ~ + hvac.product + building.age + days.since.service + he.efficiency + coolant.type + ave.outside.temp + software.P12 0.91 3.30 3.00

Each of these should be tested again

More extensive use of varied train / test data sample sets

Stability of each model beyond the scoring

Chosen model “makes sense”

Alternative ways to do this …

Caret package function “rfe” (recursive feature elimination):

Try all variables first

Train and test the model with cross-validation

Calculate the most important variables

Eliminate the least important variables

Train and test the model again

Calculate the most important variables

Eliminate the least important variables

Repeat …
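
The elimination loop described above can be sketched generically. This Python version is not caret's implementation: it drops one variable per round, and `importance_fn` and `score_fn` are hypothetical stand-ins for the variable-importance calculation and the cross-validated performance measure. The toy importances are invented so the example runs on its own.

```python
def recursive_feature_elimination(features, importance_fn, score_fn, min_features=1):
    """Drop the least important feature each round; keep the best-scoring subset."""
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > min_features:
        importances = importance_fn(current)  # dict: feature -> importance
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
        score = score_fn(current)
        if score >= best_score:  # ties favour the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy stand-ins (assumptions): fixed importances and a score that
# penalises each extra variable.
TOY_IMPORTANCE = {"a": 0.5, "b": 0.3, "c": 0.01, "d": 0.0}


def toy_importance(subset):
    return {f: TOY_IMPORTANCE[f] for f in subset}


def toy_score(subset):
    return sum(TOY_IMPORTANCE[f] for f in subset) - 0.05 * len(subset)


best_subset, best_score = recursive_feature_elimination(
    ["a", "b", "c", "d"], toy_importance, toy_score
)
print(best_subset, round(best_score, 2))
```

Unlike the exhaustive search, this fits on the order of one model per variable rather than one per combination, at the cost of never revisiting a variable once it is dropped.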

Setting it up & running RFE

data frame of predictor variables

vector of outcome variable

max number of variables to keep

control functions

run recursive elimination model

Outcome of the RFE

Problems

Number of variables combinations can get HUGE

Might need multicore or parallel to get through it
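
To put a number on “HUGE”: 15 distinct predictors appear in the results table and every listed model uses 7 of them. Assuming those 15 were the whole candidate pool (the slides do not state this), the counts work out as in this short Python sketch.

```python
from math import comb

n_predictors = 15     # assumption: distinct predictors seen in the results table
subset_size = 7       # predictors per model in the results table
samples_per_model = 10

n_models = comb(n_predictors, subset_size)   # models with exactly 7 predictors
n_all_subsets = 2 ** n_predictors - 1        # every non-empty subset
n_fits = n_models * samples_per_model        # fits across ~10 train/test samples

print(n_models, n_all_subsets, n_fits)
```

Each added predictor doubles the number of non-empty subsets, which is why the search soon needs multiple cores or a different strategy.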

Thank You

Brooke Aker

[email protected]