
Modeling with Rules

Cynthia Rudin, Assistant Professor of Statistics
Massachusetts Institute of Technology

joint work with:
David Madigan (Columbia), Allison Chang, Ben Letham (MIT PhD students),
Dimitris Bertsimas (MIT), Tyler McCormick (UW), Gene Kogan (Independent)

Would like predictive models that are both accurate and interpretable.

Accuracy = classification accuracy
Interpretability = ?

Interpretability =
  concise - the model is small
  convincing - there are reasons behind each prediction

Decision List

Traffic jam in Boston?

fenway_park=1            → 1    97/100 times
rush_hour=0              → -1   474/523 times
rain=0, construction=0   → -1   329/482 times
Friday=1                 → -1   3/3 times
rain=1                   → 1    452/892 times
otherwise                → -1   10/15 times
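As a sketch, a decision list like the one above can be evaluated by checking rules in order and letting the first rule that applies make the prediction. The rule list mirrors the traffic example on the slide; `predict` is a hypothetical helper, not code from the talk:

```python
# Ordered rule list from the traffic-jam example: (conditions, prediction).
RULES = [
    ({"fenway_park": 1}, 1),
    ({"rush_hour": 0}, -1),
    ({"rain": 0, "construction": 0}, -1),
    ({"Friday": 1}, -1),
    ({"rain": 1}, 1),
]
DEFAULT = -1  # the "otherwise" rule

def predict(example):
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, label in RULES:
        if all(example.get(f) == v for f, v in conditions.items()):
            return label
    return DEFAULT

print(predict({"fenway_park": 1, "rush_hour": 1, "rain": 0}))  # 1
print(predict({"fenway_park": 0, "rush_hour": 0}))             # -1
```

This first-match-wins structure is what makes each prediction "convincing": the single rule that fired is the reason for the prediction.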

Modeling with Rules: Dichotomy in the State of the Art

Accuracy vs. Interpretability:
Support Vector Machines and Boosted Decision Trees (accurate) vs. Decision Trees (interpretable)

Modeling with Rules: Daydreaming

•  Nice if the whole algorithm were interpretable, OR
•  Want the accuracy of SVM/Boosted DT and the interpretability of Decision Trees.

•  Part 1: Humans can interpret the predictions, and understand the full algorithm

•  Part 2: Bayesian hierarchical modeling with rules

•  Part 3: Accurate rule classifiers using MIO

Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan) - COLT 2011

A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction (McCormick, R, Madigan) - Annals of Applied Statistics, forthcoming 2012

Ordered Rules for Classification: A Discrete Optimization Approach to Associative Classification (Bertsimas, Chang, R) - In progress

Outline

Association Rule Mining: (Agrawal, Imielinski, Swami, 1993) & (Agrawal and Srikant, 1994)

construction=1 & rain=1 → traffic=1

15 times we saw construction and rain, and 13 out of 15 of those times we also saw traffic:

Supp(construction=1 & rain=1) = 15
Supp(traffic=1 & construction=1 & rain=1) = 13

Conf(construction=1 & rain=1 → traffic=1) = 13/15
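Support and confidence follow directly from their definitions. As a sketch, the small dataset below is made up so that the counts match the slide's example (15 observations with construction and rain, 13 of them also with traffic); `supp` and `conf` are illustrative helpers:

```python
# Illustrative observations chosen to reproduce the counts on the slide.
data = (
    [{"construction": 1, "rain": 1, "traffic": 1}] * 13
    + [{"construction": 1, "rain": 1, "traffic": 0}] * 2
    + [{"construction": 0, "rain": 0, "traffic": 0}] * 10
)

def supp(conditions, data):
    """Supp: number of observations satisfying every condition."""
    return sum(all(x[f] == v for f, v in conditions.items()) for x in data)

def conf(lhs, rhs, data):
    """Conf(lhs -> rhs) = Supp(lhs & rhs) / Supp(lhs)."""
    return supp({**lhs, **rhs}, data) / supp(lhs, data)

print(supp({"construction": 1, "rain": 1}, data))                  # 15
print(conf({"construction": 1, "rain": 1}, {"traffic": 1}, data))  # 0.8666...
```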

"Max Confidence, Min Support" Algorithm

Step 1. Find all rules a → b where Supp(a) ≥ θ.

Step 2. Rank rules in descending order of Conf(a → b), breaking ties by Supp(a); recommend the right hand side of the first rule that applies.

Supp(a)    Conf(a → b)
  15        13/15 = .867
  25        20/25 = .8
  17        12/17 = .706
  50        34/50 = .68


Which rule is preferable: Conf=.99 with Supp=10000, or Conf=1 with Supp=10?


Bayesian version of the confidence:

AdjustedConf(a → b) := Supp(a & b) / (Supp(a) + K)

"Adjusted Confidence" Algorithm

Step 1. Find all rules a → b.

Step 2. Rank rules in descending order of AdjustedConf(a → b), breaking ties by Supp(a); recommend the right hand side of the first rule that applies.

With K = 5:

Supp(a)    AdjustedConf(a → b)
  25        20/(25+5) = .67
  15        13/(15+5) = .65
  50        34/(50+5) = .62
  17        12/(17+5) = .55
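A minimal sketch of how K changes the ranking, using the four (Supp(a), Supp(a&b)) pairs from the two tables. With K=0 the adjusted confidence reduces to plain confidence; with K=5 the ordering changes as shown on the slide:

```python
# (Supp(a), Supp(a&b)) for the four rules in the tables above.
rules = [(15, 13), (25, 20), (17, 12), (50, 34)]

def adjusted_conf(supp_a, supp_ab, K):
    """AdjustedConf = Supp(a&b) / (Supp(a) + K)."""
    return supp_ab / (supp_a + K)

for K in (0, 5):
    ranking = sorted(rules, key=lambda r: -adjusted_conf(*r, K))
    print(K, ranking)
# K=0 ranks (15, 13) first (conf .867); K=5 ranks (25, 20) first (.67):
# among rules with similar confidence, the higher-support rule wins.
```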


•  Rare rules can be used
•  Among rules with similar confidence, prefers rules with higher support
•  K encourages larger support, helps with prediction


– Humans can understand the prediction, and the algorithm
– Good for sequential event problems, where a set of events happen in a particular order
   •  e.g., for predicting what a customer will put next into an online shopping cart, or for predicting medical symptoms in a sequence
– Having a larger K helps with generalization
   •  algorithmic stability (pointwise hypothesis stability)
   •  other learning theoretic implications
– Performs better empirically than the Max-Conf, Min-Support classifiers in our experiments

A Learning Theory Framework for Association Rules and Sequential Events (R, Letham, Kogan, Madigan) - SSRN 2011

Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan) - COLT 2011

•  Part 1: Humans can interpret the predictions, and understand the full algorithm

•  Part 2: Bayesian hierarchical modeling with rules

•  Part 3: Accurate rule classifiers using MIO

Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan) - COLT 2011

A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction (McCormick, R, Madigan) - Annals of Applied Statistics, forthcoming 2012

Ordered Rules for Classification: A Discrete Optimization Approach to Associative Classification (Bertsimas, Chang, R) - In progress

Outline

Recommender Systems for Medical Conditions

Prediction based on your medical history:

Input medical condition:

Example medical histories:

dyspepsia & epigastric pain, heartburn, depression, high blood pressure

Gastroesophageal reflux, high blood pressure

heartburn, headache, dyspepsia

fungal infection, heartburn

epigastric pain, hypertension, dyspepsia

Recommendations: 1. rhinitis  2. dyspepsia  3. low back pain

Recommendations: 1. dyspepsia  2. high blood pressure  3. low back pain

Recommendations: 1. epigastric pain  2. heartburn  3. high blood pressure

Medical Condition Prediction

Hierarchical Association Rule Model (HARM)

i: patient index;  r: rule index, for rule lhs_r → rhs_r

y_ir := Supp_i(rhs_r & lhs_r)
n_ir := Supp_i(lhs_r)

We'll model:

y_ir ~ Binomial(n_ir, p_ir)
p_ir ~ Beta(π_ir, τ_i)    (information shared across individuals)

Under this model,

E(p_ir | y_ir, n_ir) = (y_ir + π_ir) / (n_ir + π_ir + τ_i).

π_ir = exp(M_i' β_r + γ_i)

Hierarchical Association Rule Model (HARM)

π_ir = exp(M_i' β_r + γ_i)

M ∈ R^(I×D)  (observable characteristics)

Example: π_ir = exp(β_{r,0} + β_{r,1} 1_male + γ_i) = exp(β_{r,1} 1_male) · exp(β_{r,0} + γ_i)


Hierarchical Association Rule Model (HARM)

log(τ_i) ~ Normal(0, σ_τ²)
log(β_rd) ~ Normal(μ_β, σ_β²)
log(γ_i) ~ Normal(μ_γ, σ_γ²)

diffuse uniform priors on μ_β, σ_β², σ_τ²

HARM estimates the posterior distribution (MCMC), then ranks rules by posterior mean.
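The Beta-Binomial conjugacy behind HARM's shrinkage can be sketched directly: with p ~ Beta(π, τ) (shape parameters) and y ~ Binomial(n, p), the posterior mean of p is (y + π)/(n + π + τ), matching the formula above. The numbers below are illustrative, not from the paper:

```python
def posterior_mean(y, n, pi, tau):
    """E(p | y, n) under the Beta(pi, tau) prior and Binomial(n, p) likelihood."""
    return (y + pi) / (n + pi + tau)

# A rule observed 13 times out of 15 for one patient, shrunk toward the
# prior mean pi / (pi + tau):
print(posterior_mean(y=13, n=15, pi=1.0, tau=1.0))  # 14/17 ≈ 0.8235
# With no observations, the estimate falls back entirely on the prior:
print(posterior_mean(y=0, n=0, pi=1.0, tau=1.0))    # 0.5
```

This is why HARM can rank rules sensibly even for patients with few encounters: the hierarchical prior supplies the estimate where the data are thin.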

Hierarchical Association Rule Model (HARM)

•  43,000 patient encounters
•  ~2,300 patients, age > 40
•  pre-existing conditions dealt with separately
•  used the 25 most common conditions, and the 25 least common conditions

For trials = 1:500
•  Form training and test sets:
   – sample ~200 patients
   – for each patient, randomly split encounters into training and test
•  For each patient, iteratively make predictions on test encounters
   – get 1 point whenever our top 3 recommendations contain the patient's next condition
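The scoring loop above can be sketched as follows. `recommend` is a hypothetical stand-in for any of the compared methods (HARM, adjusted confidence, etc.), and the condition names are illustrative:

```python
def top3_score(test_sequence, recommend):
    """Fraction of test encounters where the next condition was in the top 3.

    `recommend(history)` must return a ranked list of condition names.
    """
    points, history = 0, []
    for condition in test_sequence:
        if condition in recommend(history)[:3]:
            points += 1
        history.append(condition)  # reveal the condition, then continue
    return points / len(test_sequence)

# Trivial recommender that always predicts the same three conditions:
always = lambda history: ["dyspepsia", "hypertension", "heartburn"]
print(top3_score(["heartburn", "rhinitis", "dyspepsia"], always))  # 0.666...
```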

[Figure (a), All patients: boxplots of the proportion of correct predictions (y-axis 0.0 to 0.6) for HARM, Conf., Adj. K=.25, Adj. K=.5, Adj. K=1, Adj. K=2, Thresh.=2, Thresh.=3.]

Myocardial infarction in patients with hypertension, in treatment (T) and placebo (P) groups

[Figure: rescaled risk for HARM vs. Confidence, by age group (40-50, 51-60, 61-70, over 70) and by placebo (P) / treatment (T) group. Key: middle half, middle 90%, mean of posterior means.]

Myocardial infarction in patients with high cholesterol, in treatment (T) and placebo (P) groups

[Figure: rescaled risk for HARM vs. Confidence, by age group (40-50, 51-60, 61-70, over 70) and by placebo (P) / treatment (T) group. Key: middle half, middle 90%, mean of posterior means.]

•  Part 1: Humans can interpret the predictions, and understand the full algorithm

•  Part 2: Bayesian hierarchical modeling with rules

•  Part 3: Accurate rule classifiers using MIO

Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan) - COLT 2011

A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction (McCormick, R, Madigan) - Annals of Applied Statistics, forthcoming 2012

Ordered Rules for Classification: A Discrete Optimization Approach to Associative Classification (Bertsimas, Chang, R) - In progress

Outline

Mixed Integer Optimization

•  MIO/MIP is a style of mathematical programming
•  Not generally used for ML - perception from the 1970's that MIOs are intractable
•  Not all valid MIO formulations are equally strong
•  Can use LP relaxations for very large scale problems
•  Association rules historically plagued by "combinatorial explosion"...

Ordered Rules for Classification

•  Minimize misclassification error, regularize by the height of the highest null rule.

"null rules": the higher one predicts the default class and ends the list.

MIO Learning Algorithm

Maximize classification accuracy
Maximize rank of the highest null rule (regularization)

Experiments

•  Five algorithms
   – Logistic Regression (LogReg)
   – Support Vector Machines / RBF kernel (SVM)
   – Classification and Regression Trees (CART)
   – Boosted Decision Trees (AdaBoost)
   – Ordered Rules for Classification (ORC)
•  Several publicly available datasets (UCI)
•  Accuracy averaged over 3 folds

Classification Accuracy

CART on Tic Tac Toe

[Figure: CART tree for the Tic Tac Toe endgame dataset, splitting on board cells (x, o, ~x, ~o) with yes/no branches and leaf values such as .26, .47, .92.]

CART accuracy = 0.9388715

ORC on Tic Tac Toe

[Figure: the ORC decision list, rules 1-8, one per winning pattern for x (three in a row along each row, column, and diagonal); each predicts "x wins", and the default rule predicts "x does not win".]

ORC accuracy = 1

MONKS Problems 1

•  6 integer-valued features taking values 1, 2, 3, 4
•  Examples are in class 1 if either a1=a2 or a5=1

CART on MONKS Problems 1

•  Examples are in class 1 if either a1=a2 or a5=1

ORC on MONKS Problems 1

•  Examples are in class 1 if either a1=a2 or a5=1

a1=3, a2=3 → 1    (33/33)
a1=2, a2=2 → 1    (30/30)
a5=1       → 1    (65/65)
a1=1, a2=1 → 1    (31/31)
∅          → -1   (152/288)
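The ORC rule list above, read top to bottom with first match winning, reproduces the target concept exactly, which is why ORC reaches accuracy 1. As a sketch, the enumeration below uses the attribute ranges from the standard MONK's dataset description (a1, a2, a4 ∈ {1,2,3}; a3, a6 ∈ {1,2}; a5 ∈ {1,2,3,4}), an assumption not spelled out on the slide, under which the three a1=a2 rules cover every equal pair:

```python
from itertools import product

def orc_monks1(a1, a2, a3, a4, a5, a6):
    # Ordered rule list from the slide: first matching rule predicts.
    if a1 == 3 and a2 == 3: return 1
    if a1 == 2 and a2 == 2: return 1
    if a5 == 1:             return 1
    if a1 == 1 and a2 == 1: return 1
    return -1  # default (empty-condition) rule

def target(a1, a2, a3, a4, a5, a6):
    """MONK's problem 1 target concept."""
    return 1 if (a1 == a2 or a5 == 1) else -1

# Enumerate the full instance space and confirm the list is exact:
space = product(range(1, 4), range(1, 4), range(1, 3),
                range(1, 4), range(1, 5), range(1, 3))
assert all(orc_monks1(*x) == target(*x) for x in space)
print("ORC list matches the target concept on all instances")
```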

•  The bottom line: you don't need to sacrifice accuracy to get interpretability.

•  Part 1: Humans can interpret the predictions, and understand the full algorithm

•  Part 2: Bayesian hierarchical modeling with rules

•  Part 3: Accurate rule classifiers using MIO

Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan) - COLT 2011

A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction (McCormick, R, Madigan) - Annals of Applied Statistics, forthcoming 2012

Ordered Rules for Classification: A Discrete Optimization Approach to Associative Classification (Bertsimas, Chang, R) - In progress

Outline

current work coming up

Association Rules / Associative Classification

Decision Trees

Decision Lists

Logical Analysis of Data (LAD)

Bayesian Analysis

ML algorithms that use rules as features

Current Work

•  Machine Learning for the NYC Power Grid
   – cover of IEEE Computer, spotlight issue for IEEE TPAMI in February, WIRED Science, Slashdot, US News & World Report...
•  Supervised Ranking, Equivalences between Ranking and Classification, Ranking with MIO
•  Reverse-Engineering Quality Rankings - in Businessweek last week
•  ML algorithms that understand how they will be used for a subsequent task
•  Several other projects

Thank  you!  
