Bayesian Networks with R and Hadoop

Ofer Mendelevitch, Hortonworks
Hadoop Summit, June 2014

© Hortonworks Inc. 2014


A bit about me

Ofer Mendelevitch
Director, Data Science @ Hortonworks
Previously: Nor1, Yahoo!, Risk Insight, Quiver
Personal blog: http://www.achessdad.com/


What I will cover today…

• What is a Bayesian Network?

• Why I think it’s cool

• Bayesian networks with R: the bnlearn package

• Bayesian network inference with R and Hadoop


Introduction to Bayesian Networks

(with examples using R)


Example: “Asia” Bayesian Network
Each node is a random variable: yes/no

[Figure: the “Asia” network. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]


Example: “Asia” Bayesian Network
Graph structure reflects “causal” relationships

[Figure: the “Asia” network graph]

Example: “Asia” Bayesian Network
Node CPT: P(node | parents)

CPT for SoB (shortness of breath):

Tub or Cancer   Bronchitis   SoB = T   SoB = F
T               T            0.70      0.30
F               T            0.40      0.60
T               F            0.45      0.55
F               F            0.05      0.95


What is a (discrete) Bayesian Network?
(also called Bayes Nets, Belief Nets, etc.)

• A network structure (DAG):
– Nodes => random variables, taking discrete values
– Edges => conditional dependencies
– E.g., lung cancer is statistically dependent on smoking

• A set of conditional probability tables (CPTs):
– Each node has a set of parents, determined by the graph
– CPT holds P(node | parent-A, parent-B, …) for each node
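
As a rough illustration of the CPT bullet, a table such as P(SoB | Bronchitis, Tub or Cancer) can be held as a plain R array with one dimension for the node and one per parent. The sketch below uses base R only and the SoB values from the example slide; the dimension names (`SoB`, `Bronchitis`, `TubOrCancer`) are illustrative choices, not names taken from any package.

```r
# CPT for SoB, using the values from the example slide.
# Dimension order: SoB (fastest), then Bronchitis, then TubOrCancer.
states <- c("T", "F")
cpt_sob <- array(
  c(0.70, 0.30,   # Bronchitis = T, TubOrCancer = T
    0.45, 0.55,   # Bronchitis = F, TubOrCancer = T
    0.40, 0.60,   # Bronchitis = T, TubOrCancer = F
    0.05, 0.95),  # Bronchitis = F, TubOrCancer = F
  dim = c(2, 2, 2),
  dimnames = list(SoB = states, Bronchitis = states, TubOrCancer = states)
)

# Look up P(SoB = T | Bronchitis = T, TubOrCancer = F).
p <- cpt_sob["T", "T", "F"]

# Each parent configuration must define a full distribution over SoB,
# so summing over the SoB dimension should give 1 everywhere.
col_sums <- apply(cpt_sob, c(2, 3), sum)
print(p)
print(col_sums)
```

The joint distribution of the whole network is then the product of one such table per node, each evaluated at that node's value and its parents' values.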


Why are Bayesian Networks cool?

• Intuitive/adaptive modeling tool:
– Graphs are natural for modeling relationships
– Easy to combine data-driven learning with expert know-how
– You can start small, and add knowledge as it is acquired

• “Naturally” addresses inference with missing values

• Inference can be applied to any variable/node
– As opposed to a single (target) variable in supervised learning


Bayesian networks have been successfully used for a variety of real-world applications

• Healthcare: medical diagnosis, genetic modeling
• Security: crime pattern analysis, terrorism risk management
• Education: student modeling
• Finance: credit rating, predicting defaults
• Tech support: troubleshooting for computers/printers

See “Bayesian Networks: A Practical Guide to Applications”, Pourret et al.


Bayesian networks with R

• http://cran.r-project.org/web/views/Bayesian.html

• We will focus on “bnlearn” (by Marco Scutari)
– Implements various structure learning algorithms (hc, tabu, gs, iamb, mmhc, rsmax2, etc.)
– Provides automated learning of CPTs
– Approximate inference: “logic sampling” and “likelihood weighting”
– Supports snow/parallel for some algorithms




Step 1: Constructing the graph





• Manually (expert knowledge)
• Automatically from data


Manual graph construction: Asia

> library(bnlearn)
> varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
> ag = empty.graph(varnames)
> # one row per arc, pointing from parent to child
> arcs(ag, ignore.cycles = TRUE) = data.frame(
+   "from" = c("Asia", "Smoking", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC"),
+   "to" = c("Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC", "SoB", "X-ray", "SoB"))
> graphviz.plot(ag)


Automated graph construction: Asia

> library(bnlearn)
> varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
> data(asia); names(asia) = varnames
> bg = hc(asia)  # hill-climbing structure search
> graphviz.plot(bg)


Automated learning does not always work perfectly…

For example:
• May not learn all the “expected” edges
• May learn an edge in the wrong direction

Therefore, in practice it helps to:
• Provide a whitelist and a blacklist to the algorithm
• Pre-seed with a manual network structure, and let the algorithm learn from there
• Use ensemble learning of structure (see boot.strength)
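
A minimal sketch of the whitelist/blacklist idea, assuming bnlearn is installed. It uses the `asia` data with its original column names (A, S, T, L, B, E, X, D) rather than the renamed ones from the earlier slide, and the particular arcs chosen for `wl` and `bl` are illustrative only.

```r
library(bnlearn)

data(asia)  # columns: A, S, T, L, B, E, X, D

# Arc that must be present (expert knowledge): Smoking -> Lung cancer.
wl <- data.frame(from = "S", to = "L")
# Arc that must never be learned: Dyspnoea (shortness of breath) -> Bronchitis.
bl <- data.frame(from = "D", to = "B")

# Hill-climbing search constrained by both lists.
bg <- hc(asia, whitelist = wl, blacklist = bl)
learned <- arcs(bg)  # character matrix with columns "from" and "to"

# Small helper (hypothetical name) to test whether an arc was learned.
has_arc <- function(a, from, to) any(a[, "from"] == from & a[, "to"] == to)
print(learned)
```

Pre-seeding works along the same lines through the `start` argument of `hc()`, which accepts an existing graph such as one built by hand.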


Step 2: Learning the CPT / probabilities





CPT for SoB (shortness of breath):

Tub or Cancer   Bronchitis   SoB = T   SoB = F
T               T            0.85      0.15
F               T            0.79      0.21
T               F            0.73      0.27
F               F            0.10      0.90


Learning CPT for each node in the graph

> fitted = bn.fit(ag, asia)
> print(fitted$SoB)

  Parameters of node SoB (multinomial distribution)

Conditional probability table:

, , Tub-or-LC = no

     Bronchitis
SoB           no        yes
  no  0.90017286 0.21373057
  yes 0.09982714 0.78626943

, , Tub-or-LC = yes

     Bronchitis
SoB           no        yes
  no  0.27737226 0.14592275
  yes 0.72262774 0.85407725
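
With the default maximum-likelihood estimator, each entry of a discrete CPT is a conditional relative frequency in the data. The base R sketch below reproduces that calculation for a single parent; the `toy` data frame is made up for illustration and is not the `asia` dataset.

```r
# Hypothetical toy data: 8 observations of a parent (Bronchitis) and a child (SoB).
toy <- data.frame(
  Bronchitis = c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
  SoB        = c("yes", "yes", "no",  "no", "no", "yes", "no", "yes")
)

# Count co-occurrences: rows are child states, columns are parent states.
counts <- table(SoB = toy$SoB, Bronchitis = toy$Bronchitis)

# Normalise each column so it sums to 1: P(SoB | Bronchitis).
cpt <- prop.table(counts, margin = 2)
print(cpt)
```

With more parents the table simply gains one dimension per parent, which is the layout of the printed SoB table.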


Using the BN for inference

• Given evidence: (1) visit to Asia, (2) SoB, (3) Bronchitis
• What is the likelihood of “lung cancer”?






Inferring with missing values

• We provide evidence (“yes” or “no” in this case) only for those nodes where we have such evidence

• If a value is “missing” it’s just not included in the evidence when doing inference…

This is in contrast to supervised learning, where ALL values are typically needed for inference.
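
To make the missing-value point concrete, here is a small base R sketch of inference by brute-force enumeration on a hypothetical three-node network (Smoking -> LC, Smoking -> Bronchitis). All probabilities are invented for illustration, and `posterior()` is a helper written for this sketch, not a bnlearn or gRain function. Nodes that are not named in the evidence are simply summed out.

```r
# Hypothetical CPTs. In the matrices, rows are the child's state and
# columns are the state of the parent (Smoking).
p_smoking <- c(yes = 0.5, no = 0.5)
p_lc   <- rbind(yes = c(yes = 0.10, no = 0.01),
                no  = c(yes = 0.90, no = 0.99))
p_bron <- rbind(yes = c(yes = 0.60, no = 0.30),
                no  = c(yes = 0.40, no = 0.70))

# Enumerate every joint configuration and its probability (product of CPT entries).
lv <- c("yes", "no")
joint <- expand.grid(Smoking = lv, LC = lv, Bronchitis = lv, stringsAsFactors = FALSE)
joint$p <- unname(p_smoking[joint$Smoking]) *
  p_lc[cbind(joint$LC, joint$Smoking)] *
  p_bron[cbind(joint$Bronchitis, joint$Smoking)]

# P(node = state | evidence); evidence is a named list and may omit any node.
posterior <- function(joint, node, state, evidence = list()) {
  keep <- rep(TRUE, nrow(joint))
  for (name in names(evidence)) keep <- keep & joint[[name]] == evidence[[name]]
  sum(joint$p[keep & joint[[node]] == state]) / sum(joint$p[keep])
}

# Full evidence vs. evidence with Smoking "missing".
p_full <- posterior(joint, "LC", "yes", list(Smoking = "yes", Bronchitis = "yes"))
p_part <- posterior(joint, "LC", "yes", list(Bronchitis = "yes"))
print(c(full = p_full, partial = p_part))
```

Enumeration is only feasible for tiny networks; junction-tree and sampling methods exist precisely because the joint table grows exponentially with the number of nodes.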


Exact Inference with gRain

• The gRain package implements exact inference for discrete Bayesian Networks using the “Junction Tree” belief propagation algorithm

• bnlearn and gRain work well together (as.grain converts a fitted bnlearn network)

> library(gRain)
> jtree = compile(as.grain(fitted))
> jp = setFinding(jtree, nodes = c("Asia", "SoB", "Bronchitis"), states = c("yes", "yes", "yes"))
> print(querygrain(jp, nodes = "LC")$LC)

LC
   no   yes
0.934 0.066


Approximate inference with bnlearn

bnlearn implements two approximate inference methods: logic sampling (a.k.a. rejection sampling) and likelihood weighting. Both are Monte Carlo estimates, so results vary slightly from run to run.

> # Infer P(SoB = yes | Asia = yes, Bronchitis = yes) using logic sampling
> p1 = cpquery(fitted, event = (SoB == "yes"), evidence = (Asia == "yes" & Bronchitis == "yes"), method = "ls")
> print(p1)
[1] 0.8014706

> # Infer P(SoB = yes | Asia = yes, Bronchitis = yes) using likelihood weighting
> evidence = list("yes", "yes")
> names(evidence) = c("Asia", "Bronchitis")
> p2 = cpquery(fitted, event = (SoB == "yes"), evidence = evidence, method = "lw")
> print(p2)
[1] 0.795404
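
The difference between the two methods can be sketched in base R on a hypothetical two-node network (Smoking -> Bronchitis), estimating P(Smoking = yes | Bronchitis = yes). The probabilities are invented, and this is a hand-rolled illustration of the two ideas, not bnlearn's implementation. For these numbers the exact answer is 0.5 * 0.6 / (0.5 * 0.6 + 0.5 * 0.3) = 2/3.

```r
set.seed(42)
n <- 20000

# Hypothetical parameters.
p_smoking_yes <- 0.5
p_bron_yes <- c(yes = 0.60, no = 0.30)  # P(Bronchitis = yes | Smoking)

# Both methods start by sampling the root node from its prior.
smoking <- ifelse(runif(n) < p_smoking_yes, "yes", "no")

# Logic (rejection) sampling: also sample the evidence node,
# then discard every sample that disagrees with the evidence.
bron <- ifelse(runif(n) < p_bron_yes[smoking], "yes", "no")
kept <- bron == "yes"
p_ls <- mean(smoking[kept] == "yes")

# Likelihood weighting: fix the evidence node to its observed value and
# weight each sample by the probability of that value given its parents.
w <- p_bron_yes[smoking]
p_lw <- sum(w[smoking == "yes"]) / sum(w)

print(c(logic_sampling = p_ls, likelihood_weighting = p_lw, kept_fraction = mean(kept)))
```

Logic sampling throws away every sample that contradicts the evidence, so the rarer the evidence, the fewer samples remain; likelihood weighting keeps all samples but with unequal weights.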


Large-scale Bayesian network inference with R and Hadoop


What is large?

• Number of nodes:
– 10s: Medium
– 100s: Large
– 1000s: Very large

• Number of instances:
– 100,000s to millions


Manually constructing large graphs is hard


Large scale learning in practice: manual + automated

• Define nodes
• Seed with some known edges, based on expert knowledge
• Augment with automated learning (e.g., hc, tabu, rsmax2, etc.)


Large scale inference: Exact or Approximate?

Exact (junction tree), gRain:
• Pros: fast inference time
• Cons: computational complexity determined (exponentially) by largest clique size

Approximate (LS, LW), bnlearn:
• Pros: can be used for any graph; not limited by “clique” size
• Cons: inference is often much slower; not accurate for rare events


About RHadoop/RMR

• An open source project, supported by Revolution Analytics

• Various sub-projects: RMR, RHDFS, RHBASE, plyrmr, etc.
• We will focus on RMR
– Implement mapper/reducer code using R

• RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
• Installing RMR on HDP:
– http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
– http://www.research.janahang.com/install-rhadoop-on-hortonworks-hdp-2-0/




Large scale inference with R and Hadoop

Infer with RMR

• Inference is embarrassingly parallel
• Hadoop determines the number of mappers based on file size
• So we’ll use reducers, whose number we can set, to parallelize cpquery


Example: Adult dataset

• Donated by Ronny Kohavi and Barry Becker, 1996: http://archive.ics.uci.edu/ml/datasets/Adult
• Extracted from 1994 census data
• 48842 instances, 14 features such as:
– Age, country, occupation, marital status, capital gain, etc.
– Goal: predict if income is >50K or not

…
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
…




Sample learned network structure for “adult”


Inference with RMR on adult dataset

library(rmr2)  # the RMR package from RHadoop

NUM_REDUCERS = 4
opt = rmr.options(backend = "hadoop",
                  backend.parameters = list(hadoop = list(
                    D = "mapreduce.reduce.memory.mb=1024",
                    D = paste0("mapreduce.job.reduces=", NUM_REDUCERS))))

inpFile = 'adult.test'
outFile = 'adult.out'

mapreduce(input = inpFile, input.format = "text",
          output = outFile, output.format = "csv",
          map = map_func, reduce = reduce_func)


Our mapper: passing on to reducer…

map_func <- function(., values) {
  out_klist = list(); out_vlist = list()
  for (v in values) {
    fvec = unlist(strsplit(v, ',', fixed = TRUE))  # read row and split into columns
    if (length(fvec) < 15) { next }                # skip rows not in the expected format
    key = floor(runif(1, 0, NUM_REDUCERS))         # random key in 0..NUM_REDUCERS-1 spreads rows across reducers
    out_klist = c(out_klist, key)
    out_vlist = c(out_vlist, v)
  }
  return (keyval(out_klist, out_vlist))
}
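
The effect of the random key can be simulated locally in base R, without Hadoop or RMR. The sketch below is only an analogy for the map, shuffle and reduce phases; the `rows` vector is a stand-in for input lines, and the "reduce" step just counts rows instead of running inference.

```r
set.seed(1)
NUM_REDUCERS <- 4
rows <- sprintf("row-%02d", 1:20)  # stand-ins for input lines

# "Map": give each row a random key in 0..NUM_REDUCERS-1, as the mapper does.
keys <- floor(runif(length(rows), 0, NUM_REDUCERS))

# "Shuffle": group rows by key, so each group would go to one reducer.
groups <- split(rows, keys)

# "Reduce": process each group independently (here we only count its rows).
counts <- sapply(groups, length)
print(counts)
```

Because keys are drawn uniformly, the groups are roughly but not exactly equal in size, which is enough to spread the inference work across reducers.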


Our reducer: where all the action happens

trim <- function (x) gsub("^\\s+|\\s+$", "", x)  # strip leading/trailing whitespace

reduce_func <- function(., values) {
  out_klist = list(); out_vlist = list()

  for (v in values) {
    increment.counter('bn-demo', 'row', 1)  # to let MR know we are still active

    fvec = sapply(strsplit(v, ',', fixed = TRUE), trim)  # read row and split into columns
    names(fvec) = c("age", "type_employer", "fnlwgt", "education", "education_num",
                    "marital", "occupation", "relationship", "race", "sex",
                    "capital_gain", "capital_loss", "hr_per_week", "country", "income")

    pv = dataprep(fvec)  # transform to "learned" features

    evidence = as.list(pv[1, setdiff(colnames(pv), 'income')])  # all features except the target
    prob = cpquery(fitted, event = (income == ">50K"), evidence = evidence, method = "lw")
    out_klist = c(out_klist, v)
    out_vlist = c(out_vlist, format(prob, digits = 2))
  }
  return (keyval(out_klist, out_vlist))
}
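
The row-parsing and evidence-building steps of the reducer can be tried on their own in base R. The sketch below skips `dataprep` (whose code is not shown here) and, as an assumption that is not part of the reducer above, drops fields marked "?" so that they are treated as missing evidence. The sample row follows the layout of the dataset rows shown earlier.

```r
trim <- function(x) gsub("^\\s+|\\s+$", "", x)  # strip leading/trailing whitespace

cols <- c("age", "type_employer", "fnlwgt", "education", "education_num",
          "marital", "occupation", "relationship", "race", "sex",
          "capital_gain", "capital_loss", "hr_per_week", "country", "income")

# One input row; "?" marks an unknown value in the Adult dataset.
row <- "59, ?, 291856, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K."

# Split into fields, trim whitespace, and attach column names.
fvec <- trim(unlist(strsplit(row, ",", fixed = TRUE)))
names(fvec) <- cols

# Assumption for this sketch: leave unknown fields out of the evidence.
observed <- fvec[fvec != "?"]

# Evidence = every observed feature except the target variable.
evidence <- as.list(observed[setdiff(names(observed), "income")])
str(evidence)
```

In the real job the evidence values would still need to pass through the same feature transformation used when the network was learned, otherwise they would not match the states of the fitted nodes.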


Example output: adult.out

26, Private, 191573, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K. ,0.37
52, Private, 203635, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K. ,0.14
36, Private, 68798, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K., 0.019
34, Private, 31752, HS-grad, 9, Divorced, Machine-op-inspct, Other-relative, White, Female, 0, 0, 40, ?, <=50K. ,0.14
59, ?, 291856, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K. ,0.074
26, Private, 135848, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 10, Guatemala, <=50K. ,0.03
50, Local-gov, 237356, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.89
56, Self-emp-not-inc, 140729, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K.,0.14
22, Private, 54560, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K., 0.21
45, Self-emp-inc, 88500, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K., 0.94


More information

• Detailed step-by-step guide and code used can be found on: https://github.com/ofermend/bayes-net-r-hadoop

• Download Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/

• Further reading/learning:
– http://www.bnlearn.com/
– PGM class on Coursera: https://www.coursera.org/course/pgm
– PGM Ebook from UCL: http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/250214.pdf
– Many others…


Thank you!

Any Questions?

Ofer Mendelevitch, [email protected], @ofermend

We’re hiring! www.hortonworks.com/careers
Hortonworks training: www.hortonworks.com/training
Hortonworks blog: www.hortonworks.com/blog
