exact inference in bayesian networks using mapreduce (hadoop summit 2010)

Exact Inference in Bayesian Networks using MapReduceAlex KozlovCloudera, Inc.

About Me

About Cloudera

Bayesian (Probabilistic) Networks

BN Inference 101

CPCS Network

Why BN Inference

Inference with MR

Results

Conclusions2

Session Agenda

Worked on BN Inference in 1995-1998 (for Ph.D.)› Published the fastest implementation at the time

Worked on DM/BI field since then

Recently joined Cloudera, Inc.› Started looking at how to solve world’s hardest problems

3

About Me

Founded in the summer 2008Cloudera helps organizations profit from all of their data. We deliver the

industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into their customers, partners, vendors and businesses.

Cloudera’s platform is built on the popular open source Apache Hadoopproject. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.

4

About Cloudera

1. Nodes

2. Edges

3. Probabilities

5

Bayesian Networks

Bayes, Thomas (1763)An essay towards solving a problem in the doctrine of chances, published posthumously by his friendPhilosophical Transactions of the Royal Society of London, 53:370-418

1. Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis)

2. Medicine

3. Document classification, information retrieval

4. Image processing

5. Data fusion

6. Gaming

7. Law

8. On-line advertising!

6

Applications

7

A Simple BN Network

Sprinkler

Rain

Wet Driveway

T F

0.2 0.8

T F

0.01 0.990.8 0.20.9 0.10.99 0.01

F, FF, TT, FT, T

T F

0.4 0.60.1 0.9

FT

Pr(Rain | Wet Driveway)

Pr(Sprinkler Broken | !Wet Driveway & !Rain)

Sprinkler, Rain

Rain

JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)

PAB =

SELECT A, B, SUM(PROB) FROM JPD GROUP BY A, B;

PB = SELECT B, SUM(PROB) FROM PAB GROUP BY A;

Pr(A|B) = Pr(A,B)/Pr(B) – Bayes’ rule

CPCS is 422 nodes, a table of at least 2422 rows!

9

BN Inference 101 (in Hive)

11

CPCS Networks

422 nodes

14 nodes describe diseases

33 risk factors

375 various findings related to diseases

12

CPCS Networks

Choose the right tool for the right job!

BN is an abstraction for reasoning and decision making

Easy to incorporate human insight and intuitions

Very general, no specific ‘label’ node

Easy to do ‘what-if’, strength of influence, value of information, analysis

Immune to Gaussian assumptions

It’s all just a joint probability distribution

13

Why Bayesian Network Inference?

Map & Reduces

14

A1B1A2B1A1B2A2B2

B1

B2

B1C1E1B1C1E2B1C2E1B1C2E2B2C1E1B2C1E2B2C2E1B2C2E2

C1D1C2D1C1D2C2D2

C1

C2

B1C1E1B1C1E2B1C2E1B1C2E2B2C1E1B2C1E2B2C2E1B2C2E2

BCE

∑ Pr(B1| A) x ∑ Pr(D| C1)

Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1)

Keys

Reduce

Map

Aggregation 1 (+)

Aggregation 2 (x)

for each clique in depth-first order:MAP:

Sum over the variables to get ‘clique message’ (requires state, custom partitioner and input format)

Emit factors for the next clique

REDUCE:Multiply the factors from all children

Include probabilities assigned to the clique

Form the new clique values

the MAP is done over all child cliques

15

MapReduce Implementation

o Topological parallelism: compute branches C2 and C4 in parallel

o Clique parallelism: divide computation of each clique into maps/reducers

o Fall back into optimal factoring if a corresponding subtree is small

o Combine multiple phases together

o Reduce replication level

16

Cliques, Trees, and Parallelism

C6

C5

C4

C1

C2

C3

Cliques may be larger than they appear!

CPCS:The 360-node subnet has the largest ‘clique’ of

11,739,896 floats (fits into 2GB)

The full 422-node version (absent, mild, moderate, severe)3,377,699,720,527,872 floats (or 12 PB of storage, but do not

need it for all queries)

In most cases do not need to do inference on the full network

17

CPCS Inference

1‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997

2Macbook Pro 4 GB DDR3 2.53 GHz310 node Linux Xeon cluster 24 GB quad 2-core

18

Results

Network Memory Time(19971)

MacbookPro (20102)

Hadoop(& future3)

Random(B)

10 MB 33 sec < 1 sec

Random (A)

254 MB 260 sec 10 sec

cpcs360 2 GB 640 sec 15 sec 1 mincpcs422 > 12 PB N/A N/A Minutes to hours for

most of the queries on most of the clusters

Exact probabilistic inference is finally in sight for the full 422 node CPCS network

Hadoop helps to solve the world’s hardest problems

What you should know after this talk

BN is a DAG and represents a joint probability distribution (JPD)

Can compute conditional probabilities by multiplying and summing JPD

For large networks, this may be PBytes of intermediate data, but it’s MR

19

Conclusions

Questions?

alexvk@{cloudera,gmail}.com

exact inference in bayesian networks using mapreduce (hadoop summit 2010)

Technology

bn inference inference

cpcs inference cpcs

bayesian network inference

exact inference

node cpcs network hadoop

bayesian networks

sec b random

clique message