exact inference in bayesian networks using mapreduce (hadoop summit 2010)

20
Exact Inference in Bayesian Networks using MapReduce Alex Kozlov Cloudera, Inc.

Upload: alex-kozlov

Post on 22-Jun-2015

3.972 views

Category:

Technology


2 download

DESCRIPTION

Compilation of some ideas over the last 15 years...

TRANSCRIPT

Page 1: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

Exact Inference in Bayesian Networks using MapReduceAlex KozlovCloudera, Inc.

Page 2: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

About Me

About Cloudera

Bayesian (Probabilistic) Networks

BN Inference 101

CPCS Network

Why BN Inference

Inference with MR

Results

Conclusions2

Session Agenda

Page 3: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

Worked on BN Inference in 1995-1998 (for Ph.D.)› Published the fastest implementation at the time

Worked on DM/BI field since then

Recently joined Cloudera, Inc.› Started looking at how to solve world’s hardest problems

3

About Me

Page 4: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

Founded in the summer 2008Cloudera helps organizations profit from all of their data. We deliver the

industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into their customers, partners, vendors and businesses.

Cloudera’s platform is built on the popular open source Apache Hadoopproject. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.

4

About Cloudera

Page 5: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

1. Nodes

2. Edges

3. Probabilities

5

Bayesian Networks

Bayes, Thomas (1763)An essay towards solving a problem in the doctrine of chances, published posthumously by his friendPhilosophical Transactions of the Royal Society of London, 53:370-418

Page 6: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

1. Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis)

2. Medicine

3. Document classification, information retrieval

4. Image processing

5. Data fusion

6. Gaming

7. Law

8. On-line advertising!

6

Applications

Page 7: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

7

A Simple BN Network

Sprinkler

Rain

Wet Driveway

T F

0.2 0.8

T F

0.01 0.990.8 0.20.9 0.10.99 0.01

F, FF, TT, FT, T

T F

0.4 0.60.1 0.9

FT

Pr(Rain | Wet Driveway)

Pr(Sprinkler Broken | !Wet Driveway & !Rain)

Sprinkler, Rain

Rain

Page 8: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

8

Asia Network

Pr(Tuberculosis | Visit to Asia)

Pr(Visit to Asia) Pr(Smoking)

Pr(Bronchitis | Smoking)

Pr(Lung Cancer | Smoking)

Pr(X-Ray | Lung Cancer or Tuberculosis) Pr(Dyspnea | CG )

Pr(C | BE )

Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)

Page 9: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)

PAB =

SELECT A, B, SUM(PROB) FROM JPD GROUP BY A, B;

PB = SELECT B, SUM(PROB) FROM PAB GROUP BY A;

Pr(A|B) = Pr(A,B)/Pr(B) – Bayes’ rule

CPCS is 422 nodes, a table of at least 2422 rows!

9

BN Inference 101 (in Hive)

Page 10: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

Pr(Visit to Asia)

10

Junction Tree

Pr(C | BE )

Pr(Lung Cancer | Dyspnea) =Pr(E|H)

Pr(Tuberculosis | Visit to Asia)

Pr(D| C)

Pr(F)

Pr(G | F )

Pr(E | F )

Pr(H | CG )

Page 11: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

11

CPCS Networks

422 nodes

14 nodes describe diseases

33 risk factors

375 various findings related to diseases

Page 12: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

12

CPCS Networks

Page 13: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

Choose the right tool for the right job!

BN is an abstraction for reasoning and decision making

Easy to incorporate human insight and intuitions

Very general, no specific ‘label’ node

Easy to do ‘what-if’, strength of influence, value of information, analysis

Immune to Gaussian assumptions

It’s all just a joint probability distribution

13

Why Bayesian Network Inference?

Page 14: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

Map & Reduces

14

A1B1A2B1A1B2A2B2

B1

B2

B1C1E1B1C1E2B1C2E1B1C2E2B2C1E1B2C1E2B2C2E1B2C2E2

C1D1C2D1C1D2C2D2

C1

C2

B1C1E1B1C1E2B1C2E1B1C2E2B2C1E1B2C1E2B2C2E1B2C2E2

BCE

∑ Pr(B1| A) x ∑ Pr(D| C1)

Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1)

Keys

Reduce

Map

Aggregation 1 (+)

Aggregation 2 (x)

Page 15: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

for each clique in depth-first order:MAP:

Sum over the variables to get ‘clique message’ (requires state, custom partitioner and input format)

Emit factors for the next clique

REDUCE:Multiply the factors from all children

Include probabilities assigned to the clique

Form the new clique values

the MAP is done over all child cliques

15

MapReduce Implementation

Page 16: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

o Topological parallelism: compute branches C2 and C4 in parallel

o Clique parallelism: divide computation of each clique into maps/reducers

o Fall back into optimal factoring if a corresponding subtree is small

o Combine multiple phases together

o Reduce replication level

16

Cliques, Trees, and Parallelism

C6

C5

C4

C1

C2

C3

Cliques may be larger than they appear!

Page 17: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

CPCS:The 360-node subnet has the largest ‘clique’ of

11,739,896 floats (fits into 2GB)

The full 422-node version (absent, mild, moderate, severe)3,377,699,720,527,872 floats (or 12 PB of storage, but do not

need it for all queries)

In most cases do not need to do inference on the full network

17

CPCS Inference

Page 18: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

1‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997

2Macbook Pro 4 GB DDR3 2.53 GHz310 node Linux Xeon cluster 24 GB quad 2-core

18

Results

Network Memory Time(19971)

MacbookPro (20102)

Hadoop(& future3)

Random(B)

10 MB 33 sec < 1 sec

Random (A)

254 MB 260 sec 10 sec

cpcs360 2 GB 640 sec 15 sec 1 mincpcs422 > 12 PB N/A N/A Minutes to hours for

most of the queries on most of the clusters

Page 19: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

Exact probabilistic inference is finally in sight for the full 422 node CPCS network

Hadoop helps to solve the world’s hardest problems

What you should know after this talk

BN is a DAG and represents a joint probability distribution (JPD)

Can compute conditional probabilities by multiplying and summing JPD

For large networks, this may be PBytes of intermediate data, but it’s MR

19

Conclusions

Page 20: Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

Questions?

alexvk@{cloudera,gmail}.com