exact inference in bayesian networks using mapreduce (hadoop summit 2010)
DESCRIPTION
Compilation of some ideas over the last 15 years...TRANSCRIPT
Exact Inference in Bayesian Networks using MapReduceAlex KozlovCloudera, Inc.
About Me
About Cloudera
Bayesian (Probabilistic) Networks
BN Inference 101
CPCS Network
Why BN Inference
Inference with MR
Results
Conclusions2
Session Agenda
Worked on BN Inference in 1995-1998 (for Ph.D.)› Published the fastest implementation at the time
Worked on DM/BI field since then
Recently joined Cloudera, Inc.› Started looking at how to solve world’s hardest problems
3
About Me
Founded in the summer 2008Cloudera helps organizations profit from all of their data. We deliver the
industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into their customers, partners, vendors and businesses.
Cloudera’s platform is built on the popular open source Apache Hadoopproject. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.
4
About Cloudera
1. Nodes
2. Edges
3. Probabilities
5
Bayesian Networks
Bayes, Thomas (1763)An essay towards solving a problem in the doctrine of chances, published posthumously by his friendPhilosophical Transactions of the Royal Society of London, 53:370-418
1. Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis)
2. Medicine
3. Document classification, information retrieval
4. Image processing
5. Data fusion
6. Gaming
7. Law
8. On-line advertising!
6
Applications
7
A Simple BN Network
Sprinkler
Rain
Wet Driveway
T F
0.2 0.8
T F
0.01 0.990.8 0.20.9 0.10.99 0.01
F, FF, TT, FT, T
T F
0.4 0.60.1 0.9
FT
Pr(Rain | Wet Driveway)
Pr(Sprinkler Broken | !Wet Driveway & !Rain)
Sprinkler, Rain
Rain
8
Asia Network
Pr(Tuberculosis | Visit to Asia)
Pr(Visit to Asia) Pr(Smoking)
Pr(Bronchitis | Smoking)
Pr(Lung Cancer | Smoking)
Pr(X-Ray | Lung Cancer or Tuberculosis) Pr(Dyspnea | CG )
Pr(C | BE )
Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)
PAB =
SELECT A, B, SUM(PROB) FROM JPD GROUP BY A, B;
PB = SELECT B, SUM(PROB) FROM PAB GROUP BY A;
Pr(A|B) = Pr(A,B)/Pr(B) – Bayes’ rule
CPCS is 422 nodes, a table of at least 2422 rows!
9
BN Inference 101 (in Hive)
Pr(Visit to Asia)
10
Junction Tree
Pr(C | BE )
Pr(Lung Cancer | Dyspnea) =Pr(E|H)
Pr(Tuberculosis | Visit to Asia)
Pr(D| C)
Pr(F)
Pr(G | F )
Pr(E | F )
Pr(H | CG )
11
CPCS Networks
422 nodes
14 nodes describe diseases
33 risk factors
375 various findings related to diseases
12
CPCS Networks
Choose the right tool for the right job!
BN is an abstraction for reasoning and decision making
Easy to incorporate human insight and intuitions
Very general, no specific ‘label’ node
Easy to do ‘what-if’, strength of influence, value of information, analysis
Immune to Gaussian assumptions
It’s all just a joint probability distribution
13
Why Bayesian Network Inference?
Map & Reduces
14
A1B1A2B1A1B2A2B2
B1
B2
B1C1E1B1C1E2B1C2E1B1C2E2B2C1E1B2C1E2B2C2E1B2C2E2
C1D1C2D1C1D2C2D2
C1
C2
B1C1E1B1C1E2B1C2E1B1C2E2B2C1E1B2C1E2B2C2E1B2C2E2
BCE
∑ Pr(B1| A) x ∑ Pr(D| C1)
Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1)
Keys
Reduce
Map
Aggregation 1 (+)
Aggregation 2 (x)
for each clique in depth-first order:MAP:
Sum over the variables to get ‘clique message’ (requires state, custom partitioner and input format)
Emit factors for the next clique
REDUCE:Multiply the factors from all children
Include probabilities assigned to the clique
Form the new clique values
the MAP is done over all child cliques
15
MapReduce Implementation
o Topological parallelism: compute branches C2 and C4 in parallel
o Clique parallelism: divide computation of each clique into maps/reducers
o Fall back into optimal factoring if a corresponding subtree is small
o Combine multiple phases together
o Reduce replication level
16
Cliques, Trees, and Parallelism
C6
C5
C4
C1
C2
C3
Cliques may be larger than they appear!
CPCS:The 360-node subnet has the largest ‘clique’ of
11,739,896 floats (fits into 2GB)
The full 422-node version (absent, mild, moderate, severe)3,377,699,720,527,872 floats (or 12 PB of storage, but do not
need it for all queries)
In most cases do not need to do inference on the full network
17
CPCS Inference
1‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997
2Macbook Pro 4 GB DDR3 2.53 GHz310 node Linux Xeon cluster 24 GB quad 2-core
18
Results
Network Memory Time(19971)
MacbookPro (20102)
Hadoop(& future3)
Random(B)
10 MB 33 sec < 1 sec
Random (A)
254 MB 260 sec 10 sec
cpcs360 2 GB 640 sec 15 sec 1 mincpcs422 > 12 PB N/A N/A Minutes to hours for
most of the queries on most of the clusters
Exact probabilistic inference is finally in sight for the full 422 node CPCS network
Hadoop helps to solve the world’s hardest problems
What you should know after this talk
BN is a DAG and represents a joint probability distribution (JPD)
Can compute conditional probabilities by multiplying and summing JPD
For large networks, this may be PBytes of intermediate data, but it’s MR
19
Conclusions
Questions?
alexvk@{cloudera,gmail}.com