Reasoning Under Uncertainty

DESCRIPTION
Reasoning Under Uncertainty. Radu Marinescu, 4C @ University College Cork. Why uncertainty? Uncertainty in medical diagnosis: diseases produce symptoms; in diagnosis, observed symptoms => disease ID. Uncertainties: symptoms may not occur; symptoms may not be reported.

TRANSCRIPT
Reasoning Under Uncertainty
Radu Marinescu4C @ University College Cork
Why uncertainty?
• Uncertainty in medical diagnosis
– Diseases produce symptoms
– In diagnosis, observed symptoms => disease ID
– Uncertainties:
• Symptoms may not occur
• Symptoms may not be reported
• Diagnostic tests are not perfect (false positives, false negatives)
• How do we estimate confidence? P(disease | symptoms, tests) = ?
Why uncertainty?
• Uncertainty in medical decision-making
– Physicians and patients must decide on treatments
– Treatments may not be successful
– Treatments may have unpleasant side effects
• Choosing treatments: weigh risks of adverse outcomes
• People are BAD at reasoning intuitively about probabilities; provide systematic analysis
Outline• Probabilistic modeling with joint distributions• Conditional independence and factorization• Belief (or Bayesian) networks
Example networks and software• Inference in belief networks
Exact inference• Variable elimination, join-tree clustering, AND/OR search
Approximate inference• Mini-clustering, belief propagation, sampling
Bibliography• Judea Pearl. “Probabilistic reasoning in intelligent systems”, 1988
• Stuart Russell & Peter Norvig. “Artificial Intelligence. A Modern Approach”, 2002 (Ch 13-17)
• Kevin Murphy. "A Brief Introduction to Graphical Models and Bayesian Networks"http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
• Rina Dechter. "Bucket Elimination: A Unifying Framework for Probabilistic Inference"http://www.ics.uci.edu/~csp/R48a.ps
• Rina Dechter. "Mini-Buckets: A General Scheme for Approximating Inference"http://www.ics.uci.edu/~csp/r62a.pdf
• Rina Dechter & Robert Mateescu. "AND/OR Search Spaces for Graphical Models".http://www.ics.uci.edu/~csp/r126.pdf
Reasoning under uncertainty• A problem domain is modeled by a list of (discrete)
random variables: X1, X2, …, Xn
• Knowledge about the problem is represented by a joint probability distribution: P(X1, X2, …, Xn)
Example• Alarm (Pearl88)
Story: In Los Angeles, burglary and earthquake are common. They both can trigger an alarm. In case of alarm, two neighbors John and Mary may call 911
Problem: estimate the probability of a burglary based on who has or has not called
Variables: • Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)
Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M)
Joint probability distributionDefines probabilities for all possible value assignments to the variables in the set
Inference with joint probability distribution
• What is the probability of burglary given that Mary called, P(B=y | M=y)?
• Compute the marginal probability:
P(B, M) = Σ_{E,A,J} P(B, E, A, J, M)

B M P(B,M)
y y 0.000115
y n 0.000075
n y 0.00015
n n 0.99971

• Compute the answer (reasoning by conditioning):
P(B=y | M=y) = P(B=y, M=y) / P(M=y) = 0.000115 / (0.000115 + 0.00015) ≈ 0.43
Advantages
• Probability theory is well-established and well understood
• In theory, one can perform arbitrary inference among the variables given a joint probability, because the joint contains information about all aspects of the relationships among the variables
– Diagnostic inference: from effects to causes, e.g., P(B=y | M=y)
– Predictive inference: from causes to effects, e.g., P(M=y | B=y)
– Combining evidence: P(B=y | J=y, M=y, E=n)
• All inference is sanctioned by probability theory and hence has clear semantics
Difficulty: complexity in model construction and inference
• In the Alarm example: 32 numbers (parameters) needed
– Quite unnatural to assess, e.g., P(B=y, E=y, A=y, J=y, M=y)
– Computing P(B=y | M=y) takes 29 additions (see the enumeration sketch below)
• In general, P(X1, X2, …, Xn) needs at least 2^n numbers to specify the joint probability distribution
– Knowledge acquisition difficult (complex, unnatural)
– Exponential storage and inference
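To make the enumeration cost concrete, here is a minimal Python sketch (not from the slides) of reasoning by conditioning over a full joint table; the dictionary `joint`, keyed by (B,E,A,J,M) value tuples, is an assumed input.

from itertools import product

def enumerate_query(joint, query_var, query_val, evidence):
    """Return P(query_var = query_val | evidence) by summing joint entries."""
    names = ['B', 'E', 'A', 'J', 'M']
    num = den = 0.0
    for assignment in product('yn', repeat=5):
        row = dict(zip(names, assignment))
        if any(row[v] != val for v, val in evidence.items()):
            continue                       # inconsistent with the evidence
        p = joint[assignment]
        den += p                           # accumulates P(evidence)
        if row[query_var] == query_val:
            num += p                       # accumulates P(query, evidence)
    return num / den

# e.g., enumerate_query(joint, 'B', 'y', {'M': 'y'}) -> about 0.43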
Outline• Probabilistic modeling with joint distributions• Conditional independence and factorization• Belief networks
Example networks and software• Inference in belief networks
Exact inference Approximate inference
• Miscellaneous Mixed networks, influence diagrams, etc.
Chain rule and factorization
• Overcome the problem of exponential size by exploiting conditional independencies
• The chain rule of probability:
P(X1, X2) = P(X1) P(X2|X1)
P(X1, X2, X3) = P(X1) P(X2|X1) P(X3|X1,X2)
…
P(X1, X2, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)
• No gains yet: the number of parameters required by the factors is still O(2^n)
Conditional independence• A random variable X is conditionally
independent of a set of random variables Y given a set of random variables Z if P(X | Y, Z) = P(X | Z)
• Intuitively: Y tells us nothing more about X than we know by
knowing Z As far as X is concerned, we can ignore Y if we
know Z
Conditional independence
• About P(Xi | X1, …, Xi-1):
Domain knowledge usually allows one to identify a subset pa(Xi) ⊆ {X1, …, Xi-1} such that
• Given pa(Xi), Xi is independent of all variables in {X1, …, Xi-1} \ pa(Xi), i.e.
P(Xi | X1, …, Xi-1) = P(Xi | pa(Xi))
• Then
• Joint distribution factorized!• The number of parameters might have been substantially
reduced
P(X1, X2, …, Xn) = ∏_{i=1}^{n} P(Xi | pa(Xi))
Example continued
• pa(B) = {}, pa(E) = {}, pa(A) = {B,E}, pa(J) = {A}, pa(M) = {A}• Conditional probability tables (CPT)
P(B, E, A, J, M) = P(B) P(E|B) P(A|B,E) P(J|B,E,A) P(M|B,E,A,J)
                 = P(B) P(E) P(A|B,E) P(J|A) P(M|A)
B P(B)
Y .01
N .99
E P(E)
Y .02
N .98
M A P(M|A)
y y .9
n y .1
y n .05
n n .95

J A P(J|A)
y y .7
n y .3
y n .01
n n .99
A B E P(A|B,E)
Y Y Y .95
N Y Y .05
Y Y N .94
N Y N .06
Y N Y .29
N N Y .71
Y N N .001
N N N .999
Example continued• Model size reduced from 32 to 2+2+4+4+8=20• Model construction easier
Fewer parameters to assess Parameter more natural to assess
• e.g., P(B=y), P(J=y | A=y), P(A=y | B=y, E=y), etc.
• Inference easier. Will see this later.
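A sketch of the factored model, using the CPT values from these slides; with binary variables the five tables hold 2+2+4+4+8 = 20 entries as counted above (here each table stores the "=y" probability and derives the complement).

# The Alarm network as five CPTs (values from the slides above).
P_B = {'y': .01, 'n': .99}                     # P(B)
P_E = {'y': .02, 'n': .98}                     # P(E)
P_Ay = {('y','y'): .95, ('y','n'): .94,        # P(A=y | B, E)
        ('n','y'): .29, ('n','n'): .001}
P_Jy = {'y': .7, 'n': .01}                     # P(J=y | A)
P_My = {'y': .9, 'n': .05}                     # P(M=y | A)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)."""
    pa = P_Ay[(b, e)] if a == 'y' else 1 - P_Ay[(b, e)]
    pj = P_Jy[a] if j == 'y' else 1 - P_Jy[a]
    pm = P_My[a] if m == 'y' else 1 - P_My[a]
    return P_B[b] * P_E[e] * pa * pj * pm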
Outline• Probabilistic modeling with joint distributions• Conditional Independence and factorization• Belief networks
Example networks and software• Inference in belief networks
Exact inference Approximate inference
From factorization to belief networks• Graphically represent the conditional independency
relationships: Construct a directed graph by drawing an arc from Xj to Xi iff Xj
pa(Xi)
Also attach the CPT P(Xi | pa(Xi)) to node Xi
B E
A
J M
P(B) P(E)
P(A|B,E)
P(J|A) P(M|A)
Formal definition• A belief network is:
A directed acyclic graph (DAG), where:• Each node represents a random variable• And is associated with the conditional probability of the node given
its parents Represents the joint probability distribution:
A variable is conditionally independent of its non-descendants given its parents
P(X1, X2, …, Xn) = ∏_{i=1}^{n} P(Xi | pa(Xi))
Independences in belief networks• 3 basic independence structures
Burglary
Alarm
JohnCalls
1: chain
Burglary
Alarm
Earthquake
2: common descendants
MaryCalls
Alarm
JohnCalls
3: common ancestors
Independences in belief networks
Burglary
Alarm
JohnCalls
1. JohnCalls is independent of Burglary given Alarm
P(J | A, B) = P(J | A)
P(J, B | A) = P(J | A) P(B | A)
Independences in belief networks
Burglary
Alarm
Earthquake
2. Burglary is independent of Earthquake not knowing Alarm.Burglary and Earthquake become dependent given Alarm!!
P(B, E) = P(B) P(E)
P(B, E | A) ≠ P(B | A) P(E | A)
Independences in belief networks
MaryCalls
Alarm
JohnCalls
3. MaryCalls is independent of JohnCalls given Alarm.
P(J | A, M) = P(J | A)
P(J, M | A) = P(J | A) P(M | A)
Independences in belief networks• BN models many conditional independence relations relating distant
variables and sets, which are defined in terms of the graphical criterion called d-separation
• d-separation = conditional independence Let X, Y and Z be three sets of nodes If X and Y are d-separated by Z, then X and Y are conditionally independent given
Z: P(X|Y, Z) = P(X|Z)
• d-separation in the graph: X is d-separated from Y given Z if every undirected path between them is blocked
• Path blocking 3 cases that expand on three basic independence structures
Undirected path blocking
A path is blocked by C if it contains:
• a “linear” substructure X → Z → Y with Z in C
• a “wedge” substructure X ← Z → Y (common ancestor) with Z in C
• a “vee” substructure X → Z ← Y (common descendant) with neither Z nor any of its descendants in C
Example
[Figure: DAG with arcs 1 → 2, 1 → 3, 2 → 4, 3 → 4, 4 → 5]
X = {2} and Y = {3} are d-separated by Z = {1}
• path 2 ← 1 → 3 is blocked by 1 ∈ Z
• path 2 → 4 ← 3 is blocked because 4 and all its descendants are outside Z
so P(X, Y | Z) = P(X | Z) P(Y | Z)
X = {2} and Y = {3} are not d-separated by Z = {1, 5}
• path 2 ← 1 → 3 is blocked by 1 ∈ Z
• path 2 → 4 ← 3 is activated because 5 (which is a descendant of 4) is in Z
• learning the value of consequence 5 renders 5’s causes 2 and 3 dependent
so P(X, Y | Z) ≠ P(X | Z) P(Y | Z)
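A sketch of a d-separation test via the classical ancestral-moral-graph criterion, which is equivalent to the path-blocking rules above: X and Y are d-separated by Z iff they are disconnected in the moralized ancestral graph of X ∪ Y ∪ Z after removing Z. Here `dag` maps each node to its list of parents — an assumed, minimal representation.

def ancestors(dag, nodes):
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, X, Y, Z):
    keep = ancestors(dag, X | Y | Z)          # ancestral subgraph
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in dag.get(v, []) if p in keep]
        for p in ps:                          # keep parent-child edges
            adj[v].add(p)
            adj[p].add(v)
        for p in ps:                          # moralize: marry co-parents
            for q in ps:
                if p != q:
                    adj[p].add(q)
    reach, stack = set(X), list(X)            # reachability avoiding Z
    while stack:
        for w in adj[stack.pop()] - Z:
            if w not in reach:
                reach.add(w)
                stack.append(w)
    return reach.isdisjoint(Y)

On the example above, with dag = {2: [1], 3: [1], 4: [2, 3], 5: [4]}, d_separated(dag, {2}, {3}, {1}) is True, while d_separated(dag, {2}, {3}, {1, 5}) is False because moralization marries 2 and 3, the co-parents of node 4.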
I-mapness• Given a probability distribution P on a set of
variables {X1, …, Xn}, a belief network B representing P is a minimal I-map (Pearl88) I-mapness: every d-separation condition displayed
in B corresponds to a valid conditional independence relationship in P
Minimal: none of the arrows in B can be deleted without destroying its I-mapness
Full joint distribution in BN
[Figure: the Alarm network B → A ← E, A → J, A → M]
Rewrite the full joint probability using the product rule:
P(B,E,A,J,M) = P(J|B,E,A,M) P(B,E,A,M)
            = P(J|A) P(M|B,E,A) P(B,E,A)
            = P(J|A) P(M|A) P(A|B,E) P(B,E)
            = P(J|A) P(M|A) P(A|B,E) P(B) P(E)
Example network
[Figure: the “alarm” network, a DAG over 37 variables including PCWP, CO, HRBP, HREKG, HRSAT, ERRCAUTER, HRHISTORY, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, VENTTUBE, DISCONNECT, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINVOL, PVSAT, FIO2, PRESS, INSUFFANESTH, TPR, LVFAILURE, ERRLOWOUTPUT, STROKEVOLUME, LVEDVOLUME, HYPOVOLEMIA, CVP, BP]
The “alarm” network: Monitoring Intensive-Care Patients
37 variables, 509 parameters (instead of 2^37)
Software• GeNIe (University of Pittsburgh) - free
http://genie.sis.pitt.edu• SamIam (UCLA) - free
http://reasoning.cs.ucla.edu/SamIam/• Hugin - commercial
http://www.hugin.com• Netica - commercial
http://www.norsys.com• UCI Lab – free but no GUI
http://graphmod.ics.uci.edu/
GeNIe screenshot
Applications• Belief networks are used in:
Genetic linkage analysis Speech recognition Medical diagnosis Probabilistic error correcting coding Monitoring and diagnosis in distributed systems Troubleshooting (Microsoft) …
Outline• Probabilistic modeling with joint distributions• Conditional independence and factorization• Belief networks• Inference in belief networks
Exact inference Approximate inference
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search (tree, graph)
Belief updating
Smoking
BronchitisLung cancer
X-ray Dyspnoea
P(Lung cancer = yes | Smoking = no, Dyspnoea = yes) ?
Probabilistic inference tasks• Belief updating
• Maximum probable explanation (MPE)
• Maximum a posteriori hypothesis (MAP)
BEL(Xi) = P(Xi = xi | evidence)
x* = argmax_x P(x, e)
(a1*, …, ak*) = argmax_a Σ_{x ∈ X\A} P(x, e)
Belief updating: P(X|evidence) = ?
A
B C
ED
P(A)
P(C|A)P(B|A)
P(E|B,C)
P(D|A,B)
P(A|E=0) ∝ P(A, E=0) =
∑E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) =
P(A) ∑E=0 ∑D ∑C P(C|A) ∑B P(B|A) P(D|A,B) P(E|B,C)
The innermost sum defines λB(A,D,C,E) — this is Variable Elimination
Bucket elimination
A
B C
ED
P(A)
P(B|A)
P(E|B,C)
P(D|A,B)
A
B C
ED
Moralize (“marry parents”)
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
Ordering: A, E, D, C, B
P(C|A)
The bucket operation
ELIMINATION: multiply (*) and sum (∑)
bucket(B): { P(E|B,C), P(D|A,B), P(B|A) }
λB(A,C,D,E) = ∑B P(B|A)*P(D|A,B)*P(E|B,C)
OBSERVED BUCKET:
bucket(B): { P(E|B,C), P(D|A,B), P(B|A), B=1 }
λB(A) = P(B=1|A) λB(A,D) = P(D|A,B=1)
λB(E,C) = P(E|B=1,C)
[Tables: multiplying two functions; summing out a variable]
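A sketch of the two bucket operations, assuming a factor is a pair (scope, table) with scope a tuple of variable names and table a dict from value tuples to numbers:

from itertools import product

def combine(factors, domains):
    """Multiply a set of factors into one factor over the union of scopes."""
    scope = tuple(sorted({v for s, _ in factors for v in s}))
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        asg = dict(zip(scope, vals))
        p = 1.0
        for s, t in factors:
            p *= t[tuple(asg[v] for v in s)]
        table[vals] = p
    return scope, table

def sum_out(factor, var):
    """Eliminate `var` by summation, producing a smaller factor."""
    scope, table = factor
    i = scope.index(var)
    out = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i+1:]
        out[key] = out.get(key, 0.0) + p
    return scope[:i] + scope[i+1:], out

# e.g., sum_out(combine([pE, pD, pB], domains), 'B') yields λB(A,C,D,E),
# assuming pE, pD, pB hold P(E|B,C), P(D|A,B), P(B|A) in this representation.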
Bucket elimination
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
∑∏ Elimination operator
λB(A,D,C,E)
λC(A,D,E)
λD(A,E)
λE(A)
P(A,E=0)
B
C
D
E
A
w* = 4“induced width”(max clique size)
Induced graph
B
C
D
E
A
A
B C
ED
P(A)
P(B|A)
P(E|B,C)
P(D|A,B)
P(C|A)
Induced width of the ordering, w*(d): the maximum width over all nodes in the induced graph along d
A
B C
ED
Complexity of elimination: O(n ∙ exp(w*(d)))
w*(d) – induced width of the moral graph along ordering d
A
B C
ED
“Moral” graph
B
C
D
E
A
w*(d1) = 4
E
D
C
B
A
w*(d2) = 2
Finding small induced-width orderings
• NP-complete
• A tree has induced width 1
• Greedy algorithms (a min-fill sketch follows):
– Min-width
– Min induced-width
– Max-cardinality
– Min-fill (often considered the best)
– Anytime min-width (via Branch-and-Bound)
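A sketch of the greedy min-fill heuristic; `adj` is an undirected adjacency dict of the moral graph. The returned order lists variables in the sequence they are eliminated, and the width is the induced width along that elimination sequence.

def min_fill_ordering(adj):
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order, width = [], 0
    while adj:
        def fill(v):  # number of fill edges eliminating v would add
            ns = list(adj[v])
            return sum(1 for i, a in enumerate(ns) for b in ns[i+1:]
                       if b not in adj[a])
        v = min(adj, key=fill)
        width = max(width, len(adj[v]))           # width of v at elimination
        ns = list(adj[v])
        for i, a in enumerate(ns):                # connect v's neighbors
            for b in ns[i+1:]:
                adj[a].add(b)
                adj[b].add(a)
        for a in ns:
            adj[a].discard(v)
        del adj[v]
        order.append(v)
    return order, width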
MPE: Most Probable Explanation
Smoking
BronchitisLung Cancer
X-ray Dyspnoea
(0, b’, c’, x’, 1) = argmax_{B,C,X} P(S=0, B, C, X, D=1)
P(S=0, B, C, X, D=1) = P(S=0) P(C|S=0) P(B|S=0) P(X|C) P(D=1|C,B)
Applications• Probabilistic decoding
A stream of bits is transmitted across a noisy channel and the problem is to recover the transmitted stream given the observed output and parity check bits
[Figure: a coding network. Top layer: transmitted bits x0 … x4 and parity check bits u0 … u4; bottom layer: received bits y0x … y4x and received parity check bits y0u … y4u (all observed)]
Applications• Medical diagnosis
Given some observed symptoms, determine the most likely subset of diseases that may explain the symptoms
[Figure: a two-layer diagnosis network with diseases Disease1 … Disease7 as parents of symptoms Symptom1 … Symptom6]
Applications• Genetic linkage analysis
Given the genotype information of a pedigree, infer the maximum likelihood haplotype configuration (maternal and paternal) of the unobserved individuals
[Figure: a pedigree with genotyped parents 1 and 2 and child 3, their haplotypes, and the corresponding belief network fragment over two loci with haplotype variables L, genotype variables X, and selector variables S]
(Fishelson & Geiger, 2002)
Bucket elimination for MPE
A
B C
ED
P(A)
P(C|A)P(B|A)
P(E|B,C)
P(D|A,B)
MPE =
maxA,E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) =
maxA P(A) maxE=0 maxD maxC P(C|A) maxB P(B|A) P(D|A,B) P(E|B,C)
The innermost maximization defines λB(A,D,C,E) — Variable Elimination
Max out a variable
A B C f(A,B,C)
T T T 0.03
T T F 0.07
T F T 0.54
T F F 0.36
F T T 0.06
F T F 0.14
F F T 0.48
F F F 0.32
A C f(A,C)
T T 0.54
T F 0.36
F T 0.48
F F 0.32
max out B
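The elimination operator for MPE mirrors `sum_out` from the earlier sketch, with max in place of addition; applied to the f(A,B,C) table above with var='B' it reproduces the f(A,C) table shown.

def max_out(factor, var):
    """Eliminate `var` by maximization (same factor representation as before)."""
    scope, table = factor
    i = scope.index(var)
    out = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i+1:]
        out[key] = max(out.get(key, 0.0), p)
    return scope[:i] + scope[i+1:], out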
Bucket elimination
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
max∏ Elimination/combination operators
λB(A,D,C,E)
λC(A,D,E)
λD(A,E)
λE(A)
MPE value
B
C
D
E
A
w* = 4“induced width”(max clique size)
bucket widths (top to bottom): 4, 3, 1, 1, 0
Generating the MPE tupleBucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
λB(A,D,C,E)
λC(A,D,E)
λD(A,E)
λE(A)
a’ = argmax_A P(A) ∙ λE(A)
e’ = 0
d’ = argmax_D λC(a’, D, e’)
c’ = argmax_C P(C|a’) ∙ λB(a’, d’, C, e’)
b’ = argmax_B P(e’|B, c’) ∙ P(d’|a’, B) ∙ P(B|a’)
Return (a’, b’, c’, d’, e’)
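A sketch of this backward pass, assuming `buckets[var]` holds the factors (original CPTs plus λ messages) placed in var's bucket during the forward pass, in the same (scope, table) representation as the earlier sketches:

def recover_mpe(order, buckets, domains, evidence):
    """order = bucket-processing ordering (A, E, D, C, B in the slides)."""
    assignment = dict(evidence)
    for var in order:
        if var in assignment:              # evidence variables stay clamped
            continue
        best_val, best_p = None, -1.0
        for val in domains[var]:
            assignment[var] = val
            p = 1.0
            for scope, table in buckets[var]:
                p *= table[tuple(assignment[v] for v in scope)]
            if p > best_p:                 # argmax over this bucket's product
                best_val, best_p = val, p
        assignment[var] = best_val
    return assignment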
Complexity of elimination: O(n ∙ exp(w*(d)))
w*(d) – induced width of the moral graph along ordering d
A
B C
ED
“Moral” graph
B
C
D
E
A
w*(d1) = 4
E
D
C
B
A
w*(d2) = 2
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search (tree, graph)
From BE to Bucket-Tree elimination• Motivation
BE computes P(evidence) or P(X|evidence), where X is the last variable in the ordering
What if we need all marginal probabilities P(Xi|evidence), where Xi ∈ {X1, X2, …, Xn}?
• Run BE n times with Xi being the last variable• Inefficient! – induced width may vary significantly from
one ordering to another• SOLUTION: Bucket-Tree Elimination (BTE)
Bucket-Tree eliminationA
B C
ED
P(A)
P(B|A)
P(E|B,C)
P(D|A,B)
Bucket E:
Bucket D:
Bucket C:
Bucket B:
Bucket A:
P(E|B,C)
P(D|A,B)
P(B|A)
P(A)
P(C|A) λE(B,C)
λD(A,B) λC(A,B)
λB(A)
P(E|B,C)
P(D|A,B)
P(C|A)
P(B|A)
P(A)
E
D
C
B
A
λE(B,C)
λD(A,B)λC(A,B)
λB(A)
• Variable elimination can be viewed as message passing (elimination) using a bucket tree
• Any node (bucket) can be the root
• Complexity: time and space exponential in the induced width
P(C|A)
Bucket-Tree (more formal)• Bucket Tree
A bucket tree has each bucket Bi as a node and there is an arc from Bi to Bj if the function created at Bi was placed in Bj
• Graph-based definition Let Gd be the induced graph along d. Each variable
X and its earlier neighbors is a node BX. There is an arc from BX to BY if Y is the closest parent to X.
Bucket-Tree
A
B C
ED
P(A)
P(B|A)
P(E|B,C)
P(D|A,B)
Belief network
E
D
C
B
A
Induced graph
E,B,C
A,B,D
A,B,C
B,A
A
E
D
C
B
A
λE(B,C)
λD(A,B)λC(A,B)
λB(A)
Bucket tree
P(C|A)
Bucket-Tree propagation
[Figure: bucket-tree node u with children x1 … xn (sending h(x1,u) … h(xn,u)) and parent v]
bucket(u) = ψ(u) ∪ { h(x1,u), h(x2,u), …, h(xn,u), h(v,u) }
Compute the message from u to v:
h(u,v) = Σ_{elim(u,v)} ∏ { f : f ∈ bucket(u), f ≠ h(v,u) }
where elim(u,v) = vars(u) − vars(v)
Upward messages in the bucket-tree
[Figure: the bucket tree with clusters {E,B,C}, {A,B,D}, {A,B,C}, {B,A}, {A}; upward messages λE(B,C), λD(A,B), λC(A,B), λB(A) and downward messages πA(A), πB(A,B) (one copy to C, one to D), πC(B,C)]
πA(A) = P(A)
πB→C(A,B) = P(B|A) ∙ πA(A) ∙ λD(A,B)
πB→D(A,B) = P(B|A) ∙ πA(A) ∙ λC(A,B)
πC(B,C) = ΣA P(C|A) ∙ πB→C(A,B)
Computing marginals from the bucket-tree
E,B,C : P(E|B,C)
A,B,D : P(D|A,B)
A,B,C : P(C|A)
B,A : P(B|A)
A : P(A)
[Figure: the bucket tree annotated with its λ and π messages]
P(C | evidence) ∝ Σ_{A,B} P(C|A) ∙ πB→C(A,B) ∙ λE(B,C)
Buckets -> Super-buckets -> Clusters
[Figure: left, the bucket tree with clusters {G,F}, {F,B,C}, {D,B,A}, {A,B,C}, {B,A}, {A} for a network with CPTs P(A), P(B|A), P(C|A), P(D|A,B), P(F|B,C), P(G|F); right, the buckets {B,A}, {A,B,C}, {D,B,A}, {F,B,C} collapsed into one super-bucket {A,B,C,D,F} connected to {G,F} via separator F]
Time-space trade off!
Tree decomposition
• A tree decomposition for a belief network ‹X,D,G,P› is a triple ‹T,χ,ψ›, where T=(V,E) is a tree, and χ and ψ are labeling functions associating with each vertex v ∈ V two sets χ(v) ⊆ X and ψ(v) ⊆ P such that:
– For each function (CPT) pi ∈ P there is exactly one vertex v such that pi ∈ ψ(v) and scope(pi) ⊆ χ(v)
– For each variable Xi ∈ X, the set {v ∈ V | Xi ∈ χ(v)} forms a connected sub-tree (running intersection property)
• A join-tree is a tree decomposition where all clusters are maximal
– E.g., a bucket-tree is a tree decomposition but not necessarily a join-tree

Treewidth and separator
• The width (aka treewidth) of a tree decomposition ‹T,χ,ψ› is max_v |χ(v)|, and its hyperwidth is max_v |ψ(v)|
• Given two adjacent vertices u and v of a tree decomposition, the separator of u and v is defined as sep(u,v) = χ(u) ∩ χ(v)
Finding join-tree decompositions• Good join trees using triangulation
Create induced graph G’ along some ordering d Identify all maximal cliques in G’ Order cliques {C1, C2, …, Ct} by rank of the highest
vertex in each clique Form the join tree by connecting each Ci to a
predecessor Cj (j < i) sharing the largest number of vertices with Ci
Example
[Figure: the moral graph and induced graph of the example network; maximal cliques C1 = {A,B,C}, C2 = {A,B,D}, C3 = {B,C,E}]
Join tree: cluster {A,B,C} holds P(A), P(B|A), P(C|A); separator AB connects it to cluster {A,B,D} with P(D|A,B); separator BC connects it to cluster {B,C,E} with P(E|B,C)
Treewidth = 3, separator size = 2 (e.g., χ(C3) = {B,C,E}, ψ(C3) = {P(E|B,C)})
Tree decomposition for belief updating
[Figure: a belief network over A, B, C, D, E, F, G and a tree decomposition for it:
cluster 1: {A,B,C} with P(A), P(B|A), P(C|A,B)
cluster 2: {B,C,D,F} with P(D|B), P(F|C,D)
cluster 3: {B,E,F} with P(E|B,F)
cluster 4: {E,F,G} with P(G|E,F)
separators: BC between 1–2, BF between 2–3, EF between 3–4]
Tree decomposition for belief updating
[Figure: the same network and tree decomposition]
h(1,2)(B,C) = ΣA P(A) P(B|A) P(C|A,B)
h(2,3)(B,F) = Σ_{C,D} P(D|B) P(F|C,D) h(1,2)(B,C)
h(3,4)(E,F) = ΣB P(E|B,F) h(2,3)(B,F)
h(4,3)(E,F) = P(G=g|E,F)   (G observed)
h(3,2)(B,F) = ΣE P(E|B,F) h(4,3)(E,F)
h(2,1)(B,C) = Σ_{D,F} P(D|B) P(F|C,D) h(3,2)(B,F)
Time: O(exp(w*+1)), Space: O(exp(sep))
CTE - properties
• Correctness and completeness: algorithm CTE is correct, i.e., it computes the exact joint probability of a single variable and the evidence
• Time complexity: O(deg ∙ (n+N) ∙ d^(w*+1))
• Space complexity: O(N ∙ d^sep)
» deg = max degree of a node in T
» n = number of variables (= number of CPTs)
» N = number of nodes in T
» d = maximum domain size
» w* = induced width
» sep = separator size
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) Cycle cutset scheme VE+C hybrid AND/OR search (tree, graph)
Conditioning
[Figure: the full OR search tree over A, B, C, D, E with 0/1 branches, and the example network]
P(A|E=0) = P(A, E=0) / P(E=0) = ?
P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=0|A=0,B=0)P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=1|A=0,B=0)
…P(A=0)P(B=1|A=0)P(C=1|A=0)P(E=0|B=1,C=1)P(D=1|A=0,B=1)
∑ = P(A=0, E=0)
Conditioning
[Figure: the same OR search tree; the subtrees below A=0 and A=1 accumulate P(A=0, E=0) and P(A=1, E=0)]
P(A=0 | E=0) = P(A=0, E=0) / (P(A=0, E=0) + P(A=1, E=0))
P(A=1 | E=0) = P(A=1, E=0) / (P(A=0, E=0) + P(A=1, E=0))
Conditioning + Elimination
IDEA: condition until w* of the remaining graph gets small enough!
[Figure: conditioning search over A on top; the remaining problem over B, C, D, E is solved by elimination]
A spectrum of hybrids: w* = 0 gives pure search, w* = 1 conditions on a loop cutset, and in general conditioning on a w-cutset leaves elimination subproblems of induced width w
P(E=0) = ?
Loop-cutset method• Condition until we get a polytree (no loops)
subset of conditioning variables = loop-cutset
[Figure: conditioning on A: for each value A=0 and A=1, the remaining network over B, C, D, E is a polytree]
P(B|D=0) = P(B,A=0|D=0) + P(B,A=1|D=0)
Loop-cutset method is time exponential in loop-cutset size and linear space!
w-cutset method
• Identify a w-cutset, Cw, of the network Finding smallest loop-cutset/w-cutset is NP-hard
• For each assignment of the cutset, solve by VE the conditioned subproblem
• Aggregate the solutions over all cutset assignments
• Time complexity: exp(|Cw| + w)
• Space complexity: exp(w)
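A sketch of the w-cutset scheme; `solve_conditioned` stands in for a variable-elimination call on the network with the given variables clamped, and is assumed rather than defined here:

from itertools import product

def cutset_condition(cutset, domains, evidence, solve_conditioned):
    """Aggregate exact sub-solutions over all cutset assignments."""
    total = 0.0
    for vals in product(*(domains[c] for c in cutset)):
        clamped = dict(evidence)
        clamped.update(zip(cutset, vals))
        total += solve_conditioned(clamped)   # P(cutset = vals, evidence)
    return total                              # P(evidence)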
Interleaving Conditioning and Elimination
[Animation: eliminate variables while the remaining width is small, condition (split the search) on a variable when it is not, and repeat on each subproblem]
General graphical models• All algorithms generalize to any graphical
model Through general operations of combination and
marginalization General BE, BTE, CTE, VE+C Applicable to Markov networks, to constraint
optimization, to counting number of solutions in SAT/CSP, etc.
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid Cycle cutset scheme AND/OR search (tree, graph)
Solution techniques
• Search (Conditioning)
– Complete: DFS search — time exp(n), space linear; AND/OR search — time exp(treewidth ∙ log n), space linear
– Incomplete: Gradient Descent, Stochastic Local Search
• Inference (Elimination)
– Complete: Variable Elimination / Bucket Elimination / Tree Clustering — time and space exp(treewidth)
– Incomplete: Mini-Bucket(i), Mini-Clustering(i), Belief Propagation
• Hybrids of search and inference — time exp(pathwidth), space exp(pathwidth)
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) Cycle cutset VE+C hybrid AND/OR search spaces
• AND/OR tree search• AND/OR graph search
OR search space
[Figure: a network over A, B, C, D, E, F]
Ordering: A B E C D F
[Figure: the OR search tree along this ordering — a complete binary tree of depth 6]
AND/OR search space
[Figure: the AND/OR search tree for the same problem, guided by a DFS tree of the moral graph: OR levels for the variables alternate with AND levels for their values 0/1, and the child subtrees below each AND node are solved independently]
OR vs. AND/OR
[Figure: the AND/OR search tree (guided by the pseudo tree) and the OR search tree (guided by a chain) for the same network, with a solution subtree highlighted in each]
AND/OR size: exp(4), OR size: exp(6)
AND/OR search spaces• The AND/OR search tree of R relative to a spanning-tree, T, has:
Alternating levels of: OR nodes (variables) and AND nodes (values)
• Successor function: The successors of OR nodes X are all its consistent values along its path The successors of AND <X,v> are all X child variables in T
• A solution is a consistent subtree• Task: compute the value of the root node
[Figure: the AND/OR search tree, with alternating OR nodes (variables A, B, C, E, D, F) and AND nodes (their values 0/1)]
From DFS trees to pseudo trees
[Figure: (a) a graph over nodes 1–7; (b) a DFS tree of depth 3; (c) a pseudo tree of depth 2; (d) a chain of depth 6]
(Freuder85, Bayardo & Miranker95)
Pseudo tree vs. DFS tree

Model (DAG)    w*     Pseudo tree avg. depth   DFS tree avg. depth
(N=50, P=2)    9.54   16.82                    36.03
(N=50, P=3)    16.1   23.34                    40.6
(N=50, P=4)    20.91  28.31                    43.19
(N=100, P=2)   18.3   27.59                    72.36
(N=100, P=3)   30.97  41.12                    80.47
(N=100, P=4)   40.27  50.53                    86.54

N = number of nodes, P = number of parents. MIN-FILL ordering. 100 instances.
Finding min-depth backbone trees
• Finding a min-depth DFS tree or pseudo tree is NP-complete, but:
• Given a tree decomposition whose treewidth is w*, there exists a pseudo tree T of G whose depth m satisfies:
m <= w* log n
(Bayardo & Miranker96, Bodlaender & Gilbert91)
Generating pseudo trees from bucket trees
[Figure: a network over A, B, C, D, E, F; its induced graph along ordering d: A B C E D F; the bucket tree with arcs labeled by shared variables such as (A), (AB), (AC)(BC), (AE), (BD)(DE), (AF)(EF); the bucket tree used as pseudo tree; and the resulting AND/OR search tree]
Other heuristics for pseudo trees• Depth-first traversal of the induced graph
constructed along some elimination ordering (e.g., min-fill) Sometimes can get slightly different trees than those
obtained from the bucket-tree
• Recursive decomposition of the dual hypergraph while minimizing the separator size at each step Functions (CPTs) are vertices in the dual hypergraph,
while variables are hyperedges Separator = set of hyperedges (i.e., variables)
Quality of the pseudo trees

Network    hypergraph width/depth   min-fill width/depth
barley     7 / 13                   7 / 23
diabetes   7 / 16                   4 / 77
link       21 / 40                  15 / 53
mildew     5 / 9                    4 / 13
munin1     12 / 17                  12 / 29
munin2     9 / 16                   9 / 32
munin3     9 / 15                   9 / 30
munin4     9 / 18                   9 / 30
water      11 / 16                  10 / 15
pigs       11 / 20                  11 / 26

Bayesian Networks Repository
AND/OR search tree properties
• Theorem: Any AND/OR search tree based on a pseudo tree is sound and complete (expresses all and only solutions)
• Theorem: Size of the AND/OR search tree is O(n k^m); size of the OR search tree is O(k^n)
• Theorem: Size of the AND/OR search tree can be bounded by O(exp(w* log n))
• Related to: (Freuder85; Dechter90; Bayardo et al. 96; Darwiche01; Bacchus et al. 03)
• When the pseudo tree is a chain we get an OR space
AND/OR vs. OR spaces

width  depth  OR: Time (sec.)  OR: Nodes    AND/OR: Time (sec.)  AND nodes  OR nodes
5      10     3.15             2,097,150    0.03                 10,494     5,247
4      9      3.13             2,097,150    0.01                 5,102      2,551
5      10     3.12             2,097,150    0.03                 8,926      4,463
4      10     3.12             2,097,150    0.02                 7,806      3,903
5      13     3.11             2,097,150    0.10                 36,510     18,255

Random graphs with 20 nodes, 20 edges and 2 values per node
Tasks and values of nodes
• v(n) is the value of the subtree T(n) for the task:
– Optimization (MPE): v(n) is the optimal solution in T(n)
– Belief updating: v(n) is the probability of evidence in T(n)
• Goal: compute the value of the root node recursively using DFS search of the AND/OR tree (a sketch follows)
• Theorem: Complexity of AND/OR DFS search is: Space O(n); Time O(n k^m), hence O(exp(w* log n))
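A recursive sketch of the DFS computation for belief updating, assuming `children` gives pseudo-tree children and `weight(var, val, path)` returns the product of CPTs that mention var and become fully instantiated once path is extended with var = val (1 if none):

def or_value(var, path, children, domains, weight):
    # OR node: weighted sum over the variable's values
    return sum(weight(var, val, path) *
               and_value(var, val, path, children, domains, weight)
               for val in domains[var])

def and_value(var, val, path, children, domains, weight):
    # AND node: product over independent child subproblems (1.0 if terminal)
    path = dict(path)
    path[var] = val
    p = 1.0
    for child in children.get(var, []):
        p *= or_value(child, path, children, domains, weight)
    return p

# P(evidence) = or_value(root, {}, children, domains, weight); replacing
# sum with max at OR nodes yields the MPE value instead.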
Weighted AND/OR tree (belief updating)
[Figure: the weighted AND/OR search tree for the example network, with evidence D=1, E=0; arc weights appear on the OR-to-AND arcs]
P(D|B,C): (B,C)=(0,0): D=0 .2, D=1 .8; (0,1): .1/.9; (1,0): .3/.7; (1,1): .5/.5 — evidence D=1
P(E|A,B): (A,B)=(0,0): E=0 .4, E=1 .6; (0,1): .5/.5; (1,0): .7/.3; (1,1): .2/.8 — evidence E=0
P(B|A): A=0: B=0 .4, B=1 .6; A=1: .1/.9
P(C|A): A=0: C=0 .2, C=1 .8; A=1: .7/.3
P(A): A=0 .6, A=1 .4
Buckets: A: P(A); B: P(B|A); E: P(E|A,B), E=0; C: P(C|A); D: P(D|B,C), D=1
w(X,x) = product of CPTs that contain X and whose scope is fully instantiated along the path
Computing node values (belief updating)
OR node: v(A) = Σ_{i=1..k} w(A,i) ∙ v(A,i)
[Figure: OR node A with AND children 1 … k, arc weights w(A,1) … w(A,k) and child values v(A,1) … v(A,k)]
AND node: v(⟨A,0⟩) = ∏_{i=1..m} v(Xi)
[Figure: AND node ⟨A,0⟩ with OR children X1 … Xm and values v(X1) … v(Xm)]
NOTE:
• the value of a terminal AND node is 1
• the weight of an OR-AND arc for which no CPTs are fully instantiated is 1
AND/OR tree algorithm (belief updating)
AND node: Combination operator (product)
OR node: Marginalization operator (summation)
Value of node = updated belief for sub-problem below
[Figure: the weighted AND/OR tree with values propagated bottom-up — summation at OR nodes, product at AND nodes; the subtree below A=0 evaluates to .3028 and below A=1 to .1559]
Result: P(D=1, E=0) = 0.3028 ∙ 0.6 + 0.1559 ∙ 0.4 = 0.24408
Complexity of AND/OR tree search

         AND/OR tree                           OR tree
Space    O(n)                                  O(n)
Time     O(n k^m), i.e., O(n k^(w* log n))     O(k^n)

(Freuder & Quinn85), (Collin, Dechter & Katz91), (Bayardo & Miranker95), (Darwiche01)
k = domain size, m = depth of pseudo-tree, n = number of variables, w* = treewidth
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search spaces
• AND/OR tree search• AND/OR graph search
From search trees to search graphs
• Any two nodes that root identical sub-trees or sub-graphs can be merged
AND/OR search tree
[Figure: a network over A, B, C, D, E, F, G, H, J, K with its pseudo tree, and the corresponding AND/OR search tree; the subtrees below G, H, J, K are repeated many times]
AND/OR search graph
[Figure: the same search space after merging identical subtrees: each distinct subproblem below G, H, J, K now appears only once]
Merging based on context• One way of recognizing nodes that can be merged
context(X) = ancestors of X in the pseudo tree that are connected to X, or to descendants
of X
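A sketch of computing contexts from a pseudo tree, assuming `parent` maps each variable to its pseudo-tree parent (None at the root), `children` is the inverse map, and `edges` is the set of primal-graph edges represented as frozensets:

def contexts(parent, children, edges):
    def subtree(x):                        # variables in x's pseudo subtree
        out = {x}
        for c in children.get(x, []):
            out |= subtree(c)
        return out
    ctx = {}
    for x in parent:
        anc, a = [], parent[x]
        while a is not None:               # pseudo-tree ancestors of x
            anc.append(a)
            a = parent.get(a)
        sub = subtree(x)
        ctx[x] = {v for v in anc           # ancestors connected to x's subtree
                  if any(frozenset((v, u)) in edges for u in sub)}
    return ctx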
[Figure: the pseudo tree for a network over A, B, C, D, E, F with node contexts annotated: [ ], [A], [AB], [AB], [BC], [AE]; merging nodes with equal context turns the search tree into the context-minimal AND/OR graph]
AND/OR graph algorithm (belief updating)
[Figure: the context-minimal AND/OR graph for the example with evidence E=0; nodes with identical contexts are merged and share cache entries]
Contexts: A: [ ], B: [A], E: [AB], C: [AB], D: [BC]
Cache table for D: (B,C)=(0,0): .8; (0,1): .9; (1,0): .7; (1,1): .1
Result: P(D=1, E=0) = 0.24408
Context-minimal AND/OR graph
[Figure: a larger network over variables A–P with pseudo tree ordered (C K H A B E J L N O D P M F G); node contexts are annotated, e.g., [ ], [C], [CK], [CKL], [CKLN], [CKO], [CH], [CHA], [CHAB], [CHAE], [CEJ], [CD], [AB], [AF]; the context-minimal AND/OR graph merges all nodes with equal context]
How big is the context?
Theorem: The maximum context size for a pseudo tree is equal to the treewidth of the graph along the pseudo tree.
[Figure: the same pseudo tree with its annotated contexts; the largest context has size equal to the treewidth]
max context size = treewidth
Treewidth vs. pathwidth
[Figure: a graph over A–M with two decompositions]
TREE decomposition with clusters ABC, BDEF, BDFG, EFH, FHK, HJ, KLM: treewidth = 3 = (max cluster size) − 1
CHAIN decomposition with clusters ABC, BDEFG, EFH, FHKJ, KLM: pathwidth = 4 = (max cluster size) − 1
AND/OR graph search
• AO(i): searches depth-first, caching tables over contexts of at most i variables (i = the max size of a cache table, i.e., number of variables in a context)
• i = 0: space O(n), time O(exp(w* log n)) — plain AND/OR tree search
• 0 < i < w*: space O(exp i), time O(exp(m_i + i)), where m_i is the depth of the corresponding w-cutset tree (see the w-cutset slides below)
• i = w*: space O(exp w*), time O(exp w*) — full context-based caching
Complexity of AND/OR graph search

         AND/OR graph    OR graph
Space    O(n k^w*)       O(n k^pw*)
Time     O(n k^w*)       O(n k^pw*)

k = domain size, n = number of variables, w* = treewidth, pw* = pathwidth
w* ≤ pw* ≤ w* log n
Related work• Recursive Conditioning (RC) (Darwiche01)
Can be viewed as an AND/OR graph search algorithm guided by tree
Guiding tree structure is called “dtree”
• Value Elimination (VE) (Bacchus et al.03) Also an AND/OR graph search algorithm using an
advanced caching scheme based on components rather than graph-based contexts
Can use dynamic variable orderings
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search spaces
• AND/OR tree search• AND/OR graph search
AND/OR w-cutset
[Figure: a moral graph over A–M and its conditioned subgraphs: removing a 3-cutset, a 2-cutset, and a 1-cutset leaves remaining graphs of progressively smaller induced width]
AND/OR w-cutset
[Figure: the moral graph, a pseudo tree, and the corresponding 1-cutset tree for the same network]
Searching AND/OR graphs
• AO(i): searches depth-first, caching tables over contexts of at most i variables (i = the max size of a cache table, i.e., number of variables in a context)
• i = 0: space O(n), time O(exp(w* log n))
• 0 < i < w*: space O(exp i), time O(exp(m_i + i)), where m_i is the depth of the w-cutset tree (next slide)
• i = w*: space O(exp w*), time O(exp w*)
w-cutset trees over AND/OR space
• Definition: T_w is a w-cutset tree relative to a backbone pseudo tree T iff T_w contains the root of T and, when removed, yields treewidth w.
• Theorem: AO(i) time complexity for backbone T is O(exp(i + m_i)) and space O(exp i), where m_i is the depth of the T_i tree.
• This is better than the w-cutset bound O(exp(i + c_i)), where c_i is the number of nodes in T_i
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search for Most Probable Explanations
AND/OR Branch-and-Bound for MPE
• Solved by BE in time and space exponential in treewidth w*
• Solved by Conditioning in linear space and time exponential in the number of variables n
• It can be solved by AND/OR search: Tree search: space O(n), time O(exp(w* log n)) Graph search: time and space O(exp(w*))
MPE = max_{X1,…,Xn} ∏_{i=1}^{n} P(Xi | pa(Xi))
Weighted AND/OR tree (MPE task)
[Figure: the same weighted AND/OR search tree and CPTs as in the belief-updating example, with evidence D=1, E=0]
Buckets: A: P(A); B: P(B|A); E: P(E|A,B), E=0; C: P(C|A); D: P(D|B,C), D=1
w(X,x) = product of CPTs that contain X and whose scope is fully instantiated along the path
Computing node values (MPE task)
OR node: v(A) = max_{i=1..k} w(A,i) ∙ v(A,i)
[Figure: OR node A with AND children 1 … k, arc weights w(A,1) … w(A,k) and child values v(A,1) … v(A,k)]
AND node: v(⟨A,0⟩) = ∏_{i=1..m} v(Xi)
[Figure: AND node ⟨A,0⟩ with OR children X1 … Xm and values v(X1) … v(Xm)]
NOTE:
• the value of a terminal AND node is 1
• the weight of an OR-AND arc for which no CPTs are fully instantiated is 1
AND/OR tree algorithm (MPE task)
AND node: Combination operator (product)
OR node: Marginalization operator (maximization)
Value of node = MPE value for sub-problem below
[Figure: the weighted AND/OR tree with values propagated bottom-up — maximization at OR nodes, product at AND nodes; the subtree below A=0 evaluates to .12 and below A=1 to .081]
Result: MPE(D=1, E=0) = max(0.12 ∙ 0.6, 0.081 ∙ 0.4) = 0.072
Branch-and-Bound search
[Figure: an OR search tree; at node n, g(n) is the cost of the search path to n and h(n) estimates the optimal cost below n]
Upper Bound: UB(n) = g(n) * h(n)
Prune if UB(n) ≤ LB (the current Lower Bound)
(Lawler & Wood66)
Partial solution tree
[Figure: pseudo tree over A, B, C, D and four partial solution trees, e.g., (A=0, B=0, C=0, D=0), (A=0, B=0, C=0, D=1), (A=0, B=1, C=0, D=0), (A=0, B=1, C=0, D=1)]
Extension(T’) – the set of solution trees that extend T’
Exact evaluation function
[Figure: an AND/OR search tree whose arc weights come from cost functions f1(A,B,C), f2(A,B,F), f3(B,D,E); the current partial solution tree T’ = (A=0, B=1, C=0, D=0) has tip nodes ⟨D,0⟩ and F with exact values v(D,0) and v(F)]
f*(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * v(D,0) * v(F)
Heuristic evaluation function
[Figure: the same search space with heuristic estimates at the tip nodes, h(D,0) = 4 and h(F) = 5]
f(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * h(D,0) * h(F) ≥ f*(T’), since h(n) ≥ v(n)
AND/OR Branch-and-Bound search
[Figure: AOBB expanding the AND/OR tree depth-first; the subtree below the current tip is pruned when f(T’) ≤ LB]
(Marinescu and Dechter, 05)
AND/OR Branch-and-Bound search• Associate each node n with a heuristic upper
bound h(n) on v(n)• EXPAND (top-down)
Evaluate f(T’) of the current partial solution sub-tree T’, and prune search if f(T’) ≤ LB
Expand the tip node n by generating its successors• PROPAGATE (bottom-up)
Update value of the parent p of n• OR nodes: maximization• AND nodes: product
How to Generate Heuristics• The principle of relaxed models
Mini-Bucket Elimination for belief networks(Pearl86)
Grid Networks (BN)
[Table: MPE results on grid networks 90-24-1, 90-26-1, 90-30-1 (with their (w*, h) and (n, e) parameters), comparing SamIam v2.3.2 with MBE(i), BB+SMB(i), AOBB+SMB(i), BB+DMB(i), AOBB+DMB(i) for i = 10, 14, 18, 20, reporting time and expanded nodes; AOBB with mini-bucket heuristics solves instances on which the other solvers run out of time. Min-fill pseudo tree. Time limit 1 hour.]
(Sang et al.05)
Genetic Linkage Analysis (BN)
[Table: MPE results on genetic linkage networks ped18, ped25, ped30, ped33, ped39, comparing Superlink v1.6 and SamIam v2.3.2 with MBE(i), BB+SMB(i), AOBB+SMB(i) for i = 12, 16, 20, reporting time and expanded nodes; AOBB+SMB(i) solves instances (e.g., ped30 in 82.25 s at i = 20) that are infeasible for Superlink and SamIam. Min-fill pseudo tree. Time limit 3 hours.]
(Fishelson & Geiger02)
Memory intensive AND/OR Branch-and-Bound
• Associate each node n with a heuristic upper bound h(n) on v(n)
• EXPAND (top-down) Evaluate f(T’) of the current partial solution sub-tree T’, and
prune search if f(T’) ≤ LB If not in cache, expand the tip node n by generating its
successors• PROPAGATE (bottom-up)
Update value of the parent p of n• OR nodes: maximization• AND nodes: multiplication
Cache value of n, based on context
Best-first AND/OR search for MPE• Best-first search expands first the node with
the best heuristic evaluation function among all nodes encountered so far
• It never expands nodes whose cost is beyond the optimal one, unlike depth-first search algorithms (Dechter & Pearl85)
• Superior among memory intensive algorithms employing the same heuristic function
Best-First AND/OR Search• Maintains the set of best partial solution trees• EXPAND (top-down)
Traces down marked connectors from root (best partial solution tree) Expands a tip node n by generating its successors n’ Associate each successor with heuristic estimate h(n’)
• Initialize v(n’) = h(n’)
• REVISE (bottom-up) Updates node values v(n)
• OR nodes: maximization• AND nodes: multiplication
Marks the most promising solution tree from the root Label the nodes as SOLVED:
• OR is SOLVED if marked child is SOLVED• AND is SOLVED if all children are SOLVED
• Terminate when root node is SOLVED
[specializes Nilsson’s AO* to graphical models (Nilsson80)]
(Marinescu & Dechter, 07)
Grid Networks (BN)
[Table: MPE results on grid networks 90-24-1, 90-34-1, 90-38-1, comparing SamIam with MBE(i), BB-C+SMB(i), AOBB+SMB(i), AOBB-C+SMB(i), AOBF-C+SMB(i) for i = 12, 14, 16, 18, reporting time and expanded nodes; the best-first variant AOBF-C+SMB(i) expands the fewest nodes. Min-fill pseudo tree. Time limit 1 hour.]
Solving the MAP task
• Solved by BE in time and space exponential in constrained induced width w*
• Solved by AND/OR search: Tree search: space O(n), time O(exp(w* log n)) Graph search: time and space O(exp(w*))
MAP: (a1*, …, ak*) = argmax_a Σ_{x ∈ X\A} P(x, e)
Bucket elimination for MAP
[Figure: the example network and its moral graph (“marry parents”)]
Variables A and B are the hypothesis variables; variable E is evidence (E = 0)
MAP = max_{a,b} P(a, b, e=0) = max_{a,b} Σ_{c,d} P(a, b, c, d, e=0)
MAP = max_a P(a) max_b P(b|a) Σ_c P(c|a) Σ_d P(d|a,b) Σ_{e=0} P(e|b,c)
Bucket elimination for MAP
Bucket E:
Bucket D:
Bucket C:
Bucket B:
Bucket A:
P(E|B,C), E = 0
P(D|A,B)
P(A)
λE(B, C)
λC(A,B)λD(A, B)
λB(A)
MAP value
P(C|A)
P(B|A)
SUM buckets
MAX buckets
Bucket elimination for MAP
• Elimination order is important: SUM variables are eliminated first, followed by the MAX variables
– ordering A, B, C, D, E is legal; ordering A, C, D, E, B is illegal
• The induced width corresponding to a legal elimination order is called the constrained induced width cw*
– Typically it may be far larger than the unconstrained induced width, i.e., cw* ≥ w*
• When interleaving MAX and SUM (using unconstrained orderings) the result is an Upper Bound on the MAP value Can be used as a guiding heuristic function for search
AND/OR tree algorithm for MAP
AND node: Combination operator (product)
OR node: MAX for hypothesis, SUM otherwise
[Figure: the weighted AND/OR tree with SUM applied at OR nodes of the summation variables and MAX at the hypothesis variables; the subtree below A=0 evaluates to .162 and below A=1 to .0936]
Result: MAP(D=1, E=0) = max(0.162 * 0.6, 0.0936 * 0.4) = 0.0972
AND/OR search for MAP• Pseudo tree must be consistent with the
constrained elimination order• Graph search via context-based caching
• Time and space complexity Tree search:
• Space linear, time O(exp(cw*log n)) Graph search:
• Time and space O(exp(cw*))
Outline• Probabilistic modeling with joint distributions• Conditional independence and factorization• Belief networks• Inference in belief networks
Exact inference Approximate inference
Approximate inference
• Mini-Bucket Elimination
– Mini-clustering
• Iterative Belief Propagation
– IJGP – Iterative Join-Graph Propagation
• Sampling
– Forward sampling, Gibbs sampling (MCMC), Importance sampling (a forward-sampling sketch follows this outline)
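The deck only lists the sampling schemes, so here is a minimal forward (ancestral) sampling sketch with rejection of samples inconsistent with the evidence; `cpt[x]` maps a tuple of parent values to a distribution over x's values — an assumed representation.

import random

def forward_sample(order, parents, cpt, domains):
    """Sample each variable in topological order from its CPT."""
    sample = {}
    for x in order:
        dist = cpt[x][tuple(sample[p] for p in parents[x])]
        r, acc = random.random(), 0.0
        for val in domains[x]:
            acc += dist[val]
            if r <= acc:
                sample[x] = val
                break
    return sample

def estimate(query, evidence, n, *model):
    """Estimate P(query | evidence) by rejection over n forward samples."""
    hits = total = 0
    for _ in range(n):
        s = forward_sample(*model)
        if all(s[v] == val for v, val in evidence.items()):
            total += 1
            hits += all(s[v] == val for v, val in query.items())
    return hits / total if total else None

Rejection wastes samples when the evidence is unlikely; importance sampling and Gibbs sampling (MCMC) address exactly this.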
Solution techniques
• Search (Conditioning)
– Complete: DFS search — time exp(n), space linear; AND/OR search — time exp(treewidth ∙ log n), space linear
– Incomplete: Gradient Descent, Stochastic Local Search
• Inference (Elimination)
– Complete: Variable Elimination / Bucket Elimination / Tree Clustering — time and space exp(treewidth)
– Incomplete: Mini-Bucket(i), Mini-Clustering(i), Belief Propagation
• Hybrids of search and inference — time exp(pathwidth), space exp(pathwidth)
Variable elimination (MPE)
A
B C
ED
P(A)
P(C|A)P(B|A)
P(E|B,C)
P(D|A,B)
MPE = ?
maxA,E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) =
maxA P(A) maxE=0 maxD maxC P(C|A) maxB P(B|A) P(D|A,B) P(E|B,C)
The innermost maximization defines λB(A,D,C,E) — Variable Elimination
Given a belief network and some evidence
Bucket elimination (MPE)
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
max∏ Elimination operator
λB(A,D,C,E)
λC(A,D,E)
λD(A,E)
λE(A)
MPE
B
C
D
E
A
w* = 4“induced width”(max clique size)
bucket widths (top to bottom): 4, 3, 1, 1, 0
MBE: Mini-Bucket Elimination• Computation in a bucket is time and space
exponential in the number of variables involved (i.e., width)
• Therefore, partition functions in a bucket into “mini-buckets” on smaller number of variables
• The idea is similar to i-consistency: bound the size of recorded dependencies (Dechter 2003)
Idea: MPE task
Split a bucket into mini-buckets => bound complexity:
max_X ∏ (h ∙ g) ≤ (max_X ∏ h) ∙ (max_X ∏ g)
Exponential complexity decrease: O(e^n) → O(e^r) + O(e^(n−r))
MBE(i=3) in action for MPE
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C) P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
λB(C,E)
λC(A,D,E)
Upper Bound on MPE value
λE(A)
λB(A,D)
λD(A,E)
4 variables: split
3 variables: OK
3 variables: OK
2 variables: OK
1 variable: OK
Mini-buckets: each processed by the max∏ elimination operator
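A sketch of the greedy partition step, using the (scope, table) factor representation from the earlier sketches; each resulting mini-bucket is then eliminated separately (per the slides: for MPE all by maximization, for P(evidence) the first by summation and the rest by maximization or minimization, depending on the desired bound).

def partition_bucket(factors, i_bound):
    """Greedily group a bucket's factors into mini-buckets whose combined
    scope has at most i_bound distinct variables."""
    minibuckets = []   # each entry: [scope set, list of factors]
    for scope, table in factors:
        for mb in minibuckets:
            if len(mb[0] | set(scope)) <= i_bound:   # fits the i-bound
                mb[0].update(scope)
                mb[1].append((scope, table))
                break
        else:
            minibuckets.append([set(scope), [(scope, table)]])
    return minibuckets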
MBE(i=3) in action for MPEBucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
λB(C,E)
λC(A,D,E)
λE(A)
λB(A,D)
λD(A,E)
a’ = argmax_A P(A) ∙ λE(A)
e’ = 0
d’ = argmax_D λC(a’, D, e’) ∙ λB(a’, D)
c’ = argmax_C P(C|a’) ∙ λB(C, e’)
b’ = argmax_B P(e’|B, c’) ∙ P(d’|a’, B) ∙ P(B|a’)
Return (a’, b’, c’, d’, e’)
A Lower Bound can also be computed as the probability of the sub-optimal assignment P(a’, b’, c’, d’, e’)
MBE(i=3) for probability of evidence
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C) P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
λB(C,E)
λC(A,D,E)
Upper Bound on P(evidence)
λE(A)
λB(A,D)
λD(A,E)
4 variables: split
3 variables: OK
3 variables: OK
2 variables: OK
1 variable: OK
Mini-buckets: each processed by the ∑∏ elimination operator
MBE(i) for probability of evidence• If we process all mini-buckets by summation
then we get an unnecessarily large upper bound on the probability of evidence
• Tighter upper bound Process first mini-bucket by summation and
remaining ones by maximization• We can also get a lower bound on P(evidence)
Process first mini-bucket by summation and remaining ones by minimization
Properties of MBE(i)• Controlling parameter i (called i-bound)
Maximum number of distinct variables in a mini-bucket Outputs both a lower and an upper bound
• Complexity: O(exp(i)) time and space• As i-bound increases, both accuracy and time complexity
increase Clearly, if i = w*, then we have pure BE
• Possible use of mini-bucket approximations As anytime algorithms (Dechter & Rish, 1997) As heuristic functions for depth-first and best-first search (Kask
& Dechter, 2001), (Marinescu & Dechter, 2005)
Mini-Bucket Heuristics• Static Mini-Buckets
Pre-compiled Reduced overhead Less accurate Static variable ordering
• Dynamic Mini-Buckets Computed dynamically Higher overhead High accuracy Dynamic variable ordering
Heuristic evaluation function
[Figure: the AND/OR search space from the earlier Branch-and-Bound slides, with cost functions f1(A,B,C), f2(A,B,F), f3(B,D,E) and heuristic estimates h(D,0) = 4 and h(F) = 5 at the tip nodes of T’ = (A=0, B=1, C=0, D=0)]
f(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * h(D,0) * h(F) ≥ f*(T’), since h(n) ≥ v(n)
Bucket elimination
[Figure: buckets along ordering (A, B, C, D, E, F, G) with functions f(A,B), f(B,C), f(B,F), f(A,G), f(F,G), f(B,E), f(C,E), f(A,D), f(B,D), f(C,D) and messages hB(A), hC(A,B), hD(A,B,C), hE(B,C), hF(A,B), hG(A,F)]
h*(a, b, c) = hD(a, b, c) * hE(b, c)
(Dechter99)
Static mini-bucket heuristics
[Figure: the same buckets processed by MBE(3); bucket D splits into mini-buckets {f(B,D), f(C,D)} → hD(B,C) and {f(A,D)} → hD(A)]
MBE(3): h(a, b, c) = hD(a) * hD(b, c) * hE(b, c) ≥ h*(a, b, c)
Dynamic mini-bucket heuristics
[Figure: the buckets conditioned on the current partial assignment (a, b), then processed by MBE(3)]
MBE(3): h(a, b, c) = hD(c) * hE(c) = h*(a, b, c)
Static vs. Dynamic Mini-Bucket Heuristics
s1196 ISCAS’89 circuit.
Approximate inference• Mini-Bucket Elimination
Mini-clustering (tree decompositions)• Iterative Belief Propagation
IJGP – Iterative Joint Graph Propagation• Sampling
Forward sampling Gibbs sampling (MCMC) Importance sampling Particle filtering
Cluster Tree Elimination (CTE)• Correctness and completeness:
Algorithm CTE is correct, i.e. it computes the exact posterior joint probability of all single variables (or subsets) and the evidence.
• Time complexity: O(deg ∙ (n+N) ∙ d^(w*+1))
• Space complexity: O(N ∙ d^sep)
where deg = the maximum degree of a node
n = number of variables (= number of CPTs)
N = number of nodes in the tree decomposition
d = the maximum domain size of a variable
w* = the induced width
sep = the separator size
Cluster Tree Elimination - messages
[Figure: the tree decomposition with clusters 1: {A,B,C} p(a), p(b|a), p(c|a,b); 2: {B,C,D,F} p(d|b), p(f|c,d); 3: {B,E,F} p(e|b,f); 4: {E,F,G} p(g|e,f); separators BC, BF, EF]
h(1,2)(b,c) = Σ_a p(a) p(b|a) p(c|a,b)
h(2,3)(b,f) = Σ_{c,d} p(d|b) p(f|c,d) h(1,2)(b,c)
sep(2,3) = {B,F}, elim(2,3) = {C,D}
Mini-Clustering for belief updating• Motivation:
Time and space complexity of Cluster Tree Elimination depend on the induced width w* of the problem
When the induced width w* is big, CTE algorithm becomes infeasible
• The basic idea: Try to reduce the size of the cluster (the exponent);
partition each cluster into mini-clusters with less variables Accuracy parameter i = maximum number of variables in a
mini-cluster The idea was explored for variable elimination (MBE)
Idea of Mini-Clustering
Split a cluster into mini-clusters => bound complexity
cluster(u) = {h1, …, hr, hr+1, …, hn}
Exact: h = Σ_elim ∏_{i=1..n} hi
Approximate: g = (Σ_elim ∏_{i=1..r} hi) ∙ (Σ_elim ∏_{i=r+1..n} hi), with g ≥ h
Exponential complexity decrease: O(e^n) → O(e^r) + O(e^(n−r))
Mini-Clustering (MC)
h(1,2)(b,c) = Σ_a p(a) p(b|a) p(c|a,b)  (cluster 1 is within the i-bound, so its message is exact)
[Figure: cluster 2 = {B,C,D,F} with p(d|b), h(1,2)(b,c), p(f|c,d), split for i = 3 into mini-clusters {p(d|b), h(1,2)(b,c)} and {p(f|c,d)}; sep(2,3) = {B,F}, elim(2,3) = {C,D}]
Instead of the exact CTE message h(2,3)(b,f) = Σ_{c,d} p(d|b) h(1,2)(b,c) p(f|c,d), MC(3) sends two smaller messages:
H(2,3): h1(2,3)(b) = Σ_{c,d} p(d|b) h1(1,2)(b,c);  h2(2,3)(f) = max_{c,d} p(f|c,d)
H(3,4): h1(3,4)(e,f) = Σ_b p(e|b,f) h1(2,3)(b) h2(2,3)(f)
H(4,3): h1(4,3)(e,f) = p(G=ge|e,f)
H(3,2): h1(3,2)(b,f) = Σ_e p(e|b,f) h1(4,3)(e,f)
H(2,1): h1(2,1)(b) = Σ_{d,f} p(d|b) h1(3,2)(b,f);  h2(2,1)(c) = max_{d,f} p(f|c,d)
Mini-Clustering - example
Mini-Clustering• Correctness and completeness:
Algorithm MC(i) computes a bound (or an approximation) on the joint probability P(Xi,e) of each variable and each of its values.
• Time & space complexity: O(exp(i))
Approximate inference• Mini-Bucket Elimination
Mini-clustering• Iterative Belief Propagation
IJGP – Iterative Join-Graph Propagation
• Sampling
Forward sampling Gibbs sampling (MCMC) Importance sampling Particle filtering
Iterative Belief Propagation (IBP)• Belief propagation is exact for poly-trees (Pearl, 1988)• IBP - applying BP iteratively to cyclic networks
• No guarantees for convergence• Works well for many coding networks
[Figure: a polytree fragment with parents U1, U2, U3 and children X1, X2, exchanging π and λ messages such as πX1(u1), λX2(u1); one step updates BEL(U1)]
Iterative Belief Propagation
[Figure: a belief network over A–J with CPTs P(A), P(C), P(H), P(B|A,C), P(E|B,C), P(D|A,B,E), P(F|C,D,E), P(G|H,F), P(I|F,G), P(J|H,G,I), and the dual graph IBP works on, with clusters A, C, H, ABC, BCE, ABDE, CDEF, FGH, FGI, GHIJ and labeled arcs]
Iterative Join-Graph Propagation (IJGP)• IBP is applied to a loopy network iteratively
not an anytime algorithm when it converges, it converges very fast
• MC applies bounded inference along a tree decomposition MC is an anytime algorithm controlled by i-bound MC converges in two passes up and down the tree
• IJGP combines: the iterative feature of IBP the anytime feature of MC
IJGP - The basic idea Apply Cluster Tree Elimination to any join-graph
We commit to graphs that are minimal I-maps
Avoid cycles as long as I-mapness is not violated
Result: use minimal arc-labeled join-graphs
IJGP - Example
[Figure: the belief network above; the dual graph IBP works on; an arc-minimal join-graph; and a minimal arc-labeled join-graph obtained by shrinking arc labels]
Join-graph decompositions
[Figure: (a) a minimal arc-labeled join-graph; (b) the join-graph obtained by collapsing nodes of (a), with (c) its minimal arc-labeled version; and a tree decomposition with clusters ABCDE, CDEF, FGHI, GHIJ]
Join-graphs
[Figure: the spectrum of join-graph decompositions of the same network, from the dual graph (on which IBP runs), through a minimal arc-labeled join-graph and a join-graph with collapsed nodes, to a tree decomposition; moving toward the tree decomposition gives more accuracy, moving toward the dual graph gives less complexity]
Message propagation
Cluster 1 = {A,B,C,D,E} contains p(a), p(c), p(b|a,c), p(d|a,b,e), p(e|b,c) and the incoming message h_(3,1)(b,c).
Minimal arc-labeled: sep(1,2) = {D,E}, elim(1,2) = {A,B,C}:
    h_(1,2)(d,e) = Σ_{a,b,c} p(a) p(c) p(b|a,c) p(d|a,b,e) p(e|b,c) h_(3,1)(b,c)
Non-minimal arc-labeled: sep(1,2) = {C,D,E}, elim(1,2) = {A,B}:
    h_(1,2)(c,d,e) = Σ_{a,b} p(a) p(c) p(b|a,c) p(d|a,b,e) p(e|b,c) h_(3,1)(b,c)
Bounded decompositions• We want arc-labeled decompositions such that:
the cluster size (internal width) is bounded by i (the accuracy parameter)
the width of the decomposition as a graph (external width) is as small as possible - closer to a tree
• Possible approaches to build decompositions:
partition-based algorithms - inspired by the mini-bucket decomposition
grouping-based algorithms
Partition-based algorithms
a) Schematic mini-bucket(i), i=3:
G: (GFE)
E: (EBF) (EF)
F: (FCD) (BF)
D: (DB) (CD)
C: (CAB) (CB)
B: (BA) (AB) (B)
A: (A)
b) [Figure: the corresponding minimal arc-labeled join-graph decomposition, with clusters GFE, EBF, FCD, CDB, CAB, BA, A holding P(G|F,E), P(E|B,F), P(F|C,D), P(D|B), P(C|A,B), P(B|A), P(A), and arcs labeled EF, BF, CD, CB, BA, A, B, F]
IJGP properties• IJGP(i) applies BP to a minimal arc-labeled join-graph whose cluster size is bounded by i
• On join-trees, IJGP finds exact beliefs!
• IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman and Weiss, 2001)
• Complexity of one iteration: time O(deg·(n+N)·d^(i+1)), space O(N·d)
Random networks - KL at convergence
[Figure: KL distance (log scale, 1e-5 to 1e-2) vs. i-bound (1-11) for IJGP, MC, and IBP; random networks, N=50, K=2, P=3, w*=16, 100 instances; left panel evidence=0, right panel evidence=5]
Random networks - KL vs. iterations
[Figure: KL distance (log scale) vs. number of iterations (0-35) for IJGP(2), IJGP(10), and IBP; random networks, N=50, K=2, P=3, w*=16, 100 instances; left panel evidence=0, right panel evidence=5]
Random networks - Time
[Figure: time in seconds (0-1.0) vs. i-bound (1-11) for IJGP with 20 iterations, MC, and IBP with 10 iterations; random networks, N=50, K=2, P=3, evid=5, w*=16, 100 instances]
IJGP summary• IJGP borrows the iterative feature from IBP and the anytime virtues of bounded inference from MC
• Empirical evaluation showed the potential of IJGP, which improves with iterations and most of the time with the i-bound, and scales up to large networks
• IJGP is almost always superior to IBP and MC, often by a high margin
• Based on all our experiments, we think that IJGP provides a practical breakthrough for the task of belief updating
• #CSP: IJGP can be used to generate solution-count estimates for depth-first Branch-and-Bound search
Approximate inference• Mini-Bucket Elimination / Mini-Clustering
• Iterative Belief Propagation
IJGP – Iterative Join-Graph Propagation
• Sampling
Forward sampling, Gibbs sampling (MCMC), Importance sampling
Approximation algorithms• Structural approximations: eliminate some dependencies
• Remove edges
• Mini-Bucket and Mini-Clustering approaches
• Local search, for optimization tasks: MPE, MAP
• Use your favorite MAX-CSP/WCSP/WSAT local search solver!
• Sampling: generate random samples and compute the values of interest from the samples, not from the original network
Sampling• Input: Bayesian network with set of nodes X
• Sample = a tuple with assigned values s = (X1=x1, X2=x2, ..., Xk=xk)
• A tuple may include all variables (except evidence) or a subset
• Sampling schemas dictate how to generate samples (tuples)
• Ideally, samples are distributed according to P(X|E)
Sampling fundamentals
Given a set of variables X = {X1, X2, ..., Xn} that represent a joint probability distribution Π(X) and some function g(X), we can compute the expected value of g(X):
    E[g] = ∫ g(x) Π(x) dx
Sampling from Π(X): given independent, identically distributed (iid) samples S^1, S^2, ..., S^T from Π(X), where each sample is an instantiation S^t = {x_1^t, x_2^t, ..., x_n^t}, it follows from the Strong Law of Large Numbers that
    ĝ = (1/T) Σ_{t=1..T} g(S^t)  →  E[g]
Sampling basics
• Given random variable X, D(X) = {0, 1}
• Given P(X) = {0.3, 0.7}
• Generate k=10 samples: 0,1,1,1,0,1,1,0,1,0
• Approximate P'(X):
    P'(X=0) = #samples(X=0) / #samples = 4/10 = 0.4
    P'(X=1) = #samples(X=1) / #samples = 6/10 = 0.6
    P'(X) = {0.4, 0.6}
How to draw a sample?
• Given random variable X, D(X) = {0, 1}
• Given P(X) = {0.3, 0.7}
• Sample X ~ P(X):
draw a random number r ∈ [0, 1]; if r < 0.3 then set X=0, else set X=1
• Can generalize to any domain size (see the sketch below)
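A sketch of that generalization in Python (illustrative only): walk the cumulative distribution until it exceeds the random number r.

    import random

    def sample_categorical(probs):
        # probs: probabilities over domain values 0 .. len(probs)-1
        r = random.random()          # r uniform in [0, 1)
        cum = 0.0
        for value, p in enumerate(probs):
            cum += p                 # running CDF
            if r < cum:
                return value
        return len(probs) - 1        # guard against floating-point round-off

    print(sample_categorical([0.3, 0.7]))       # 0 w.p. 0.3, 1 w.p. 0.7
    print(sample_categorical([0.2, 0.5, 0.3]))  # works for any domain size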
Sampling in BN
• Same idea: generate a set of T samples
• Estimate the posterior marginal P(Xi|E) from the samples
• Challenge: X is a vector and P(X) is a huge distribution represented by the BN
• Need to know:
How to generate a new sample? How many samples T do we need? How to estimate P(E=e) and P(Xi|e)?
Sampling algorithms
• Forward Sampling
• Gibbs Sampling (MCMC)
Blocking, Rao-Blackwellised
• Likelihood Weighting
• Importance Sampling
• Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks
Forward sampling
• Case with no evidence, E = {}
• Case with evidence, E = e
Forward sampling - no evidence (Henrion, 1988)
Input: Bayesian network over X = {X1,...,XN}, N - #nodes, T - #samples
Output: T samples
Process nodes in topological order - first process the ancestors of a node, then the node itself:
1. For t = 1 to T
2.   For i = 1 to N
3.     Sample x_i^t from P(Xi | pa_i)
Sampling a value
What does it mean to sample x_i^t from P(Xi | pa_i)?
• Assume D(Xi) = {0,1} and P(Xi | pa_i) = (0.3, 0.7)
• Draw a random number r from [0,1]: if r falls in [0, 0.3], set Xi = 0; if r falls in (0.3, 1], set Xi = 1
Forward sampling (example)
Network: X1 → X2, X1 → X3, {X2,X3} → X4, with CPTs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3)
// generate sample k (no evidence)
1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. Sample x4 from P(x4|x2,x3)
Forward Sampling - Answering Queries
Task: given T samples {S^1, S^2, ..., S^T}, estimate P(Xi = xi):
    P̂(Xi = xi) = #samples(Xi = xi) / T
Basically, count the proportion of samples where Xi = xi
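A runnable Python sketch of forward sampling on the four-node example above (the CPT numbers are invented for illustration):

    import random

    def draw(probs):
        r, cum = random.random(), 0.0
        for v, p in enumerate(probs):
            cum += p
            if r < cum:
                return v
        return len(probs) - 1

    # Made-up CPTs for the network X1 -> {X2, X3} -> X4 (binary variables).
    P_x1 = [0.6, 0.4]
    P_x2 = {0: [0.7, 0.3], 1: [0.2, 0.8]}          # P(X2 | X1)
    P_x3 = {0: [0.5, 0.5], 1: [0.1, 0.9]}          # P(X3 | X1)
    P_x4 = {(0,0): [0.9, 0.1], (0,1): [0.4, 0.6],  # P(X4 | X2, X3)
            (1,0): [0.3, 0.7], (1,1): [0.05, 0.95]}

    def forward_sample():
        x1 = draw(P_x1)                  # topological order: parents first
        x2 = draw(P_x2[x1])
        x3 = draw(P_x3[x1])
        x4 = draw(P_x4[(x2, x3)])
        return (x1, x2, x3, x4)

    T = 50000
    samples = [forward_sample() for _ in range(T)]
    # Estimate P(X4 = 1) as the proportion of samples with X4 = 1.
    print(sum(s[3] == 1 for s in samples) / T)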
Forward sampling w/ evidence
Input: Bayesian network over X = {X1,...,XN}, N - #nodes, E - evidence, T - #samples
Output: T samples consistent with E
1. For t = 1 to T
2.   For i = 1 to N
3.     Sample x_i^t from P(Xi | pa_i)
4.     If Xi ∈ E and x_i^t ≠ e_i, reject the sample:
5.       set i = 1 and go to step 2
Forward sampling (example)
Same network: X1 → X2, X1 → X3, {X2,X3} → X4. Evidence: X3 = 0
// generate sample k
1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. If x3 ≠ 0, reject the sample and start again from step 1; otherwise
5. Sample x4 from P(x4|x2,x3)
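The same sketch as before, extended with rejection for the evidence X3 = 0 (again illustrative, reusing the hypothetical CPTs and draw() from the previous sketch):

    def forward_sample_with_evidence():
        # Retry until the sampled value of the evidence variable X3 matches 0.
        while True:
            x1 = draw(P_x1)
            x2 = draw(P_x2[x1])
            x3 = draw(P_x3[x1])
            if x3 != 0:
                continue               # reject: restart the whole sample
            x4 = draw(P_x4[(x2, x3)])
            return (x1, x2, x3, x4)

    samples = [forward_sample_with_evidence() for _ in range(10000)]
    # Samples are now drawn from P(X | X3 = 0); e.g. estimate P(X1 = 1 | X3 = 0).
    print(sum(s[0] == 1 for s in samples) / len(samples))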
Forward sampling: illustration
[Figure: let Y be a subset of evidence nodes s.t. Y = u; samples inconsistent with Y = u are rejected]
Forward sampling - How many samples?
Theorem: let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1-δ, it is enough to have
    T ≥ c / (ε² P(y))    for a suitable constant c
Derived from Chebychev's bound. A related additive-error bound:
    P( ŝ(y) ∈ [P(y)-ε, P(y)+ε] ) ≥ 1 - 2e^{-2Nε²}
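As a quick worked example (numbers chosen for illustration only): with relative error ε = 0.1 and a rare event P(y) = 0.01, the bound demands on the order of T ≥ c / (0.1² · 0.01) = 10,000·c samples, which is why rejection-based estimates of rare evidence get expensive.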
Forward sampling: performance
Advantages:
• P(xi | pa(xi)) is readily available
• Samples are independent!
Drawbacks:
• If evidence E is rare (P(e) is low), then we will reject most of the samples!
• Since P(y) in the bound on T is unknown, we must estimate P(y) from the samples themselves!
• If P(e) is small, T will become very big!
Problem: evidence!
• Forward Sampling
High rejection rate (even though the samples are independent)
• Fix evidence values instead:
Gibbs sampling (MCMC), Likelihood Weighting, Importance Sampling
Sampling algorithms
• Forward Sampling
• Gibbs Sampling (MCMC)
Blocking, Rao-Blackwellised
• Likelihood Weighting
• Importance Sampling
Gibbs Sampling• Markov Chain Monte Carlo method
(Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994)
• Samples are dependent and form a Markov chain
• Sample from P'(X|e) which converges to P(X|e)
• Guaranteed to converge when all P > 0
• Methods to improve convergence:
Blocking, Rao-Blackwellised
Gibbs Sampling (Pearl, 1988)
• A sample t ∈ [1,2,...] is an instantiation of all variables in the network:
    x^t = {X1 = x_1^t, X2 = x_2^t, ..., XN = x_N^t}
• Sampling process
Fix the values of the observed variables e
Instantiate node values in sample x^0 at random
Generate samples x^1, x^2, ..., x^T from P(X|e):
    X1 = x_1^{t+1} ← sampled from P(x1 | x_2^t, x_3^t, ..., x_N^t, e)
    X2 = x_2^{t+1} ← sampled from P(x2 | x_1^{t+1}, x_3^t, ..., x_N^t, e)
    ...
    XN = x_N^{t+1} ← sampled from P(xN | x_1^{t+1}, x_2^{t+1}, ..., x_{N-1}^{t+1}, e)
Compute posteriors from the samples
Ordered Gibbs Sampler
Generate sample x^{t+1} from x^t. In short, for i = 1 to N (process all variables in some order):
    Xi = x_i^{t+1} ← sampled from P(xi | x^t \ x_i, e)
Gibbs Sampling (Pearl, 1988)
Markov blanket:
Given its Markov blanket (parents, children, and their parents), Xi is independent of all other nodes:
    M_i = pa_i ∪ ch_i ∪ { pa_j : Xj ∈ ch_i }
Important:
    P(xi | x^t \ xi) = P(xi | markov_i^t)
    P(xi | x^t \ xi) ∝ P(xi | pa_i) · Π_{Xj ∈ ch_i} P(xj | pa_j)
Ordered Gibbs Sampling Algorithm
Input: X, E. Output: T samples {x^t}
• Fix evidence E, then generate samples from P(X | E):
1. For t = 1 to T (compute samples)
2.   For i = 1 to N (loop through variables)
3.     Sample x_i^t from P(Xi | markov^t \ Xi)
Answering Queries
• Query: P(xi | e) = ?
• Method 1: count the proportion of samples where Xi = xi:
    P̂(Xi = xi) = #samples(Xi = xi) / T
• Method 2: average probability (mixture estimator):
    P̂(Xi = xi) = (1/T) Σ_{t=1..T} P(Xi = xi | markov^t \ Xi)
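A compact Python sketch of the ordered Gibbs sampler on a toy chain X1 → X2 → X3 with evidence X3 = 1 (all CPT numbers invented; a real implementation would read the Markov-blanket conditionals off the network's CPTs). It computes both estimators from the slide above:

    import random

    P1  = [0.5, 0.5]                          # P(X1)
    P21 = {0: [0.8, 0.2], 1: [0.3, 0.7]}      # P(X2 | X1)
    P32 = {0: [0.9, 0.1], 1: [0.4, 0.6]}      # P(X3 | X2)
    e3 = 1                                    # evidence X3 = 1

    def draw(probs):
        return 0 if random.random() < probs[0] else 1

    def normalize(w):
        s = w[0] + w[1]
        return [w[0] / s, w[1] / s]

    x1, x2 = draw([0.5, 0.5]), draw([0.5, 0.5])   # random initial sample x^0
    counts, mixture, T = 0, 0.0, 20000
    for t in range(T):
        # Markov blanket of X1 is {X2}: P(x1 | x2) ∝ P(x1) P(x2 | x1)
        p1 = normalize([P1[v] * P21[v][x2] for v in (0, 1)])
        x1 = draw(p1)
        # Markov blanket of X2 is {X1, X3}: P(x2 | x1, e3) ∝ P(x2|x1) P(e3|x2)
        p2 = normalize([P21[x1][v] * P32[v][e3] for v in (0, 1)])
        x2 = draw(p2)
        counts += (x1 == 1)     # Method 1: histogram (counting) estimator
        mixture += p1[1]        # Method 2: mixture estimator for P(X1=1 | e)
    print(counts / T, mixture / T)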
Gibbs Sampling - example
X = {X1, X2, ..., X9}, E = {X9}
[Figure: the network over X1, ..., X9 with evidence node X9]
Gibbs Sampling - example
Initialize all unobserved variables at random:
X1 = x_1^0, X2 = x_2^0, X3 = x_3^0, X4 = x_4^0, X5 = x_5^0, X6 = x_6^0, X7 = x_7^0, X8 = x_8^0
Gibbs Sampling - example
Sample X1 from P(X1 | x_2^0, ..., x_8^0, x9), E = {X9}. Using the Markov blanket of X1:
    P(X1=0 | x_2^0, x_3^0, x9) = α P(X1=0) P(x_2^0 | X1=0) P(x_3^0 | X1=0)
    P(X1=1 | x_2^0, x_3^0, x9) = α P(X1=1) P(x_2^0 | X1=1) P(x_3^0 | X1=1)
Gibbs Sampling - example
Sample X2 from P(X2 | x_1^1, x_3^0, ..., x_8^0, x9), E = {X9}
The Markov blanket of X2 is {X1, X3, X4, X5}
Gibbs Sampling: Illustration
Gibbs Sampling: Burn-In• We want to sample from P(X | E)• But ... the starting point is random• Solution: throw away the first K samples• Known as "Burn-In"• What is K? Hard to tell. Use intuition.• Alternative: initialize with values sampled from an approximation of P(x|e), for example by running IBP first
Gibbs Sampling: Convergence• Converges to the stationary distribution π*:
    π* = π* P, where P is the transition kernel with p_ij = P(Xi → Xj)
• Guaranteed to converge iff the chain is:
irreducible, aperiodic, ergodic (∀i,j: p_ij > 0)
Gibbs Sampling: Performance• Advantage:
guaranteed to converge to P(X|E), as long as all probabilities are positive
• Disadvantage:
convergence may be slow
• Problems: samples are dependent! Statistical variance is too big in high-dimensional problems
Gibbs: Speeding Convergence
Objectives:
1. Reduce dependence between samples (autocorrelation)
Skip samples
Randomize the variable sampling order
2. Reduce variance
Blocking Gibbs Sampling
Rao-Blackwellisation
Skipping Samples• Pick only every k-th sample (Geyer, 1992)
Can reduce dependence between samples! Increases variance! Wastes samples!
Randomized Variable Order• Random Scan Gibbs Sampler
Pick each next variable Xi for update at random with probability pi, Σi pi = 1
• In the simplest case, the pi are distributed uniformly. In some instances, this reduces variance (MacEachern, Peruggia, 1999)
Blocking• Sample several variables together, as a block• Example: given three variables X, Y, Z, with domains of size 2, group Y and Z together to form a variable W = {Y,Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:
    x^{t+1} ← P(x | y^t, z^t) = P(x | w^t)
    w^{t+1} = (y^{t+1}, z^{t+1}) ← P(w | x^{t+1})
+ Can improve convergence greatly when two variables are strongly correlated!
- The domain of the block variable grows exponentially with the number of variables in a block!
Rao-Blackwellisation• Do not sample all variables!• Sample a subset!• Example: given three variables X, Y, Z, sample only X and Y, summing out Z. Given sample (x^t, y^t), compute the next sample:
    x^{t+1} ← P(x | y^t)
    y^{t+1} ← P(y | x^{t+1})
Rao-Blackwell Theorem
Bottom line: reducing the number of variables in a sample reduces variance!
Blocking vs. Rao-Blackwellisation
[Figure: a three-variable network over X, Y, Z]
• Standard Gibbs: P(x|y,z), P(y|x,z), P(z|x,y)  (1)
• Blocking: P(x|y,z), P(y,z|x)  (2)
• Rao-Blackwellised: P(x|y), P(y|x)  (3)
Var3 < Var2 < Var1 (Liu, Wong, Kong, 1994)
Rao-Blackwellised Gibbs: Cutset Sampling
• Select C ⊆ X (possibly a cycle-cutset), |C| = m
• Fix evidence E
• Initialize nodes with random values: for i = 1 to m, Ci = c_i^0
• For t = 1 to T, generate samples c^{t+1} = {C1 = c_1^{t+1}, ..., Cm = c_m^{t+1}}:
For i = 1 to m: Ci = c_i^{t+1} ← P(ci | c_1^{t+1}, ..., c_{i-1}^{t+1}, c_{i+1}^t, ..., c_m^t, e)
Cutset Sampling - generating samples
Generate sample c^{t+1} from c^t:
    C1 = c_1^{t+1} ← sampled from P(c1 | c_2^t, c_3^t, ..., c_m^t, e)
    C2 = c_2^{t+1} ← sampled from P(c2 | c_1^{t+1}, c_3^t, ..., c_m^t, e)
    ...
    Cm = c_m^{t+1} ← sampled from P(cm | c_1^{t+1}, c_2^{t+1}, ..., c_{m-1}^{t+1}, e)
In short: Ci = c_i^{t+1} ← sampled from P(ci | c^t \ ci, e)
Cutset Sampling• How to choose C?
Special case: C is a cycle-cutset, O(N)
General case: apply Bucket Tree Elimination (BTE), O(exp(w)), where w is the induced width of the network when the nodes in C are observed
Pick C wisely so as to minimize w => notion of w-cutset
w-cutset Sampling• C = w-cutset of the network: a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w
• Complexity of exact inference is then bounded by w!
• Cycle-cutset is a special case!
Cutset Sampling - Answering Queries
• Query: ci ∈ C, P(ci | e) = ? Same mixture estimator as Gibbs (special case of w-cutset):
    P̂(ci | e) = (1/T) Σ_{t=1..T} P(ci | c^t \ ci, e)   - computed while generating sample t
• Query: P(xi | e) = ?
    P̂(xi | e) = (1/T) Σ_{t=1..T} P(xi | c^t, e)   - computed after generating sample t (easy because C is a cutset)
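A high-level Python sketch of the cutset-sampling loop. The helper exact_conditional is a hypothetical stand-in for exact inference (BTE/join-tree elimination) over the non-cutset variables, not a real library call:

    import random

    def cutset_sample(cutset, T, exact_conditional, domains, evidence):
        # exact_conditional(Ci, others, evidence) is assumed to return the
        # exact distribution P(Ci | assignment of the other cutset vars, e),
        # e.g. computed by bucket/join-tree elimination over the rest.
        state = {c: random.choice(domains[c]) for c in cutset}   # random c^0
        marginals = {c: {v: 0.0 for v in domains[c]} for c in cutset}
        for t in range(T):
            for c in cutset:
                others = {k: v for k, v in state.items() if k != c}
                dist = exact_conditional(c, others, evidence)    # P(c | c^t \ c, e)
                for v in domains[c]:
                    marginals[c][v] += dist[v] / T   # mixture estimator
                r, cum = random.random(), 0.0        # draw the new value
                for v in domains[c]:
                    cum += dist[v]
                    if r < cum:
                        state[c] = v
                        break
        return marginals    # estimates of P(ci | e) for each cutset variable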
Cutset Sampling Example
[Figure: network over X1, ..., X9 with evidence E = {X9 = x9}; cutset C = {X2, X5}, initialized as c^0 = {x_2^0, x_5^0}]
Cutset Sampling Example
Sample a new value for X2 (given c^0 = {x_2^0, x_5^0}):
    x_2^1 ← P(x2 | x_5^0, x9) = BTE(x2', x_5^0, x9) / Σ_{x2''} BTE(x2'', x_5^0, x9)
where each BTE(·) term is computed exactly by Bucket Tree Elimination over the rest of the network
Cutset Sampling Example
Sample a new value for X5 (given x_2^1):
    x_5^1 ← P(x5 | x_2^1, x9) = BTE(x5', x_2^1, x9) / Σ_{x5''} BTE(x5'', x_2^1, x9)
giving the new sample c^1 = {x_2^1, x_5^1}
Cutset Sampling Example
Query P(x2 | e) for the sampled node X2, over three samples:
    Sample 1: x_2^1 ← P(x2 | x_5^0, x9)
    Sample 2: x_2^2 ← P(x2 | x_5^1, x9)
    Sample 3: x_2^3 ← P(x2 | x_5^2, x9)
    P̂(x2 | x9) = (1/3) [ P(x2 | x_5^0, x9) + P(x2 | x_5^1, x9) + P(x2 | x_5^2, x9) ]
Cutset Sampling Example
Query P(x3 | e) for the non-sampled node X3, using the samples c^1 = {x_2^1, x_5^1}, c^2 = {x_2^2, x_5^2}, c^3 = {x_2^3, x_5^3}:
    P̂(x3 | x9) = (1/3) [ P(x3 | x_2^1, x_5^1, x9) + P(x3 | x_2^2, x_5^2, x9) + P(x3 | x_2^3, x_5^3, x9) ]
CPCS179 Test Results
[Figure: MSE vs. #samples (left, 100-4000) and vs. time (right, 0-80 seconds) for Cutset and Gibbs sampling; non-ergodic network (1 deterministic CPT entry), |X| = 179, |C| = 8, 2 ≤ |D(Xi)| ≤ 4, |E| = 35; exact time = 122 sec using loop-cutset conditioning]
CPCS360b Test Results
[Figure: MSE vs. #samples (left, 0-1000) and vs. time (right, 1-60 seconds) for Cutset and Gibbs sampling; ergodic network, |X| = 360, |D(Xi)| = 2, |C| = 21, |E| = 36; exact time > 60 min using cutset conditioning; exact values obtained via Bucket Elimination]
Sampling algorithms
• Forward Sampling
• Gibbs Sampling (MCMC)
Blocking, Rao-Blackwellised
• Likelihood Weighting
• Importance Sampling
Likelihood Weighting
(Fung and Chang, 1990; Shachter and Peot, 1990)
• "Clamping" evidence + forward sampling + weighting samples by the evidence likelihood
Works well for likely evidence!
Likelihood Weighting
[Figure: a chain of nodes with evidence nodes e interspersed; sample in topological order over X!]
    x_i ← P(Xi | pa_i);  P(Xi | pa_i) is a look-up in the CPT!
Likelihood Weighting Outline
    w ← 1
    ForEach Xi (in topological order) Do
        If Xi ∈ E (evidence Xi = ei), set Xi = ei and w ← w · P(ei | pa_i)
        Else sample Xi = xi from P(Xi | pa_i)
    EndFor
Likelihood Weighting
Estimate posterior marginals P(Xi | e):
    P̂(xi | e) = P̂(xi, e) / P̂(e) = [ Σ_{t=1..T} w^(t) δ(xi, x^(t)) ] / [ Σ_{t=1..T} w^(t) ]
where δ(xi, x^(t)) = 1 if sample x^(t) contains xi, and 0 otherwise
Likelihood Weighting
• Converges to the exact posterior marginals• Generates samples fast• The sampling distribution is close to the prior (especially if E ⊆ leaf nodes)• Increased sampling variance
Convergence may be slow; many samples with P(x^(t)) = 0 are effectively rejected
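A runnable Python sketch of likelihood weighting on a tiny chain X1 → X2 with evidence X2 = 1 (CPT numbers invented; the normalized-weight ratio implements the posterior estimate above):

    import random

    P1  = [0.6, 0.4]                           # P(X1)
    P21 = {0: [0.7, 0.3], 1: [0.2, 0.8]}       # P(X2 | X1)
    e2 = 1                                     # evidence X2 = 1

    def draw(probs):
        return 0 if random.random() < probs[0] else 1

    num = [0.0, 0.0]                           # weighted counts for X1 = 0, 1
    den = 0.0                                  # sum of weights, estimates P(e)
    for t in range(100000):
        x1 = draw(P1)                          # sample the non-evidence node
        w = P21[x1][e2]                        # clamp X2 = 1, weight by P(e2|x1)
        num[x1] += w
        den += w
    print([n / den for n in num])              # estimate of P(X1 | X2 = 1)
    # Exact check: P(X1=1 | X2=1) = 0.4*0.8 / (0.6*0.3 + 0.4*0.8) = 0.64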
Sampling algorithms
• Forward Sampling
• Gibbs Sampling (MCMC)
Blocking, Rao-Blackwellised
• Likelihood Weighting
• Importance Sampling
Importance Sampling Idea• In general, it is hard to sample from the target distribution P(X|E)
• Generate samples from a sampling (proposal) distribution Q(X)
• Weigh each sample against P(X|E):
    I = ∫ f(x) P(x) dx = ∫ f(x) [P(x) / Q(x)] Q(x) dx
Importance Sampling Theory
Let Z = X \ E. Then
    P(E=e) = Σ_{X\E} P(X\E, E=e) = Σ_{X\E} Π_{i=1..n} P(Xi | pa_i)|_{E=e}
which simplifies to
    P(E=e) = Σ_Z P(Z, e)
Importance Sampling Theory
• Given a proposal distribution Q such that P(Z=z, e) > 0 => Q(Z=z) > 0:
    P(E=e) = Σ_{z∈Z} P(Z=z, e) = Σ_{z∈Z} [ P(Z=z, e) / Q(Z=z) ] Q(Z=z)
By the definition of expected value, E_Q[g(Z)] = Σ_{z∈Z} g(z) Q(z), so
    P(E=e) = E_Q[ P(Z, e) / Q(Z) ] = E_Q[ w(Z) ]
where w(Z=z) = P(Z=z, e) / Q(Z=z) is called the importance weight
Importance Sampling Theory
Given a set of samples (z^1, ..., z^N) drawn from Q:
    P̂(E=e) = (1/N) Σ_{i=1..N} P(Z=z^i, e) / Q(Z=z^i) = (1/N) Σ_{i=1..N} w(z^i)
As N → ∞, P̂(E=e) → P(E=e)
Underlying principle: approximate an average over a set of numbers by an average over a set of sampled numbers
Importance Sampling (Informally)• Express the problem as computing the average over a set of real numbers• Sample a subset of the numbers• Approximate the true average by the sample average
True average:
• Average of (0.11, 0.24, 0.55, 0.77, 0.88, 0.99) = 0.59
Sample average over 2 samples:
• Average of (0.24, 0.77) = 0.505
How to generate samples from Q
• Express Q in product form: Q(Z) = Q(Z1) Q(Z2|Z1) ... Q(Zn|Z1,...,Zn-1)
• Sample along the order Z1, ..., Zn
• Example:
Q(Z1) = (0.2, 0.8)
Q(Z2|Z1) = (0.2, 0.8, 0.1, 0.9)
Q(Z3|Z1,Z2) = Q(Z3|Z1) = (0.5, 0.5, 0.3, 0.7)
Reminder: P̂(E=e) = (1/N) Σ_{i=1..N} P(Z=z^i, e) / Q(Z=z^i)
How to sample from Q?
• Each sample Z = z:
Sample Z1 = z1 from Q(Z1)
Sample Z2 = z2 from Q(Z2|Z1=z1)
Sample Z3 = z3 from Q(Z3|Z1=z1)
• Generate N such samples (z^1, ..., z^N) and estimate
    P̂(E=e) = (1/N) Σ_{i=1..N} P(Z=z^i, e) / Q(Z=z^i) = (1/N) Σ_{i=1..N} w(z^i)
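A small Python sketch of this recipe, using the Q tables from the example above. The function P_joint_with_evidence is a hypothetical stand-in for the CPT product P(Z=z, e) of a real network; its table entries are invented and sum to 0.50, so the estimate should converge to 0.50:

    import random

    Q1 = [0.2, 0.8]                                  # Q(Z1)
    Q2 = {0: [0.2, 0.8], 1: [0.1, 0.9]}              # Q(Z2 | Z1)
    Q3 = {0: [0.5, 0.5], 1: [0.3, 0.7]}              # Q(Z3 | Z1), indep. of Z2

    def draw(probs):
        return 0 if random.random() < probs[0] else 1

    def P_joint_with_evidence(z1, z2, z3):
        # Hypothetical stand-in for P(Z=z, e); in a real network this would
        # multiply the clamped CPT entries along the assignment z.
        table = {(0,0,0): .02, (0,0,1): .03, (0,1,0): .05, (0,1,1): .10,
                 (1,0,0): .04, (1,0,1): .06, (1,1,0): .08, (1,1,1): .12}
        return table[(z1, z2, z3)]

    N, acc = 100000, 0.0
    for _ in range(N):
        z1 = draw(Q1)
        z2 = draw(Q2[z1])
        z3 = draw(Q3[z1])
        q = Q1[z1] * Q2[z1][z2] * Q3[z1][z3]          # Q(Z = z)
        acc += P_joint_with_evidence(z1, z2, z3) / q  # importance weight w(z)
    print(acc / N)                                    # estimate of P(E = e)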
Likelihood weighting• Q = prior distribution = the CPTs of the Bayesian network
Likelihood weighting example
Network: Smoking (S) → {lung Cancer (C), Bronchitis (B)}; X-ray (X) with P(X|C,S); Dyspnoea (D) with P(D|C,B)
    P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
Query: P(X=1, B=0) = ?, where 1 = true and 0 = false:
    P(X=1, B=0) = Σ_{S,C,D} P(S) P(C|S) P(B=0|S) P(X=1|C,S) P(D|C,B=0)
Likelihood weighting example
Q = prior, restricted to the unobserved variables Z = {S, C, D}:
    Q(S, C, D) = Q(S) Q(C|S) Q(D|C,B=0) = P(S) P(C|S) P(D|C,B=0)
Sampling: sample S=s from P(S), then C=c from P(C|S=s), then D=d from P(D|C=c,B=0)
Importance weight:
    w(Z=z) = P(Z=z, e) / Q(Z=z)
           = [ P(s) P(c|s) P(B=0|s) P(X=1|c,s) P(d|c,B=0) ] / [ P(s) P(c|s) P(d|c,B=0) ]
           = P(B=0|s) · P(X=1|c,s)
How to solve belief updating?
    P(Xi=xi | e) = P(Xi=xi, e) / P(e)
Estimate the numerator (evidence is Xi=xi, e) and the denominator (evidence is e) by importance sampling:
    P̂(Xi=xi | e) = [ Σ_{j=1..N} δ(xi, z^j) w(z^j) ] / [ Σ_{j=1..N} w(z^j) ]
where δ(xi, z^j) = 1 iff sample z^j contains Xi = xi, and 0 otherwise
Summary