Reasoning Under Uncertainty

DESCRIPTION
Reasoning Under Uncertainty. Radu Marinescu, 4C @ University College Cork. Why uncertainty? Uncertainty in medical diagnosis: diseases produce symptoms; in diagnosis, observed symptoms => disease ID. Uncertainties: symptoms may not occur; symptoms may not be reported.

TRANSCRIPT
Reasoning Under Uncertainty
Radu Marinescu4C @ University College Cork
Why uncertainty?
• Uncertainty in medical diagnosis
– Diseases produce symptoms
– In diagnosis, observed symptoms => disease ID
– Uncertainties:
• Symptoms may not occur
• Symptoms may not be reported
• Diagnostic tests are not perfect (false positives, false negatives)
• How do we estimate confidence? P(disease | symptoms, tests) = ?
Why uncertainty?
• Uncertainty in medical decision-making
– Physicians and patients must decide on treatments
– Treatments may not be successful
– Treatments may have unpleasant side effects
• Choosing treatments: weigh risks of adverse outcomes
• People are BAD at reasoning intuitively about probabilities; provide systematic analysis
Outline• Probabilistic modeling with joint distributions• Conditional independence and factorization• Belief (or Bayesian) networks
Example networks and software• Inference in belief networks
Exact inference• Variable elimination, join-tree clustering, AND/OR search
Approximate inference• Mini-clustering, belief propagation, sampling
Bibliography• Judea Pearl. “Probabilistic reasoning in intelligent systems”, 1988
• Stuart Russell & Peter Norvig. “Artificial Intelligence. A Modern Approach”, 2002 (Ch 13-17)
• Kevin Murphy. "A Brief Introduction to Graphical Models and Bayesian Networks"http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
• Rina Dechter. "Bucket Elimination: A Unifying Framework for Probabilistic Inference"http://www.ics.uci.edu/~csp/R48a.ps
• Rina Dechter. "Mini-Buckets: A General Scheme for Approximating Inference"http://www.ics.uci.edu/~csp/r62a.pdf
• Rina Dechter & Robert Mateescu. "AND/OR Search Spaces for Graphical Models".http://www.ics.uci.edu/~csp/r126.pdf
Reasoning under uncertainty• A problem domain is modeled by a list of (discrete)
random variables: X1, X2, …, Xn
• Knowledge about the problem is represented by a joint probability distribution: P(X1, X2, …, Xn)
Example• Alarm (Pearl88)
Story: In Los Angeles, burglary and earthquake are common. They both can trigger an alarm. In case of alarm, two neighbors John and Mary may call 911
Problem: estimate the probability of a burglary based on who has or has not called
Variables: • Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)
Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M)
Joint probability distributionDefines probabilities for all possible value assignments to the variables in the set
Inference with joint probability distribution
• What is the probability of burglary given that Mary called, P(B=y | M=y)?
• Compute the marginal probability:
P(B, M) = Σ_{E,A,J} P(B, E, A, J, M)

B M P(B,M)
y y 0.000115
y n 0.000075
n y 0.00015
n n 0.99971

• Compute the answer (reasoning by conditioning):
P(B=y | M=y) = P(B=y, M=y) / P(M=y) = 0.000115 / (0.000115 + 0.00015) ≈ 0.43
Advantages
• Probability theory is well-established and well understood
• In theory, one can perform arbitrary inference among the variables given a joint probability, because the joint contains information about all aspects of the relationships among the variables
– Diagnostic inference: from effects to causes, e.g., P(B=y | M=y)
– Predictive inference: from causes to effects, e.g., P(M=y | B=y)
– Combining evidence: P(B=y | J=y, M=y, E=n)
• All inference is sanctioned by probability theory and hence has clear semantics
Difficulty: complexity in model construction and inference
• In the Alarm example: 32 numbers (parameters) needed
– Quite unnatural to assess, e.g., P(B=y, E=y, A=y, J=y, M=y)
– Computing P(B=y | M=y) takes 29 additions (see the enumeration sketch below)
• In general, P(X1, X2, …, Xn) needs at least 2^n numbers to specify the joint probability distribution
– Knowledge acquisition difficult (complex, unnatural)
– Exponential storage and inference
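To make the enumeration cost concrete, here is a minimal Python sketch (not from the slides) of reasoning by conditioning over a full joint table; the dictionary `joint`, keyed by (B,E,A,J,M) value tuples, is an assumed input.

from itertools import product

def enumerate_query(joint, query_var, query_val, evidence):
    """Return P(query_var = query_val | evidence) by summing joint entries."""
    names = ['B', 'E', 'A', 'J', 'M']
    num = den = 0.0
    for assignment in product('yn', repeat=5):
        row = dict(zip(names, assignment))
        if any(row[v] != val for v, val in evidence.items()):
            continue                       # inconsistent with the evidence
        p = joint[assignment]
        den += p                           # accumulates P(evidence)
        if row[query_var] == query_val:
            num += p                       # accumulates P(query, evidence)
    return num / den

# e.g., enumerate_query(joint, 'B', 'y', {'M': 'y'}) -> about 0.43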
Outline• Probabilistic modeling with joint distributions• Conditional independence and factorization• Belief networks
Example networks and software• Inference in belief networks
Exact inference Approximate inference
• Miscellaneous Mixed networks, influence diagrams, etc.
Chain rule and factorization
• Overcome the problem of exponential size by exploiting conditional independencies
• The chain rule of probability:
P(X1, X2) = P(X1) P(X2|X1)
P(X1, X2, X3) = P(X1) P(X2|X1) P(X3|X1,X2)
…
P(X1, X2, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi-1)
• No gains yet: the number of parameters required by the factors is still O(2^n)
Conditional independence• A random variable X is conditionally
independent of a set of random variables Y given a set of random variables Z if P(X | Y, Z) = P(X | Z)
• Intuitively: Y tells us nothing more about X than we know by
knowing Z As far as X is concerned, we can ignore Y if we
know Z
Conditional independence
• About P(Xi | X1, …, Xi-1):
Domain knowledge usually allows one to identify a subset pa(Xi) ⊆ {X1, …, Xi-1} such that
• Given pa(Xi), Xi is independent of all variables in {X1, …, Xi-1} \ pa(Xi), i.e.
P(Xi | X1, …, Xi-1) = P(Xi | pa(Xi))
• Then
• Joint distribution factorized!• The number of parameters might have been substantially
reduced
P(X1, X2, …, Xn) = ∏_{i=1}^{n} P(Xi | pa(Xi))
Example continued
• pa(B) = {}, pa(E) = {}, pa(A) = {B,E}, pa(J) = {A}, pa(M) = {A}• Conditional probability tables (CPT)
P(B, E, A, J, M) = P(B) P(E|B) P(A|B,E) P(J|B,E,A) P(M|B,E,A,J)
                 = P(B) P(E) P(A|B,E) P(J|A) P(M|A)
B P(B)
Y .01
N .99
E P(E)
Y .02
N .98
M A P(M|A)
y y .9
n y .1
y n .05
n n .95

J A P(J|A)
y y .7
n y .3
y n .01
n n .99
A B E P(A|B,E)
Y Y Y .95
N Y Y .05
Y Y N .94
N Y N .06
Y N Y .29
N N Y .71
Y N N .001
N N N .999
Example continued• Model size reduced from 32 to 2+2+4+4+8=20• Model construction easier
Fewer parameters to assess Parameter more natural to assess
• e.g., P(B=y), P(J=y | A=y), P(A=y | B=y, E=y), etc.
• Inference easier. Will see this later.
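A sketch of the factored model, using the CPT values from these slides; with binary variables the five tables hold 2+2+4+4+8 = 20 entries as counted above (here each table stores the "=y" probability and derives the complement).

# The Alarm network as five CPTs (values from the slides above).
P_B = {'y': .01, 'n': .99}                     # P(B)
P_E = {'y': .02, 'n': .98}                     # P(E)
P_Ay = {('y','y'): .95, ('y','n'): .94,        # P(A=y | B, E)
        ('n','y'): .29, ('n','n'): .001}
P_Jy = {'y': .7, 'n': .01}                     # P(J=y | A)
P_My = {'y': .9, 'n': .05}                     # P(M=y | A)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)."""
    pa = P_Ay[(b, e)] if a == 'y' else 1 - P_Ay[(b, e)]
    pj = P_Jy[a] if j == 'y' else 1 - P_Jy[a]
    pm = P_My[a] if m == 'y' else 1 - P_My[a]
    return P_B[b] * P_E[e] * pa * pj * pm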
Outline• Probabilistic modeling with joint distributions• Conditional Independence and factorization• Belief networks
Example networks and software• Inference in belief networks
Exact inference Approximate inference
From factorization to belief networks• Graphically represent the conditional independency
relationships: Construct a directed graph by drawing an arc from Xj to Xi iff Xj
pa(Xi)
Also attach the CPT P(Xi | pa(Xi)) to node Xi
B E
A
J M
P(B) P(E)
P(A|B,E)
P(J|A) P(M|A)
Formal definition• A belief network is:
A directed acyclic graph (DAG), where:• Each node represents a random variable• And is associated with the conditional probability of the node given
its parents Represents the joint probability distribution:
A variable is conditionally independent of its non-descendants given its parents
P(X1, X2, …, Xn) = ∏_{i=1}^{n} P(Xi | pa(Xi))
Independences in belief networks• 3 basic independence structures
Burglary
Alarm
JohnCalls
1: chain
Burglary
Alarm
Earthquake
2: common descendants
MaryCalls
Alarm
JohnCalls
3: common ancestors
Independences in belief networks
Burglary
Alarm
JohnCalls
1. JohnCalls is independent of Burglary given Alarm
P(J | A, B) = P(J | A)
P(J, B | A) = P(J | A) P(B | A)
Independences in belief networks
Burglary
Alarm
Earthquake
2. Burglary is independent of Earthquake not knowing Alarm.Burglary and Earthquake become dependent given Alarm!!
P(B, E) = P(B) P(E)
P(B, E | A) ≠ P(B | A) P(E | A)
Independences in belief networks
MaryCalls
Alarm
JohnCalls
3. MaryCalls is independent of JohnCalls given Alarm.
P(J | A, M) = P(J | A)
P(J, M | A) = P(J | A) P(M | A)
Independences in belief networks• BN models many conditional independence relations relating distant
variables and sets, which are defined in terms of the graphical criterion called d-separation
• d-separation = conditional independence Let X, Y and Z be three sets of nodes If X and Y are d-separated by Z, then X and Y are conditionally independent given
Z: P(X|Y, Z) = P(X|Z)
• d-separation in the graph: X is d-separated from Y given Z if every undirected path between them is blocked
• Path blocking 3 cases that expand on three basic independence structures
Undirected path blocking
A path is blocked by C if it contains:
• a “linear” substructure X → Z → Y with Z in C
• a “wedge” substructure X ← Z → Y (common ancestor) with Z in C
• a “vee” substructure X → Z ← Y (common descendant) with neither Z nor any of its descendants in C
Example
[Figure: DAG with arcs 1 → 2, 1 → 3, 2 → 4, 3 → 4, 4 → 5]
X = {2} and Y = {3} are d-separated by Z = {1}
• path 2 ← 1 → 3 is blocked by 1 ∈ Z
• path 2 → 4 ← 3 is blocked because 4 and all its descendants are outside Z
so P(X, Y | Z) = P(X | Z) P(Y | Z)
X = {2} and Y = {3} are not d-separated by Z = {1, 5}
• path 2 ← 1 → 3 is blocked by 1 ∈ Z
• path 2 → 4 ← 3 is activated because 5 (which is a descendant of 4) is in Z
• learning the value of consequence 5 renders 5’s causes 2 and 3 dependent
so P(X, Y | Z) ≠ P(X | Z) P(Y | Z)
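A sketch of a d-separation test via the classical ancestral-moral-graph criterion, which is equivalent to the path-blocking rules above: X and Y are d-separated by Z iff they are disconnected in the moralized ancestral graph of X ∪ Y ∪ Z after removing Z. Here `dag` maps each node to its list of parents — an assumed, minimal representation.

def ancestors(dag, nodes):
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, X, Y, Z):
    keep = ancestors(dag, X | Y | Z)          # ancestral subgraph
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in dag.get(v, []) if p in keep]
        for p in ps:                          # keep parent-child edges
            adj[v].add(p)
            adj[p].add(v)
        for p in ps:                          # moralize: marry co-parents
            for q in ps:
                if p != q:
                    adj[p].add(q)
    reach, stack = set(X), list(X)            # reachability avoiding Z
    while stack:
        for w in adj[stack.pop()] - Z:
            if w not in reach:
                reach.add(w)
                stack.append(w)
    return reach.isdisjoint(Y)

On the example above, with dag = {2: [1], 3: [1], 4: [2, 3], 5: [4]}, d_separated(dag, {2}, {3}, {1}) is True, while d_separated(dag, {2}, {3}, {1, 5}) is False because moralization marries 2 and 3, the co-parents of node 4.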
I-mapness• Given a probability distribution P on a set of
variables {X1, …, Xn}, a belief network B representing P is a minimal I-map (Pearl88) I-mapness: every d-separation condition displayed
in B corresponds to a valid conditional independence relationship in P
Minimal: none of the arrows in B can be deleted without destroying its I-mapness
Full joint distribution in BN
[Figure: the Alarm network B → A ← E, A → J, A → M]
Rewrite the full joint probability using the product rule:
P(B,E,A,J,M) = P(J|B,E,A,M) P(B,E,A,M)
            = P(J|A) P(M|B,E,A) P(B,E,A)
            = P(J|A) P(M|A) P(A|B,E) P(B,E)
            = P(J|A) P(M|A) P(A|B,E) P(B) P(E)
Example network
[Figure: the “alarm” network, a DAG over 37 variables including PCWP, CO, HRBP, HREKG, HRSAT, ERRCAUTER, HRHISTORY, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, VENTTUBE, DISCONNECT, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINVOL, PVSAT, FIO2, PRESS, INSUFFANESTH, TPR, LVFAILURE, ERRLOWOUTPUT, STROKEVOLUME, LVEDVOLUME, HYPOVOLEMIA, CVP, BP]
The “alarm” network: Monitoring Intensive-Care Patients
37 variables, 509 parameters (instead of 2^37)
Software• GeNIe (University of Pittsburgh) - free
http://genie.sis.pitt.edu• SamIam (UCLA) - free
http://reasoning.cs.ucla.edu/SamIam/• Hugin - commercial
http://www.hugin.com• Netica - commercial
http://www.norsys.com• UCI Lab – free but no GUI
http://graphmod.ics.uci.edu/
GeNIe screenshot
Applications• Belief networks are used in:
Genetic linkage analysis Speech recognition Medical diagnosis Probabilistic error correcting coding Monitoring and diagnosis in distributed systems Troubleshooting (Microsoft) …
Outline• Probabilistic modeling with joint distributions• Conditional independence and factorization• Belief networks• Inference in belief networks
Exact inference Approximate inference
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search (tree, graph)
Belief updating
Smoking
BronchitisLung cancer
X-ray Dyspnoea
P(Lung cancer = yes | Smoking = no, Dyspnoea = yes) ?
Probabilistic inference tasks• Belief updating
• Maximum probable explanation (MPE)
• Maximum a posteriori hypothesis (MAP)
BEL(Xi) = P(Xi = xi | evidence)
x* = argmax_x P(x, e)
(a1*, …, ak*) = argmax_a Σ_{x ∈ X\A} P(x, e)
Belief updating: P(X|evidence) = ?
A
B C
ED
P(A)
P(C|A)P(B|A)
P(E|B,C)
P(D|A,B)
P(A|E=0) ∝ P(A, E=0) =
∑E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) =
P(A) ∑E=0 ∑D ∑C P(C|A) ∑B P(B|A) P(D|A,B) P(E|B,C)
The innermost sum defines λB(A,D,C,E) — this is Variable Elimination
Bucket elimination
A
B C
ED
P(A)
P(B|A)
P(E|B,C)
P(D|A,B)
A
B C
ED
Moralize (“marry parents”)
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
Ordering: A, E, D, C, B
P(C|A)
The bucket operation
ELIMINATION: multiply (*) and sum (∑)
bucket(B): { P(E|B,C), P(D|A,B), P(B|A) }
λB(A,C,D,E) = ∑B P(B|A)*P(D|A,B)*P(E|B,C)
OBSERVED BUCKET:
bucket(B): { P(E|B,C), P(D|A,B), P(B|A), B=1 }
λB(A) = P(B=1|A) λB(A,D) = P(D|A,B=1)
λB(E,C) = P(E|B=1,C)
[Tables: multiplying two functions; summing out a variable]
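A sketch of the two bucket operations, assuming a factor is a pair (scope, table) with scope a tuple of variable names and table a dict from value tuples to numbers:

from itertools import product

def combine(factors, domains):
    """Multiply a set of factors into one factor over the union of scopes."""
    scope = tuple(sorted({v for s, _ in factors for v in s}))
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        asg = dict(zip(scope, vals))
        p = 1.0
        for s, t in factors:
            p *= t[tuple(asg[v] for v in s)]
        table[vals] = p
    return scope, table

def sum_out(factor, var):
    """Eliminate `var` by summation, producing a smaller factor."""
    scope, table = factor
    i = scope.index(var)
    out = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i+1:]
        out[key] = out.get(key, 0.0) + p
    return scope[:i] + scope[i+1:], out

# e.g., sum_out(combine([pE, pD, pB], domains), 'B') yields λB(A,C,D,E),
# assuming pE, pD, pB hold P(E|B,C), P(D|A,B), P(B|A) in this representation.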
Bucket elimination
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
∑∏ Elimination operator
λB(A,D,C,E)
λC(A,D,E)
λD(A,E)
λE(A)
P(A,E=0)
B
C
D
E
A
w* = 4“induced width”(max clique size)
Induced graph
B
C
D
E
A
A
B C
ED
P(A)
P(B|A)
P(E|B,C)
P(D|A,B)
P(C|A)
Induced width of the ordering, w*(d): the maximum width over all nodes in the induced graph along d
A
B C
ED
Complexity of elimination: O(n ∙ exp(w*(d)))
w*(d) – induced width of the moral graph along ordering d
A
B C
ED
“Moral” graph
B
C
D
E
A
w*(d1) = 4
E
D
C
B
A
w*(d2) = 2
Finding small induced-width orderings
• NP-complete
• A tree has induced width 1
• Greedy algorithms (a min-fill sketch follows):
– Min-width
– Min induced-width
– Max-cardinality
– Min-fill (often considered the best)
– Anytime min-width (via Branch-and-Bound)
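A sketch of the greedy min-fill heuristic; `adj` is an undirected adjacency dict of the moral graph. The returned order lists variables in the sequence they are eliminated, and the width is the induced width along that elimination sequence.

def min_fill_ordering(adj):
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order, width = [], 0
    while adj:
        def fill(v):  # number of fill edges eliminating v would add
            ns = list(adj[v])
            return sum(1 for i, a in enumerate(ns) for b in ns[i+1:]
                       if b not in adj[a])
        v = min(adj, key=fill)
        width = max(width, len(adj[v]))           # width of v at elimination
        ns = list(adj[v])
        for i, a in enumerate(ns):                # connect v's neighbors
            for b in ns[i+1:]:
                adj[a].add(b)
                adj[b].add(a)
        for a in ns:
            adj[a].discard(v)
        del adj[v]
        order.append(v)
    return order, width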
MPE: Most Probable Explanation
Smoking
BronchitisLung Cancer
X-ray Dyspnoea
(0, b’, c’, x’, 1) = argmax_{B,C,X} P(S=0, B, C, X, D=1)
P(S=0, B, C, X, D=1) = P(S=0) P(C|S=0) P(B|S=0) P(X|C) P(D=1|C,B)
Applications• Probabilistic decoding
A stream of bits is transmitted across a noisy channel and the problem is to recover the transmitted stream given the observed output and parity check bits
[Figure: a coding network. Top layer: transmitted bits x0 … x4 and parity check bits u0 … u4; bottom layer: received bits y0x … y4x and received parity check bits y0u … y4u (all observed)]
Applications• Medical diagnosis
Given some observed symptoms, determine the most likely subset of diseases that may explain the symptoms
[Figure: a two-layer diagnosis network with diseases Disease1 … Disease7 as parents of symptoms Symptom1 … Symptom6]
Applications• Genetic linkage analysis
Given the genotype information of a pedigree, infer the maximum likelihood haplotype configuration (maternal and paternal) of the unobserved individuals
[Figure: a pedigree with genotyped parents 1 and 2 and child 3, their haplotypes, and the corresponding belief network fragment over two loci with haplotype variables L, genotype variables X, and selector variables S]
(Fishelson & Geiger, 2002)
Bucket elimination for MPE
A
B C
ED
P(A)
P(C|A)P(B|A)
P(E|B,C)
P(D|A,B)
MPE =
maxA,E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) =
maxA P(A) maxE=0 maxD maxC P(C|A) maxB P(B|A) P(D|A,B) P(E|B,C)
The innermost maximization defines λB(A,D,C,E) — Variable Elimination
Max out a variable
A B C f(A,B,C)
T T T 0.03
T T F 0.07
T F T 0.54
T F F 0.36
F T T 0.06
F T F 0.14
F F T 0.48
F F F 0.32
A C f(A,C)
T T 0.54
T F 0.36
F T 0.48
F F 0.32
max out B
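The elimination operator for MPE mirrors `sum_out` from the earlier sketch, with max in place of addition; applied to the f(A,B,C) table above with var='B' it reproduces the f(A,C) table shown.

def max_out(factor, var):
    """Eliminate `var` by maximization (same factor representation as before)."""
    scope, table = factor
    i = scope.index(var)
    out = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i+1:]
        out[key] = max(out.get(key, 0.0), p)
    return scope[:i] + scope[i+1:], out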
Bucket elimination
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
max∏ Elimination/combination operators
λB(A,D,C,E)
λC(A,D,E)
λD(A,E)
λE(A)
MPE value
B
C
D
E
A
w* = 4“induced width”(max clique size)
bucket widths (top to bottom): 4, 3, 1, 1, 0
Generating the MPE tupleBucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
λB(A,D,C,E)
λC(A,D,E)
λD(A,E)
λE(A)
a’ = argmax_A P(A) ∙ λE(A)
e’ = 0
d’ = argmax_D λC(a’, D, e’)
c’ = argmax_C P(C|a’) ∙ λB(a’, d’, C, e’)
b’ = argmax_B P(e’|B, c’) ∙ P(d’|a’, B) ∙ P(B|a’)
Return (a’, b’, c’, d’, e’)
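A sketch of this backward pass, assuming `buckets[var]` holds the factors (original CPTs plus λ messages) placed in var's bucket during the forward pass, in the same (scope, table) representation as the earlier sketches:

def recover_mpe(order, buckets, domains, evidence):
    """order = bucket-processing ordering (A, E, D, C, B in the slides)."""
    assignment = dict(evidence)
    for var in order:
        if var in assignment:              # evidence variables stay clamped
            continue
        best_val, best_p = None, -1.0
        for val in domains[var]:
            assignment[var] = val
            p = 1.0
            for scope, table in buckets[var]:
                p *= table[tuple(assignment[v] for v in scope)]
            if p > best_p:                 # argmax over this bucket's product
                best_val, best_p = val, p
        assignment[var] = best_val
    return assignment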
Complexity of elimination: O(n ∙ exp(w*(d)))
w*(d) – induced width of the moral graph along ordering d
A
B C
ED
“Moral” graph
B
C
D
E
A
w*(d1) = 4
E
D
C
B
A
w*(d2) = 2
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search (tree, graph)
From BE to Bucket-Tree elimination• Motivation
BE computes P(evidence) or P(X|evidence), where X is the last variable in the ordering
What if we need all marginal probabilities P(Xi|evidence), where Xi ∈ {X1, X2, …, Xn}?
• Run BE n times with Xi being the last variable• Inefficient! – induced width may vary significantly from
one ordering to another• SOLUTION: Bucket-Tree Elimination (BTE)
Bucket-Tree eliminationA
B C
ED
P(A)
P(B|A)
P(E|B,C)
P(D|A,B)
Bucket E:
Bucket D:
Bucket C:
Bucket B:
Bucket A:
P(E|B,C)
P(D|A,B)
P(B|A)
P(A)
P(C|A) λE(B,C)
λD(A,B) λC(A,B)
λB(A)
P(E|B,C)
P(D|A,B)
P(C|A)
P(B|A)
P(A)
E
D
C
B
A
λE(B,C)
λD(A,B)λC(A,B)
λB(A)
• Variable elimination can be viewed as message passing (elimination) using a bucket tree
• Any node (bucket) can be the root
• Complexity: time and space exponential in the induced width
P(C|A)
Bucket-Tree (more formal)• Bucket Tree
A bucket tree has each bucket Bi as a node and there is an arc from Bi to Bj if the function created at Bi was placed in Bj
• Graph-based definition Let Gd be the induced graph along d. Each variable
X and its earlier neighbors is a node BX. There is an arc from BX to BY if Y is the closest parent to X.
Bucket-Tree
A
B C
ED
P(A)
P(B|A)
P(E|B,C)
P(D|A,B)
Belief network
E
D
C
B
A
Induced graph
E,B,C
A,B,D
A,B,C
B,A
A
E
D
C
B
A
λE(B,C)
λD(A,B)λC(A,B)
λB(A)
Bucket tree
P(C|A)
Bucket-Tree propagation
[Figure: bucket-tree node u with children x1 … xn (sending h(x1,u) … h(xn,u)) and parent v]
bucket(u) = ψ(u) ∪ { h(x1,u), h(x2,u), …, h(xn,u), h(v,u) }
Compute the message from u to v:
h(u,v) = Σ_{elim(u,v)} ∏ { f : f ∈ bucket(u), f ≠ h(v,u) }
where elim(u,v) = vars(u) − vars(v)
Upward messages in the bucket-tree
[Figure: the bucket tree with clusters {E,B,C}, {A,B,D}, {A,B,C}, {B,A}, {A}; upward messages λE(B,C), λD(A,B), λC(A,B), λB(A) and downward messages πA(A), πB(A,B) (one copy to C, one to D), πC(B,C)]
πA(A) = P(A)
πB→C(A,B) = P(B|A) ∙ πA(A) ∙ λD(A,B)
πB→D(A,B) = P(B|A) ∙ πA(A) ∙ λC(A,B)
πC(B,C) = ΣA P(C|A) ∙ πB→C(A,B)
Computing marginals from the bucket-tree
E,B,C : P(E|B,C)
A,B,D : P(D|A,B)
A,B,C : P(C|A)
B,A : P(B|A)
A : P(A)
[Figure: the bucket tree annotated with its λ and π messages]
P(C | evidence) ∝ Σ_{A,B} P(C|A) ∙ πB→C(A,B) ∙ λE(B,C)
Buckets -> Super-buckets -> Clusters
[Figure: left, the bucket tree with clusters {G,F}, {F,B,C}, {D,B,A}, {A,B,C}, {B,A}, {A} for a network with CPTs P(A), P(B|A), P(C|A), P(D|A,B), P(F|B,C), P(G|F); right, the buckets {B,A}, {A,B,C}, {D,B,A}, {F,B,C} collapsed into one super-bucket {A,B,C,D,F} connected to {G,F} via separator F]
Time-space trade off!
Tree decomposition
• A tree decomposition for a belief network ‹X,D,G,P› is a triple ‹T,χ,ψ›, where T=(V,E) is a tree, and χ and ψ are labeling functions associating with each vertex v ∈ V two sets χ(v) ⊆ X and ψ(v) ⊆ P such that:
– For each function (CPT) pi ∈ P there is exactly one vertex v such that pi ∈ ψ(v) and scope(pi) ⊆ χ(v)
– For each variable Xi ∈ X, the set {v ∈ V | Xi ∈ χ(v)} forms a connected sub-tree (running intersection property)
• A join-tree is a tree decomposition where all clusters are maximal
– E.g., a bucket-tree is a tree decomposition but not necessarily a join-tree

Treewidth and separator
• The width (aka treewidth) of a tree decomposition ‹T,χ,ψ› is max_v |χ(v)|, and its hyperwidth is max_v |ψ(v)|
• Given two adjacent vertices u and v of a tree decomposition, the separator of u and v is defined as sep(u,v) = χ(u) ∩ χ(v)
Finding join-tree decompositions• Good join trees using triangulation
Create induced graph G’ along some ordering d Identify all maximal cliques in G’ Order cliques {C1, C2, …, Ct} by rank of the highest
vertex in each clique Form the join tree by connecting each Ci to a
predecessor Cj (j < i) sharing the largest number of vertices with Ci
Example
[Figure: the moral graph and induced graph of the example network; maximal cliques C1 = {A,B,C}, C2 = {A,B,D}, C3 = {B,C,E}]
Join tree: cluster {A,B,C} holds P(A), P(B|A), P(C|A); separator AB connects it to cluster {A,B,D} with P(D|A,B); separator BC connects it to cluster {B,C,E} with P(E|B,C)
Treewidth = 3, separator size = 2 (e.g., χ(C3) = {B,C,E}, ψ(C3) = {P(E|B,C)})
Tree decomposition for belief updating
[Figure: a belief network over A, B, C, D, E, F, G and a tree decomposition for it:
cluster 1: {A,B,C} with P(A), P(B|A), P(C|A,B)
cluster 2: {B,C,D,F} with P(D|B), P(F|C,D)
cluster 3: {B,E,F} with P(E|B,F)
cluster 4: {E,F,G} with P(G|E,F)
separators: BC between 1–2, BF between 2–3, EF between 3–4]
Tree decomposition for belief updating
[Figure: the same network and tree decomposition]
h(1,2)(B,C) = ΣA P(A) P(B|A) P(C|A,B)
h(2,3)(B,F) = Σ_{C,D} P(D|B) P(F|C,D) h(1,2)(B,C)
h(3,4)(E,F) = ΣB P(E|B,F) h(2,3)(B,F)
h(4,3)(E,F) = P(G=g|E,F)   (G observed)
h(3,2)(B,F) = ΣE P(E|B,F) h(4,3)(E,F)
h(2,1)(B,C) = Σ_{D,F} P(D|B) P(F|C,D) h(3,2)(B,F)
Time: O(exp(w*+1)), Space: O(exp(sep))
CTE - properties
• Correctness and completeness: algorithm CTE is correct, i.e., it computes the exact joint probability of a single variable and the evidence
• Time complexity: O(deg ∙ (n+N) ∙ d^(w*+1))
• Space complexity: O(N ∙ d^sep)
» deg = max degree of a node in T
» n = number of variables (= number of CPTs)
» N = number of nodes in T
» d = maximum domain size
» w* = induced width
» sep = separator size
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) Cycle cutset scheme VE+C hybrid AND/OR search (tree, graph)
Conditioning
[Figure: the full OR search tree over A, B, C, D, E with 0/1 branches, and the example network]
P(A|E=0) = P(A, E=0) / P(E=0) = ?
P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=0|A=0,B=0)P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=1|A=0,B=0)
…P(A=0)P(B=1|A=0)P(C=1|A=0)P(E=0|B=1,C=1)P(D=1|A=0,B=1)
∑ = P(A=0, E=0)
Conditioning
[Figure: the same OR search tree; the subtrees below A=0 and A=1 accumulate P(A=0, E=0) and P(A=1, E=0)]
P(A=0 | E=0) = P(A=0, E=0) / (P(A=0, E=0) + P(A=1, E=0))
P(A=1 | E=0) = P(A=1, E=0) / (P(A=0, E=0) + P(A=1, E=0))
Conditioning + Elimination
IDEA: condition until w* of the remaining graph gets small enough!
[Figure: conditioning search over A on top; the remaining problem over B, C, D, E is solved by elimination]
A spectrum of hybrids: w* = 0 gives pure search, w* = 1 conditions on a loop cutset, and in general conditioning on a w-cutset leaves elimination subproblems of induced width w
P(E=0) = ?
Loop-cutset method• Condition until we get a polytree (no loops)
subset of conditioning variables = loop-cutset
[Figure: conditioning on A: for each value A=0 and A=1, the remaining network over B, C, D, E is a polytree]
P(B|D=0) = P(B,A=0|D=0) + P(B,A=1|D=0)
Loop-cutset method is time exponential in loop-cutset size and linear space!
w-cutset method
• Identify a w-cutset, Cw, of the network Finding smallest loop-cutset/w-cutset is NP-hard
• For each assignment of the cutset, solve by VE the conditioned subproblem
• Aggregate the solutions over all cutset assignments
• Time complexity: exp(|Cw| + w)
• Space complexity: exp(w)
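A sketch of the w-cutset scheme; `solve_conditioned` stands in for a variable-elimination call on the network with the given variables clamped, and is assumed rather than defined here:

from itertools import product

def cutset_condition(cutset, domains, evidence, solve_conditioned):
    """Aggregate exact sub-solutions over all cutset assignments."""
    total = 0.0
    for vals in product(*(domains[c] for c in cutset)):
        clamped = dict(evidence)
        clamped.update(zip(cutset, vals))
        total += solve_conditioned(clamped)   # P(cutset = vals, evidence)
    return total                              # P(evidence)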
Interleaving Conditioning and Elimination
[Animation: eliminate variables while the remaining width is small, condition (split the search) on a variable when it is not, and repeat on each subproblem]
General graphical models• All algorithms generalize to any graphical
model Through general operations of combination and
marginalization General BE, BTE, CTE, VE+C Applicable to Markov networks, to constraint
optimization, to counting number of solutions in SAT/CSP, etc.
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid Cycle cutset scheme AND/OR search (tree, graph)
Solution techniques
• Search (Conditioning)
– Complete: DFS search — time exp(n), space linear; AND/OR search — time exp(treewidth ∙ log n), space linear
– Incomplete: Gradient Descent, Stochastic Local Search
• Inference (Elimination)
– Complete: Variable Elimination / Bucket Elimination / Tree Clustering — time and space exp(treewidth)
– Incomplete: Mini-Bucket(i), Mini-Clustering(i), Belief Propagation
• Hybrids of search and inference — time exp(pathwidth), space exp(pathwidth)
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) Cycle cutset VE+C hybrid AND/OR search spaces
• AND/OR tree search• AND/OR graph search
OR search space
[Figure: a network over A, B, C, D, E, F]
Ordering: A B E C D F
[Figure: the OR search tree along this ordering — a complete binary tree of depth 6]
AND/OR search space
[Figure: the AND/OR search tree for the same problem, guided by a DFS tree of the moral graph: OR levels for the variables alternate with AND levels for their values 0/1, and the child subtrees below each AND node are solved independently]
OR vs. AND/OR
[Figure: the AND/OR search tree (guided by the pseudo tree) and the OR search tree (guided by a chain) for the same network, with a solution subtree highlighted in each]
AND/OR size: exp(4), OR size: exp(6)
AND/OR search spaces• The AND/OR search tree of R relative to a spanning-tree, T, has:
Alternating levels of: OR nodes (variables) and AND nodes (values)
• Successor function: The successors of OR nodes X are all its consistent values along its path The successors of AND <X,v> are all X child variables in T
• A solution is a consistent subtree• Task: compute the value of the root node
[Figure: the AND/OR search tree, with alternating OR nodes (variables A, B, C, E, D, F) and AND nodes (their values 0/1)]
From DFS trees to pseudo trees
[Figure: (a) a graph over nodes 1–7; (b) a DFS tree of depth 3; (c) a pseudo tree of depth 2; (d) a chain of depth 6]
(Freuder85, Bayardo & Miranker95)
Pseudo tree vs. DFS tree

Model (DAG)    w*     Pseudo tree avg. depth   DFS tree avg. depth
(N=50, P=2)    9.54   16.82                    36.03
(N=50, P=3)    16.1   23.34                    40.6
(N=50, P=4)    20.91  28.31                    43.19
(N=100, P=2)   18.3   27.59                    72.36
(N=100, P=3)   30.97  41.12                    80.47
(N=100, P=4)   40.27  50.53                    86.54

N = number of nodes, P = number of parents. MIN-FILL ordering. 100 instances.
Finding min-depth backbone trees
• Finding a min-depth DFS tree or pseudo tree is NP-complete, but:
• Given a tree decomposition whose treewidth is w*, there exists a pseudo tree T of G whose depth m satisfies:
m <= w* log n
(Bayardo & Miranker96, Bodlaender & Gilbert91)
Generating pseudo trees from bucket trees
[Figure: a network over A, B, C, D, E, F; its induced graph along ordering d: A B C E D F; the bucket tree with arcs labeled by shared variables such as (A), (AB), (AC)(BC), (AE), (BD)(DE), (AF)(EF); the bucket tree used as pseudo tree; and the resulting AND/OR search tree]
Other heuristics for pseudo trees• Depth-first traversal of the induced graph
constructed along some elimination ordering (e.g., min-fill) Sometimes can get slightly different trees than those
obtained from the bucket-tree
• Recursive decomposition of the dual hypergraph while minimizing the separator size at each step Functions (CPTs) are vertices in the dual hypergraph,
while variables are hyperedges Separator = set of hyperedges (i.e., variables)
Quality of the pseudo trees

Network    hypergraph width/depth   min-fill width/depth
barley     7 / 13                   7 / 23
diabetes   7 / 16                   4 / 77
link       21 / 40                  15 / 53
mildew     5 / 9                    4 / 13
munin1     12 / 17                  12 / 29
munin2     9 / 16                   9 / 32
munin3     9 / 15                   9 / 30
munin4     9 / 18                   9 / 30
water      11 / 16                  10 / 15
pigs       11 / 20                  11 / 26

Bayesian Networks Repository
AND/OR search tree properties
• Theorem: Any AND/OR search tree based on a pseudo tree is sound and complete (expresses all and only solutions)
• Theorem: Size of the AND/OR search tree is O(n k^m); size of the OR search tree is O(k^n)
• Theorem: Size of the AND/OR search tree can be bounded by O(exp(w* log n))
• Related to: (Freuder85; Dechter90; Bayardo et al. 96; Darwiche01; Bacchus et al. 03)
• When the pseudo tree is a chain we get an OR space
AND/OR vs. OR spaces

width  depth  OR: Time (sec.)  OR: Nodes    AND/OR: Time (sec.)  AND nodes  OR nodes
5      10     3.15             2,097,150    0.03                 10,494     5,247
4      9      3.13             2,097,150    0.01                 5,102      2,551
5      10     3.12             2,097,150    0.03                 8,926      4,463
4      10     3.12             2,097,150    0.02                 7,806      3,903
5      13     3.11             2,097,150    0.10                 36,510     18,255

Random graphs with 20 nodes, 20 edges and 2 values per node
Tasks and values of nodes
• v(n) is the value of the subtree T(n) for the task:
– Optimization (MPE): v(n) is the optimal solution in T(n)
– Belief updating: v(n) is the probability of evidence in T(n)
• Goal: compute the value of the root node recursively using DFS search of the AND/OR tree (a sketch follows)
• Theorem: Complexity of AND/OR DFS search is: Space O(n); Time O(n k^m), hence O(exp(w* log n))
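A recursive sketch of the DFS computation for belief updating, assuming `children` gives pseudo-tree children and `weight(var, val, path)` returns the product of CPTs that mention var and become fully instantiated once path is extended with var = val (1 if none):

def or_value(var, path, children, domains, weight):
    # OR node: weighted sum over the variable's values
    return sum(weight(var, val, path) *
               and_value(var, val, path, children, domains, weight)
               for val in domains[var])

def and_value(var, val, path, children, domains, weight):
    # AND node: product over independent child subproblems (1.0 if terminal)
    path = dict(path)
    path[var] = val
    p = 1.0
    for child in children.get(var, []):
        p *= or_value(child, path, children, domains, weight)
    return p

# P(evidence) = or_value(root, {}, children, domains, weight); replacing
# sum with max at OR nodes yields the MPE value instead.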
Weighted AND/OR tree (belief updating)
[Figure: the weighted AND/OR search tree for the example network, with evidence D=1, E=0; arc weights appear on the OR-to-AND arcs]
P(D|B,C): (B,C)=(0,0): D=0 .2, D=1 .8; (0,1): .1/.9; (1,0): .3/.7; (1,1): .5/.5 — evidence D=1
P(E|A,B): (A,B)=(0,0): E=0 .4, E=1 .6; (0,1): .5/.5; (1,0): .7/.3; (1,1): .2/.8 — evidence E=0
P(B|A): A=0: B=0 .4, B=1 .6; A=1: .1/.9
P(C|A): A=0: C=0 .2, C=1 .8; A=1: .7/.3
P(A): A=0 .6, A=1 .4
Buckets: A: P(A); B: P(B|A); E: P(E|A,B), E=0; C: P(C|A); D: P(D|B,C), D=1
w(X,x) = product of CPTs that contain X and whose scope is fully instantiated along the path
Computing node values (belief updating)
OR node: v(A) = Σ_{i=1..k} w(A,i) ∙ v(A,i)
[Figure: OR node A with AND children 1 … k, arc weights w(A,1) … w(A,k) and child values v(A,1) … v(A,k)]
AND node: v(⟨A,0⟩) = ∏_{i=1..m} v(Xi)
[Figure: AND node ⟨A,0⟩ with OR children X1 … Xm and values v(X1) … v(Xm)]
NOTE:
• the value of a terminal AND node is 1
• the weight of an OR-AND arc for which no CPTs are fully instantiated is 1
AND/OR tree algorithm (belief updating)
AND node: Combination operator (product)
OR node: Marginalization operator (summation)
Value of node = updated belief for sub-problem below
[Figure: the weighted AND/OR tree with values propagated bottom-up — summation at OR nodes, product at AND nodes; the subtree below A=0 evaluates to .3028 and below A=1 to .1559]
Result: P(D=1, E=0) = 0.3028 ∙ 0.6 + 0.1559 ∙ 0.4 = 0.24408
Complexity of AND/OR tree search

         AND/OR tree                           OR tree
Space    O(n)                                  O(n)
Time     O(n k^m), i.e., O(n k^(w* log n))     O(k^n)

(Freuder & Quinn85), (Collin, Dechter & Katz91), (Bayardo & Miranker95), (Darwiche01)
k = domain size, m = depth of pseudo-tree, n = number of variables, w* = treewidth
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search spaces
• AND/OR tree search• AND/OR graph search
From search trees to search graphs
• Any two nodes that root identical sub-trees or sub-graphs can be merged
AND/OR search tree
[Figure: a network over A, B, C, D, E, F, G, H, J, K with its pseudo tree, and the corresponding AND/OR search tree; the subtrees below G, H, J, K are repeated many times]
AND/OR search graph
[Figure: the same search space after merging identical subtrees: each distinct subproblem below G, H, J, K now appears only once]
Merging based on context• One way of recognizing nodes that can be merged
context(X) = ancestors of X in the pseudo tree that are connected to X, or to descendants
of X
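A sketch of computing contexts from a pseudo tree, assuming `parent` maps each variable to its pseudo-tree parent (None at the root), `children` is the inverse map, and `edges` is the set of primal-graph edges represented as frozensets:

def contexts(parent, children, edges):
    def subtree(x):                        # variables in x's pseudo subtree
        out = {x}
        for c in children.get(x, []):
            out |= subtree(c)
        return out
    ctx = {}
    for x in parent:
        anc, a = [], parent[x]
        while a is not None:               # pseudo-tree ancestors of x
            anc.append(a)
            a = parent.get(a)
        sub = subtree(x)
        ctx[x] = {v for v in anc           # ancestors connected to x's subtree
                  if any(frozenset((v, u)) in edges for u in sub)}
    return ctx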
[Figure: the pseudo tree for a network over A, B, C, D, E, F with node contexts annotated: [ ], [A], [AB], [AB], [BC], [AE]; merging nodes with equal context turns the search tree into the context-minimal AND/OR graph]
AND/OR graph algorithm (belief updating)
[Figure: the context-minimal AND/OR graph for the example with evidence E=0; nodes with identical contexts are merged and share cache entries]
Contexts: A: [ ], B: [A], E: [AB], C: [AB], D: [BC]
Cache table for D: (B,C)=(0,0): .8; (0,1): .9; (1,0): .7; (1,1): .1
Result: P(D=1, E=0) = 0.24408
Context-minimal AND/OR graph
[Figure: a larger network over variables A–P with pseudo tree ordered (C K H A B E J L N O D P M F G); node contexts are annotated, e.g., [ ], [C], [CK], [CKL], [CKLN], [CKO], [CH], [CHA], [CHAB], [CHAE], [CEJ], [CD], [AB], [AF]; the context-minimal AND/OR graph merges all nodes with equal context]
How big is the context?
Theorem: The maximum context size for a pseudo tree is equal to the treewidth of the graph along the pseudo tree.
[Figure: the same pseudo tree with its annotated contexts; the largest context has size equal to the treewidth]
max context size = treewidth
Treewidth vs. pathwidth
[Figure: a graph over A–M with two decompositions]
TREE decomposition with clusters ABC, BDEF, BDFG, EFH, FHK, HJ, KLM: treewidth = 3 = (max cluster size) − 1
CHAIN decomposition with clusters ABC, BDEFG, EFH, FHKJ, KLM: pathwidth = 4 = (max cluster size) − 1
AND/OR graph search
• AO(i): searches depth-first, caching tables over contexts of at most i variables (i = the max size of a cache table, i.e., number of variables in a context)
• i = 0: space O(n), time O(exp(w* log n)) — plain AND/OR tree search
• 0 < i < w*: space O(exp i), time O(exp(m_i + i)), where m_i is the depth of the corresponding w-cutset tree (see the w-cutset slides below)
• i = w*: space O(exp w*), time O(exp w*) — full context-based caching
Complexity of AND/OR graph search

         AND/OR graph    OR graph
Space    O(n k^w*)       O(n k^pw*)
Time     O(n k^w*)       O(n k^pw*)

k = domain size, n = number of variables, w* = treewidth, pw* = pathwidth
w* ≤ pw* ≤ w* log n
Related work• Recursive Conditioning (RC) (Darwiche01)
Can be viewed as an AND/OR graph search algorithm guided by tree
Guiding tree structure is called “dtree”
• Value Elimination (VE) (Bacchus et al.03) Also an AND/OR graph search algorithm using an
advanced caching scheme based on components rather than graph-based contexts
Can use dynamic variable orderings
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search spaces
• AND/OR tree search• AND/OR graph search
AND/OR w-cutset
[Figure: a moral graph over A–M and its conditioned subgraphs: removing a 3-cutset, a 2-cutset, and a 1-cutset leaves remaining graphs of progressively smaller induced width]
AND/OR w-cutset
[Figure: the moral graph, a pseudo tree, and the corresponding 1-cutset tree for the same network]
Searching AND/OR graphs
• AO(i): searches depth-first, caching tables over contexts of at most i variables (i = the max size of a cache table, i.e., number of variables in a context)
• i = 0: space O(n), time O(exp(w* log n))
• 0 < i < w*: space O(exp i), time O(exp(m_i + i)), where m_i is the depth of the w-cutset tree (next slide)
• i = w*: space O(exp w*), time O(exp w*)
w-cutset trees over AND/OR space
• Definition: T_w is a w-cutset tree relative to a backbone pseudo tree T iff T_w contains the root of T and, when removed, yields treewidth w.
• Theorem: AO(i) time complexity for backbone T is O(exp(i + m_i)) and space O(exp i), where m_i is the depth of the T_i tree.
• This is better than the w-cutset bound O(exp(i + c_i)), where c_i is the number of nodes in T_i
Exact inference• Variable elimination (inference)
Bucket elimination Bucket-Tree elimination Cluster-Tree elimination
• Conditioning (search) VE+C hybrid AND/OR search for Most Probable Explanations
AND/OR Branch-and-Bound for MPE
• Solved by BE in time and space exponential in treewidth w*
• Solved by Conditioning in linear space and time exponential in the number of variables n
• It can be solved by AND/OR search: Tree search: space O(n), time O(exp(w* log n)) Graph search: time and space O(exp(w*))
MPE = max_{X1,…,Xn} ∏_{i=1}^{n} P(Xi | pa(Xi))
Weighted AND/OR tree (MPE task)
[Figure: the same weighted AND/OR search tree and CPTs as in the belief-updating example, with evidence D=1, E=0]
Buckets: A: P(A); B: P(B|A); E: P(E|A,B), E=0; C: P(C|A); D: P(D|B,C), D=1
w(X,x) = product of CPTs that contain X and whose scope is fully instantiated along the path
Computing node values (MPE task)
OR node: v(A) = max_{i=1..k} w(A,i) ∙ v(A,i)
[Figure: OR node A with AND children 1 … k, arc weights w(A,1) … w(A,k) and child values v(A,1) … v(A,k)]
AND node: v(⟨A,0⟩) = ∏_{i=1..m} v(Xi)
[Figure: AND node ⟨A,0⟩ with OR children X1 … Xm and values v(X1) … v(Xm)]
NOTE:
• the value of a terminal AND node is 1
• the weight of an OR-AND arc for which no CPTs are fully instantiated is 1
AND/OR tree algorithm (MPE task)
AND node: Combination operator (product)
OR node: Marginalization operator (maximization)
Value of node = MPE value for sub-problem below
[Figure: the weighted AND/OR tree with values propagated bottom-up — maximization at OR nodes, product at AND nodes; the subtree below A=0 evaluates to .12 and below A=1 to .081]
Result: MPE(D=1, E=0) = max(0.12 ∙ 0.6, 0.081 ∙ 0.4) = 0.072
Branch-and-Bound search
[Figure: an OR search tree; at node n, g(n) is the cost of the search path to n and h(n) estimates the optimal cost below n]
Upper Bound: UB(n) = g(n) * h(n)
Prune if UB(n) ≤ LB (the current Lower Bound)
(Lawler & Wood66)
Partial solution tree
[Figure: pseudo tree over A, B, C, D and four partial solution trees, e.g., (A=0, B=0, C=0, D=0), (A=0, B=0, C=0, D=1), (A=0, B=1, C=0, D=0), (A=0, B=1, C=0, D=1)]
Extension(T’) – the set of solution trees that extend T’
Exact evaluation function
[Figure: an AND/OR search tree whose arc weights come from cost functions f1(A,B,C), f2(A,B,F), f3(B,D,E); the current partial solution tree T’ = (A=0, B=1, C=0, D=0) has tip nodes ⟨D,0⟩ and F with exact values v(D,0) and v(F)]
f*(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * v(D,0) * v(F)
Heuristic evaluation function
[Figure: the same search space with heuristic estimates at the tip nodes, h(D,0) = 4 and h(F) = 5]
f(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * h(D,0) * h(F) ≥ f*(T’), since h(n) ≥ v(n)
AND/OR Branch-and-Bound search
[Figure: AOBB expanding the AND/OR tree depth-first; the subtree below the current tip is pruned when f(T’) ≤ LB]
(Marinescu and Dechter, 05)
AND/OR Branch-and-Bound search• Associate each node n with a heuristic upper
bound h(n) on v(n)• EXPAND (top-down)
Evaluate f(T’) of the current partial solution sub-tree T’, and prune search if f(T’) ≤ LB
Expand the tip node n by generating its successors• PROPAGATE (bottom-up)
Update value of the parent p of n• OR nodes: maximization• AND nodes: product
How to Generate Heuristics• The principle of relaxed models
Mini-Bucket Elimination for belief networks(Pearl86)
Grid Networks (BN)
[Table: MPE results on grid networks 90-24-1, 90-26-1, 90-30-1 (with their (w*, h) and (n, e) parameters), comparing SamIam v2.3.2 with MBE(i), BB+SMB(i), AOBB+SMB(i), BB+DMB(i), AOBB+DMB(i) for i = 10, 14, 18, 20, reporting time and expanded nodes; AOBB with mini-bucket heuristics solves instances on which the other solvers run out of time. Min-fill pseudo tree. Time limit 1 hour.]
(Sang et al.05)
Genetic Linkage Analysis (BN)
[Table: MPE results on genetic linkage networks ped18, ped25, ped30, ped33, ped39, comparing Superlink v1.6 and SamIam v2.3.2 with MBE(i), BB+SMB(i), AOBB+SMB(i) for i = 12, 16, 20, reporting time and expanded nodes; AOBB+SMB(i) solves instances (e.g., ped30 in 82.25 s at i = 20) that are infeasible for Superlink and SamIam. Min-fill pseudo tree. Time limit 3 hours.]
(Fishelson & Geiger02)
Memory intensive AND/OR Branch-and-Bound
• Associate each node n with a heuristic upper bound h(n) on v(n)
• EXPAND (top-down) Evaluate f(T’) of the current partial solution sub-tree T’, and
prune search if f(T’) ≤ LB If not in cache, expand the tip node n by generating its
successors• PROPAGATE (bottom-up)
Update value of the parent p of n• OR nodes: maximization• AND nodes: multiplication
Cache value of n, based on context
Best-first AND/OR search for MPE• Best-first search expands first the node with
the best heuristic evaluation function among all nodes encountered so far
• It never expands nodes whose cost is beyond the optimal one, unlike depth-first search algorithms (Dechter & Pearl85)
• Superior among memory intensive algorithms employing the same heuristic function
Best-First AND/OR Search• Maintains the set of best partial solution trees• EXPAND (top-down)
Traces down marked connectors from root (best partial solution tree) Expands a tip node n by generating its successors n’ Associate each successor with heuristic estimate h(n’)
• Initialize v(n’) = h(n’)
• REVISE (bottom-up) Updates node values v(n)
• OR nodes: maximization• AND nodes: multiplication
Marks the most promising solution tree from the root Label the nodes as SOLVED:
• OR is SOLVED if marked child is SOLVED• AND is SOLVED if all children are SOLVED
• Terminate when root node is SOLVED
[specializes Nilsson’s AO* to graphical models (Nilsson80)]
(Marinescu & Dechter, 07)
Grid Networks (BN)
[Table: MPE results on grid networks 90-24-1, 90-34-1, 90-38-1, comparing SamIam with MBE(i), BB-C+SMB(i), AOBB+SMB(i), AOBB-C+SMB(i), AOBF-C+SMB(i) for i = 12, 14, 16, 18, reporting time and expanded nodes; the best-first variant AOBF-C+SMB(i) expands the fewest nodes. Min-fill pseudo tree. Time limit 1 hour.]
Solving the MAP task
• Solved by BE in time and space exponential in constrained induced width w*
• Solved by AND/OR search: Tree search: space O(n), time O(exp(w* log n)) Graph search: time and space O(exp(w*))
MAP: (a1*, …, ak*) = argmax_a Σ_{x ∈ X\A} P(x, e)
Bucket elimination for MAP
[Figure: the example network and its moral graph (“marry parents”)]
Variables A and B are the hypothesis variables; variable E is evidence (E = 0)
MAP = max_{a,b} P(a, b, e=0) = max_{a,b} Σ_{c,d} P(a, b, c, d, e=0)
MAP = max_a P(a) max_b P(b|a) Σ_c P(c|a) Σ_d P(d|a,b) Σ_{e=0} P(e|b,c)
Bucket elimination for MAP
Bucket E:
Bucket D:
Bucket C:
Bucket B:
Bucket A:
P(E|B,C), E = 0
P(D|A,B)
P(A)
λE(B, C)
λC(A,B)λD(A, B)
λB(A)
MAP value
P(C|A)
P(B|A)
SUM buckets
MAX buckets
Bucket elimination for MAP
• Elimination order is important: SUM variables are eliminated first, followed by the MAX variables
– ordering A, B, C, D, E is legal; ordering A, C, D, E, B is illegal
• The induced width corresponding to a legal elimination order is called the constrained induced width cw*
– Typically it may be far larger than the unconstrained induced width, i.e., cw* ≥ w*
• When interleaving MAX and SUM (using unconstrained orderings) the result is an Upper Bound on the MAP value Can be used as a guiding heuristic function for search
AND/OR tree algorithm for MAP
AND node: Combination operator (product)
OR node: MAX for hypothesis, SUM otherwise
[Figure: the weighted AND/OR tree with SUM applied at OR nodes of the summation variables and MAX at the hypothesis variables; the subtree below A=0 evaluates to .162 and below A=1 to .0936]
Result: MAP(D=1, E=0) = max(0.162 * 0.6, 0.0936 * 0.4) = 0.0972
AND/OR search for MAP• Pseudo tree must be consistent with the
constrained elimination order• Graph search via context-based caching
• Time and space complexity Tree search:
• Space linear, time O(exp(cw*log n)) Graph search:
• Time and space O(exp(cw*))
Outline• Probabilistic modeling with joint distributions• Conditional independence and factorization• Belief networks• Inference in belief networks
Exact inference Approximate inference
Approximate inference
• Mini-Bucket Elimination
– Mini-clustering
• Iterative Belief Propagation
– IJGP – Iterative Join-Graph Propagation
• Sampling
– Forward sampling, Gibbs sampling (MCMC), Importance sampling (a forward-sampling sketch follows this outline)
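The deck only lists the sampling schemes, so here is a minimal forward (ancestral) sampling sketch with rejection of samples inconsistent with the evidence; `cpt[x]` maps a tuple of parent values to a distribution over x's values — an assumed representation.

import random

def forward_sample(order, parents, cpt, domains):
    """Sample each variable in topological order from its CPT."""
    sample = {}
    for x in order:
        dist = cpt[x][tuple(sample[p] for p in parents[x])]
        r, acc = random.random(), 0.0
        for val in domains[x]:
            acc += dist[val]
            if r <= acc:
                sample[x] = val
                break
    return sample

def estimate(query, evidence, n, *model):
    """Estimate P(query | evidence) by rejection over n forward samples."""
    hits = total = 0
    for _ in range(n):
        s = forward_sample(*model)
        if all(s[v] == val for v, val in evidence.items()):
            total += 1
            hits += all(s[v] == val for v, val in query.items())
    return hits / total if total else None

Rejection wastes samples when the evidence is unlikely; importance sampling and Gibbs sampling (MCMC) address exactly this.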
Solution techniques
• Search (Conditioning)
– Complete: DFS search — time exp(n), space linear; AND/OR search — time exp(treewidth ∙ log n), space linear
– Incomplete: Gradient Descent, Stochastic Local Search
• Inference (Elimination)
– Complete: Variable Elimination / Bucket Elimination / Tree Clustering — time and space exp(treewidth)
– Incomplete: Mini-Bucket(i), Mini-Clustering(i), Belief Propagation
• Hybrids of search and inference — time exp(pathwidth), space exp(pathwidth)
Variable elimination (MPE)
A
B C
ED
P(A)
P(C|A)P(B|A)
P(E|B,C)
P(D|A,B)
MPE = ?
maxA,E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) =
maxA P(A) maxE=0 maxD maxC P(C|A) maxB P(B|A) P(D|A,B) P(E|B,C)
The innermost maximization defines λB(A,D,C,E) — Variable Elimination
Given a belief network and some evidence
Bucket elimination (MPE)
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
max∏ Elimination operator
λB(A,D,C,E)
λC(A,D,E)
λD(A,E)
λE(A)
MPE
B
C
D
E
A
w* = 4“induced width”(max clique size)
bucket widths (top to bottom): 4, 3, 1, 1, 0
MBE: Mini-Bucket Elimination• Computation in a bucket is time and space
exponential in the number of variables involved (i.e., width)
• Therefore, partition functions in a bucket into “mini-buckets” on smaller number of variables
• The idea is similar to i-consistency: bound the size of recorded dependencies (Dechter 2003)
Idea: MPE task
Split a bucket into mini-buckets => bound complexity:
max_X ∏ (h ∙ g) ≤ (max_X ∏ h) ∙ (max_X ∏ g)
Exponential complexity decrease: O(e^n) → O(e^r) + O(e^(n−r))
MBE(i=3) in action for MPE
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C) P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
λB(C,E)
λC(A,D,E)
Upper Bound on MPE value
λE(A)
λB(A,D)
λD(A,E)
4 variables: split
3 variables: OK
3 variables: OK
2 variables: OK
1 variable: OK
Mini-buckets: each processed by the max∏ elimination operator
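A sketch of the greedy partition step, using the (scope, table) factor representation from the earlier sketches; each resulting mini-bucket is then eliminated separately (per the slides: for MPE all by maximization, for P(evidence) the first by summation and the rest by maximization or minimization, depending on the desired bound).

def partition_bucket(factors, i_bound):
    """Greedily group a bucket's factors into mini-buckets whose combined
    scope has at most i_bound distinct variables."""
    minibuckets = []   # each entry: [scope set, list of factors]
    for scope, table in factors:
        for mb in minibuckets:
            if len(mb[0] | set(scope)) <= i_bound:   # fits the i-bound
                mb[0].update(scope)
                mb[1].append((scope, table))
                break
        else:
            minibuckets.append([set(scope), [(scope, table)]])
    return minibuckets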
MBE(i=3) in action for MPEBucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C), P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
λB(C,E)
λC(A,D,E)
λE(A)
λB(A,D)
λD(A,E)
a’ = argmax_A P(A) ∙ λE(A)
e’ = 0
d’ = argmax_D λC(a’, D, e’) ∙ λB(a’, D)
c’ = argmax_C P(C|a’) ∙ λB(C, e’)
b’ = argmax_B P(e’|B, c’) ∙ P(d’|a’, B) ∙ P(B|a’)
Return (a’, b’, c’, d’, e’)
A Lower Bound can also be computed as the probability of the sub-optimal assignment P(a’, b’, c’, d’, e’)
MBE(i=3) for probability of evidence
Bucket B:
Bucket C:
Bucket D:
Bucket E:
Bucket A:
P(E|B,C) P(D|A,B), P(B|A)
P(C|A)
E=0
P(A)
λB(C,E)
λC(A,D,E)
Upper Bound on P(evidence)
λE(A)
λB(A,D)
λD(A,E)
4 variables: split
3 variables: OK
3 variables: OK
2 variables: OK
1 variable: OK
Mini-buckets: each processed by the ∑∏ elimination operator
MBE(i) for probability of evidence• If we process all mini-buckets by summation
then we get an unnecessarily large upper bound on the probability of evidence
• Tighter upper bound Process first mini-bucket by summation and
remaining ones by maximization• We can also get a lower bound on P(evidence)
Process first mini-bucket by summation and remaining ones by minimization
Properties of MBE(i)• Controlling parameter i (called i-bound)
Maximum number of distinct variables in a mini-bucket Outputs both a lower and an upper bound
• Complexity: O(exp(i)) time and space• As i-bound increases, both accuracy and time complexity
increase Clearly, if i = w*, then we have pure BE
• Possible use of mini-bucket approximations As anytime algorithms (Dechter & Rish, 1997) As heuristic functions for depth-first and best-first search (Kask
& Dechter, 2001), (Marinescu & Dechter, 2005)
Mini-Bucket Heuristics• Static Mini-Buckets
Pre-compiled Reduced overhead Less accurate Static variable ordering
• Dynamic Mini-Buckets Computed dynamically Higher overhead High accuracy Dynamic variable ordering
Heuristic evaluation function
[Figure: the AND/OR search space from the earlier Branch-and-Bound slides, with cost functions f1(A,B,C), f2(A,B,F), f3(B,D,E) and heuristic estimates h(D,0) = 4 and h(F) = 5 at the tip nodes of T’ = (A=0, B=1, C=0, D=0)]
f(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * h(D,0) * h(F) ≥ f*(T’), since h(n) ≥ v(n)
Bucket elimination
[Figure: buckets along ordering (A, B, C, D, E, F, G) with functions f(A,B), f(B,C), f(B,F), f(A,G), f(F,G), f(B,E), f(C,E), f(A,D), f(B,D), f(C,D) and messages hB(A), hC(A,B), hD(A,B,C), hE(B,C), hF(A,B), hG(A,F)]
h*(a, b, c) = hD(a, b, c) * hE(b, c)
(Dechter99)
Static mini-bucket heuristics
[Figure: the same buckets processed by MBE(3); bucket D splits into mini-buckets {f(B,D), f(C,D)} → hD(B,C) and {f(A,D)} → hD(A)]
MBE(3): h(a, b, c) = hD(a) * hD(b, c) * hE(b, c) ≥ h*(a, b, c)
Dynamic mini-bucket heuristics
[Figure: the buckets conditioned on the current partial assignment (a, b), then processed by MBE(3)]
MBE(3): h(a, b, c) = hD(c) * hE(c) = h*(a, b, c)
Static vs. Dynamic Mini-Bucket Heuristics
s1196 ISCAS’89 circuit.
Approximate inference• Mini-Bucket Elimination
Mini-clustering (tree decompositions)• Iterative Belief Propagation
IJGP – Iterative Joint Graph Propagation• Sampling
Forward sampling Gibbs sampling (MCMC) Importance sampling Particle filtering
Cluster Tree Elimination (CTE)• Correctness and completeness:
Algorithm CTE is correct, i.e. it computes the exact posterior joint probability of all single variables (or subsets) and the evidence.
• Time complexity: O(deg ∙ (n+N) ∙ d^(w*+1))
• Space complexity: O(N ∙ d^sep)
where deg = the maximum degree of a node
n = number of variables (= number of CPTs)
N = number of nodes in the tree decomposition
d = the maximum domain size of a variable
w* = the induced width
sep = the separator size
Cluster Tree Elimination - messages
[Figure: the tree decomposition with clusters 1: {A,B,C} p(a), p(b|a), p(c|a,b); 2: {B,C,D,F} p(d|b), p(f|c,d); 3: {B,E,F} p(e|b,f); 4: {E,F,G} p(g|e,f); separators BC, BF, EF]
h(1,2)(b,c) = Σ_a p(a) p(b|a) p(c|a,b)
h(2,3)(b,f) = Σ_{c,d} p(d|b) p(f|c,d) h(1,2)(b,c)
sep(2,3) = {B,F}, elim(2,3) = {C,D}
Mini-Clustering for belief updating• Motivation:
Time and space complexity of Cluster Tree Elimination depend on the induced width w* of the problem
When the induced width w* is big, CTE algorithm becomes infeasible
• The basic idea: Try to reduce the size of the cluster (the exponent);
partition each cluster into mini-clusters with less variables Accuracy parameter i = maximum number of variables in a
mini-cluster The idea was explored for variable elimination (MBE)
Idea of Mini-Clustering
Split a cluster into mini-clusters => bound complexity
cluster(u) = {h1, …, hr, hr+1, …, hn}
Exact: h = Σ_elim ∏_{i=1..n} hi
Approximate: g = (Σ_elim ∏_{i=1..r} hi) ∙ (Σ_elim ∏_{i=r+1..n} hi), with g ≥ h
Exponential complexity decrease: O(e^n) → O(e^r) + O(e^(n−r))
Mini-Clustering (MC)
h(1,2)(b,c) = Σ_a p(a) p(b|a) p(c|a,b)  (cluster 1 is within the i-bound, so its message is exact)
[Figure: cluster 2 = {B,C,D,F} with p(d|b), h(1,2)(b,c), p(f|c,d), split for i = 3 into mini-clusters {p(d|b), h(1,2)(b,c)} and {p(f|c,d)}; sep(2,3) = {B,F}, elim(2,3) = {C,D}]
Instead of the exact CTE message h(2,3)(b,f) = Σ_{c,d} p(d|b) h(1,2)(b,c) p(f|c,d), MC(3) sends two smaller messages:
H(2,3): h1(2,3)(b) = Σ_{c,d} p(d|b) h1(1,2)(b,c);  h2(2,3)(f) = max_{c,d} p(f|c,d)
H(3,4): h1(3,4)(e,f) = Σ_b p(e|b,f) h1(2,3)(b) h2(2,3)(f)
H(4,3): h1(4,3)(e,f) = p(G=ge|e,f)
H(3,2): h1(3,2)(b,f) = Σ_e p(e|b,f) h1(4,3)(e,f)
H(2,1): h1(2,1)(b) = Σ_{d,f} p(d|b) h1(3,2)(b,f);  h2(2,1)(c) = max_{d,f} p(f|c,d)
Mini-Clustering - example
Mini-Clustering• Correctness and completeness:
Algorithm MC(i) computes a bound (or an approximation) on the joint probability P(Xi,e) of each variable and each of its values.
• Time & space complexity: O(exp(i))
Approximate inference• Mini-Bucket Elimination
Mini-clustering• Iterative Belief Propagation
IJGP – Iterative Join-Graph Propagation
• Sampling
Forward sampling Gibbs sampling (MCMC) Importance sampling Particle filtering
Iterative Belief Propagation (IBP)• Belief propagation is exact for poly-trees (Pearl, 1988)• IBP - applying BP iteratively to cyclic networks
• No guarantees for convergence• Works well for many coding networks
[Figure: a polytree fragment with parents U1, U2, U3 and children X1, X2, exchanging π and λ messages such as πX1(u1), λX2(u1); one step updates BEL(U1)]
Iterative Belief Propagation
[Figure: a belief network over A–J with CPTs P(A), P(C), P(H), P(B|A,C), P(E|B,C), P(D|A,B,E), P(F|C,D,E), P(G|H,F), P(I|F,G), P(J|H,G,I), and the dual graph IBP works on, with clusters A, C, H, ABC, BCE, ABDE, CDEF, FGH, FGI, GHIJ and labeled arcs]
Iterative Join-Graph Propagation (IJGP)• IBP is applied to a loopy network iteratively
not an anytime algorithm when it converges, it converges very fast
• MC applies bounded inference along a tree decomposition MC is an anytime algorithm controlled by i-bound MC converges in two passes up and down the tree
• IJGP combines: the iterative feature of IBP the anytime feature of MC
IJGP - The basic idea Apply Cluster Tree Elimination to any join-graph
We commit to graphs that are minimal I-maps
Avoid cycles as long as I-mapness is not violated
Result: use minimal arc-labeled join-graphs
IJGP - Example
[Figure: the belief network above; the dual graph IBP works on; an arc-minimal join-graph; and a minimal arc-labeled join-graph obtained by shrinking arc labels]
Join-graph decompositions
[Figure: (a) a minimal arc-labeled join-graph; (b) the join-graph obtained by collapsing nodes of (a), with (c) its minimal arc-labeled version; and a tree decomposition with clusters ABCDE, CDEF, FGHI, GHIJ]
Join-graphs
[Figure: the spectrum of join-graph decompositions of the same network, from the dual graph (on which IBP runs), through a minimal arc-labeled join-graph and a join-graph with collapsed nodes, to a tree decomposition; moving toward the tree decomposition gives more accuracy, moving toward the dual graph gives less complexity]
Message propagation
Cluster 1 = {A,B,C,D,E} contains p(a), p(c), p(b|a,c), p(d|a,b,e), p(e|b,c) and the incoming message h_(3,1)(b,c).
Minimal arc-labeled: sep(1,2) = {D,E}, elim(1,2) = {A,B,C}:
    h_(1,2)(d,e) = Σ_{a,b,c} p(a) p(c) p(b|a,c) p(d|a,b,e) p(e|b,c) h_(3,1)(b,c)
Non-minimal arc-labeled: sep(1,2) = {C,D,E}, elim(1,2) = {A,B}:
    h_(1,2)(c,d,e) = Σ_{a,b} p(a) p(c) p(b|a,c) p(d|a,b,e) p(e|b,c) h_(3,1)(b,c)
Bounded decompositions• We want arc-labeled decompositions such that:
the cluster size (internal width) is bounded by i (the accuracy parameter)
the width of the decomposition as a graph (external width) is as small as possible - closer to a tree
• Possible approaches to build decompositions:
partition-based algorithms - inspired by the mini-bucket decomposition
grouping-based algorithms
Partition-based algorithms
a) Schematic mini-bucket(i), i=3:
G: (GFE)
E: (EBF) (EF)
F: (FCD) (BF)
D: (DB) (CD)
C: (CAB) (CB)
B: (BA) (AB) (B)
A: (A)
b) [Figure: the corresponding minimal arc-labeled join-graph decomposition, with clusters GFE, EBF, FCD, CDB, CAB, BA, A holding P(G|F,E), P(E|B,F), P(F|C,D), P(D|B), P(C|A,B), P(B|A), P(A), and arcs labeled EF, BF, CD, CB, BA, A, B, F]
IJGP properties• IJGP(i) applies BP to a minimal arc-labeled join-graph whose cluster size is bounded by i
• On join-trees, IJGP finds exact beliefs!
• IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman and Weiss, 2001)
• Complexity of one iteration: time O(deg·(n+N)·d^(i+1)), space O(N·d)
Random networks - KL at convergence
[Figure: KL distance (log scale, 1e-5 to 1e-2) vs. i-bound (1-11) for IJGP, MC, and IBP; random networks, N=50, K=2, P=3, w*=16, 100 instances; left panel evidence=0, right panel evidence=5]
Random networks - KL vs. iterations
[Figure: KL distance (log scale) vs. number of iterations (0-35) for IJGP(2), IJGP(10), and IBP; random networks, N=50, K=2, P=3, w*=16, 100 instances; left panel evidence=0, right panel evidence=5]
Random networks - Time
[Figure: time in seconds (0-1.0) vs. i-bound (1-11) for IJGP with 20 iterations, MC, and IBP with 10 iterations; random networks, N=50, K=2, P=3, evid=5, w*=16, 100 instances]
IJGP summary• IJGP borrows the iterative feature from IBP and the anytime virtues of bounded inference from MC
• Empirical evaluation showed the potential of IJGP, which improves with iterations and most of the time with the i-bound, and scales up to large networks
• IJGP is almost always superior to IBP and MC, often by a high margin
• Based on all our experiments, we think that IJGP provides a practical breakthrough for the task of belief updating
• #CSP: IJGP can be used to generate solution-count estimates for depth-first Branch-and-Bound search
Approximate inference• Mini-Bucket Elimination / Mini-Clustering
• Iterative Belief Propagation
IJGP – Iterative Join-Graph Propagation
• Sampling
Forward sampling, Gibbs sampling (MCMC), Importance sampling
Approximation algorithms• Structural approximations: eliminate some dependencies
• Remove edges
• Mini-Bucket and Mini-Clustering approaches
• Local search, for optimization tasks: MPE, MAP
• Use your favorite MAX-CSP/WCSP/WSAT local search solver!
• Sampling: generate random samples and compute the values of interest from the samples, not from the original network
Sampling• Input: Bayesian network with set of nodes X
• Sample = a tuple with assigned values s = (X1=x1, X2=x2, ..., Xk=xk)
• A tuple may include all variables (except evidence) or a subset
• Sampling schemas dictate how to generate samples (tuples)
• Ideally, samples are distributed according to P(X|E)
Sampling fundamentals
Given a set of variables X = {X1, X2, ..., Xn} that represent a joint probability distribution Π(X) and some function g(X), we can compute the expected value of g(X):
    E[g] = ∫ g(x) Π(x) dx
Sampling from Π(X): given independent, identically distributed (iid) samples S^1, S^2, ..., S^T from Π(X), where each sample is an instantiation S^t = {x_1^t, x_2^t, ..., x_n^t}, it follows from the Strong Law of Large Numbers that
    ĝ = (1/T) Σ_{t=1..T} g(S^t)  →  E[g]
Sampling basics
• Given random variable X, D(X) = {0, 1}
• Given P(X) = {0.3, 0.7}
• Generate k=10 samples: 0,1,1,1,0,1,1,0,1,0
• Approximate P'(X):
    P'(X=0) = #samples(X=0) / #samples = 4/10 = 0.4
    P'(X=1) = #samples(X=1) / #samples = 6/10 = 0.6
    P'(X) = {0.4, 0.6}
How to draw a sample?
• Given random variable X, D(X) = {0, 1}
• Given P(X) = {0.3, 0.7}
• Sample X ~ P(X):
draw a random number r ∈ [0, 1]; if r < 0.3 then set X=0, else set X=1
• Can generalize to any domain size (see the sketch below)
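A sketch of that generalization in Python (illustrative only): walk the cumulative distribution until it exceeds the random number r.

    import random

    def sample_categorical(probs):
        # probs: probabilities over domain values 0 .. len(probs)-1
        r = random.random()          # r uniform in [0, 1)
        cum = 0.0
        for value, p in enumerate(probs):
            cum += p                 # running CDF
            if r < cum:
                return value
        return len(probs) - 1        # guard against floating-point round-off

    print(sample_categorical([0.3, 0.7]))       # 0 w.p. 0.3, 1 w.p. 0.7
    print(sample_categorical([0.2, 0.5, 0.3]))  # works for any domain size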
Sampling in BN
• Same idea: generate a set of T samples
• Estimate the posterior marginal P(Xi|E) from the samples
• Challenge: X is a vector and P(X) is a huge distribution represented by the BN
• Need to know:
How to generate a new sample? How many samples T do we need? How to estimate P(E=e) and P(Xi|e)?
Sampling algorithms
• Forward Sampling
• Gibbs Sampling (MCMC)
Blocking, Rao-Blackwellised
• Likelihood Weighting
• Importance Sampling
• Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks
Forward sampling
• Case with no evidence, E = {}
• Case with evidence, E = e
Forward sampling - no evidence (Henrion, 1988)
Input: Bayesian network over X = {X1,...,XN}, N - #nodes, T - #samples
Output: T samples
Process nodes in topological order - first process the ancestors of a node, then the node itself:
1. For t = 1 to T
2.   For i = 1 to N
3.     Sample x_i^t from P(Xi | pa_i)
Sampling a value
What does it mean to sample x_i^t from P(Xi | pa_i)?
• Assume D(Xi) = {0,1} and P(Xi | pa_i) = (0.3, 0.7)
• Draw a random number r from [0,1]: if r falls in [0, 0.3], set Xi = 0; if r falls in (0.3, 1], set Xi = 1
Forward sampling (example)
Network: X1 → X2, X1 → X3, {X2,X3} → X4, with CPTs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3)
// generate sample k (no evidence)
1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. Sample x4 from P(x4|x2,x3)
Forward Sampling - Answering Queries
Task: given T samples {S^1, S^2, ..., S^T}, estimate P(Xi = xi):
    P̂(Xi = xi) = #samples(Xi = xi) / T
Basically, count the proportion of samples where Xi = xi
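A runnable Python sketch of forward sampling on the four-node example above (the CPT numbers are invented for illustration):

    import random

    def draw(probs):
        r, cum = random.random(), 0.0
        for v, p in enumerate(probs):
            cum += p
            if r < cum:
                return v
        return len(probs) - 1

    # Made-up CPTs for the network X1 -> {X2, X3} -> X4 (binary variables).
    P_x1 = [0.6, 0.4]
    P_x2 = {0: [0.7, 0.3], 1: [0.2, 0.8]}          # P(X2 | X1)
    P_x3 = {0: [0.5, 0.5], 1: [0.1, 0.9]}          # P(X3 | X1)
    P_x4 = {(0,0): [0.9, 0.1], (0,1): [0.4, 0.6],  # P(X4 | X2, X3)
            (1,0): [0.3, 0.7], (1,1): [0.05, 0.95]}

    def forward_sample():
        x1 = draw(P_x1)                  # topological order: parents first
        x2 = draw(P_x2[x1])
        x3 = draw(P_x3[x1])
        x4 = draw(P_x4[(x2, x3)])
        return (x1, x2, x3, x4)

    T = 50000
    samples = [forward_sample() for _ in range(T)]
    # Estimate P(X4 = 1) as the proportion of samples with X4 = 1.
    print(sum(s[3] == 1 for s in samples) / T)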
Forward sampling w/ evidence
Input: Bayesian network over X = {X1,...,XN}, N - #nodes, E - evidence, T - #samples
Output: T samples consistent with E
1. For t = 1 to T
2.   For i = 1 to N
3.     Sample x_i^t from P(Xi | pa_i)
4.     If Xi ∈ E and x_i^t ≠ e_i, reject the sample:
5.       set i = 1 and go to step 2
Forward sampling (example)
Same network: X1 → X2, X1 → X3, {X2,X3} → X4. Evidence: X3 = 0
// generate sample k
1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. If x3 ≠ 0, reject the sample and start again from step 1; otherwise
5. Sample x4 from P(x4|x2,x3)
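The same sketch as before, extended with rejection for the evidence X3 = 0 (again illustrative, reusing the hypothetical CPTs and draw() from the previous sketch):

    def forward_sample_with_evidence():
        # Retry until the sampled value of the evidence variable X3 matches 0.
        while True:
            x1 = draw(P_x1)
            x2 = draw(P_x2[x1])
            x3 = draw(P_x3[x1])
            if x3 != 0:
                continue               # reject: restart the whole sample
            x4 = draw(P_x4[(x2, x3)])
            return (x1, x2, x3, x4)

    samples = [forward_sample_with_evidence() for _ in range(10000)]
    # Samples are now drawn from P(X | X3 = 0); e.g. estimate P(X1 = 1 | X3 = 0).
    print(sum(s[0] == 1 for s in samples) / len(samples))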
Forward sampling: illustration
[Figure: let Y be a subset of evidence nodes s.t. Y = u; samples inconsistent with Y = u are rejected]
Forward sampling - How many samples?
Theorem: let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1-δ, it is enough to have
    T ≥ c / (ε² P(y))    for a suitable constant c
Derived from Chebychev's bound. A related additive-error bound:
    P( ŝ(y) ∈ [P(y)-ε, P(y)+ε] ) ≥ 1 - 2e^{-2Nε²}
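As a quick worked example (numbers chosen for illustration only): with relative error ε = 0.1 and a rare event P(y) = 0.01, the bound demands on the order of T ≥ c / (0.1² · 0.01) = 10,000·c samples, which is why rejection-based estimates of rare evidence get expensive.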
Forward sampling: performance
Advantages:
• P(xi | pa(xi)) is readily available
• Samples are independent!
Drawbacks:
• If evidence E is rare (P(e) is low), then we will reject most of the samples!
• Since P(y) in the bound on T is unknown, we must estimate P(y) from the samples themselves!
• If P(e) is small, T will become very big!
Problem: evidence!
• Forward Sampling
High rejection rate (even though the samples are independent)
• Fix evidence values instead:
Gibbs sampling (MCMC), Likelihood Weighting, Importance Sampling
Sampling algorithms
• Forward Sampling
• Gibbs Sampling (MCMC)
Blocking, Rao-Blackwellised
• Likelihood Weighting
• Importance Sampling
Gibbs Sampling• Markov Chain Monte Carlo method
(Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994)
• Samples are dependent and form a Markov chain
• Sample from P'(X|e) which converges to P(X|e)
• Guaranteed to converge when all P > 0
• Methods to improve convergence:
Blocking, Rao-Blackwellised
Gibbs Sampling (Pearl, 1988)
• A sample t ∈ [1,2,...] is an instantiation of all variables in the network:
    x^t = {X1 = x_1^t, X2 = x_2^t, ..., XN = x_N^t}
• Sampling process
Fix the values of the observed variables e
Instantiate node values in sample x^0 at random
Generate samples x^1, x^2, ..., x^T from P(X|e):
    X1 = x_1^{t+1} ← sampled from P(x1 | x_2^t, x_3^t, ..., x_N^t, e)
    X2 = x_2^{t+1} ← sampled from P(x2 | x_1^{t+1}, x_3^t, ..., x_N^t, e)
    ...
    XN = x_N^{t+1} ← sampled from P(xN | x_1^{t+1}, x_2^{t+1}, ..., x_{N-1}^{t+1}, e)
Compute posteriors from the samples
Ordered Gibbs Sampler
Generate sample x^{t+1} from x^t. In short, for i = 1 to N (process all variables in some order):
    Xi = x_i^{t+1} ← sampled from P(xi | x^t \ x_i, e)
Gibbs Sampling (Pearl, 1988)
Markov blanket:
Given its Markov blanket (parents, children, and their parents), Xi is independent of all other nodes:
    M_i = pa_i ∪ ch_i ∪ { pa_j : Xj ∈ ch_i }
Important:
    P(xi | x^t \ xi) = P(xi | markov_i^t)
    P(xi | x^t \ xi) ∝ P(xi | pa_i) · Π_{Xj ∈ ch_i} P(xj | pa_j)
Ordered Gibbs Sampling Algorithm
Input: X, E. Output: T samples {x^t}
• Fix evidence E, then generate samples from P(X | E):
1. For t = 1 to T (compute samples)
2.   For i = 1 to N (loop through variables)
3.     Sample x_i^t from P(Xi | markov^t \ Xi)
Answering Queries
• Query: P(xi | e) = ?
• Method 1: count the proportion of samples where Xi = xi:
    P̂(Xi = xi) = #samples(Xi = xi) / T
• Method 2: average probability (mixture estimator):
    P̂(Xi = xi) = (1/T) Σ_{t=1..T} P(Xi = xi | markov^t \ Xi)
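A compact Python sketch of the ordered Gibbs sampler on a toy chain X1 → X2 → X3 with evidence X3 = 1 (all CPT numbers invented; a real implementation would read the Markov-blanket conditionals off the network's CPTs). It computes both estimators from the slide above:

    import random

    P1  = [0.5, 0.5]                          # P(X1)
    P21 = {0: [0.8, 0.2], 1: [0.3, 0.7]}      # P(X2 | X1)
    P32 = {0: [0.9, 0.1], 1: [0.4, 0.6]}      # P(X3 | X2)
    e3 = 1                                    # evidence X3 = 1

    def draw(probs):
        return 0 if random.random() < probs[0] else 1

    def normalize(w):
        s = w[0] + w[1]
        return [w[0] / s, w[1] / s]

    x1, x2 = draw([0.5, 0.5]), draw([0.5, 0.5])   # random initial sample x^0
    counts, mixture, T = 0, 0.0, 20000
    for t in range(T):
        # Markov blanket of X1 is {X2}: P(x1 | x2) ∝ P(x1) P(x2 | x1)
        p1 = normalize([P1[v] * P21[v][x2] for v in (0, 1)])
        x1 = draw(p1)
        # Markov blanket of X2 is {X1, X3}: P(x2 | x1, e3) ∝ P(x2|x1) P(e3|x2)
        p2 = normalize([P21[x1][v] * P32[v][e3] for v in (0, 1)])
        x2 = draw(p2)
        counts += (x1 == 1)     # Method 1: histogram (counting) estimator
        mixture += p1[1]        # Method 2: mixture estimator for P(X1=1 | e)
    print(counts / T, mixture / T)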
Gibbs Sampling - example
X = {X1, X2, ..., X9}, E = {X9}
[Figure: the network over X1, ..., X9 with evidence node X9]
Gibbs Sampling - example
Initialize all unobserved variables at random:
X1 = x_1^0, X2 = x_2^0, X3 = x_3^0, X4 = x_4^0, X5 = x_5^0, X6 = x_6^0, X7 = x_7^0, X8 = x_8^0
Gibbs Sampling - example
Sample X1 from P(X1 | x_2^0, ..., x_8^0, x9), E = {X9}. Using the Markov blanket of X1:
    P(X1=0 | x_2^0, x_3^0, x9) = α P(X1=0) P(x_2^0 | X1=0) P(x_3^0 | X1=0)
    P(X1=1 | x_2^0, x_3^0, x9) = α P(X1=1) P(x_2^0 | X1=1) P(x_3^0 | X1=1)
Gibbs Sampling - example
Sample X2 from P(X2 | x_1^1, x_3^0, ..., x_8^0, x9), E = {X9}
The Markov blanket of X2 is {X1, X3, X4, X5}
Gibbs Sampling: Illustration
Gibbs Sampling: Burn-In• We want to sample from P(X | E)• But ... the starting point is random• Solution: throw away the first K samples• Known as "Burn-In"• What is K? Hard to tell. Use intuition.• Alternative: initialize with values sampled from an approximation of P(x|e), for example by running IBP first
Gibbs Sampling: Convergence• Converges to the stationary distribution π*:
    π* = π* P, where P is the transition kernel with p_ij = P(Xi → Xj)
• Guaranteed to converge iff the chain is:
irreducible, aperiodic, ergodic (∀i,j: p_ij > 0)
Gibbs Sampling: Performance• Advantage:
guaranteed to converge to P(X|E), as long as all probabilities are positive
• Disadvantage:
convergence may be slow
• Problems: samples are dependent! Statistical variance is too big in high-dimensional problems
Gibbs: Speeding Convergence
Objectives:
1. Reduce dependence between samples (autocorrelation)
Skip samples
Randomize the variable sampling order
2. Reduce variance
Blocking Gibbs Sampling
Rao-Blackwellisation
Skipping Samples• Pick only every k-th sample (Geyer, 1992)
Can reduce dependence between samples! Increases variance! Wastes samples!
Randomized Variable Order• Random Scan Gibbs Sampler
Pick each next variable Xi for update at random with probability pi, Σi pi = 1
• In the simplest case, the pi are distributed uniformly. In some instances, this reduces variance (MacEachern, Peruggia, 1999)
Blocking• Sample several variables together, as a block• Example: given three variables X, Y, Z, with domains of size 2, group Y and Z together to form a variable W = {Y,Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:
    x^{t+1} ← P(x | y^t, z^t) = P(x | w^t)
    w^{t+1} = (y^{t+1}, z^{t+1}) ← P(w | x^{t+1})
+ Can improve convergence greatly when two variables are strongly correlated!
- The domain of the block variable grows exponentially with the number of variables in a block!
Rao-Blackwellisation• Do not sample all variables!• Sample a subset!• Example: given three variables X, Y, Z, sample only X and Y, summing out Z. Given sample (x^t, y^t), compute the next sample:
    x^{t+1} ← P(x | y^t)
    y^{t+1} ← P(y | x^{t+1})
Rao-Blackwell Theorem
Bottom line: reducing the number of variables in a sample reduces variance!
Blocking vs. Rao-Blackwellisation
[Figure: a three-variable network over X, Y, Z]
• Standard Gibbs: P(x|y,z), P(y|x,z), P(z|x,y)  (1)
• Blocking: P(x|y,z), P(y,z|x)  (2)
• Rao-Blackwellised: P(x|y), P(y|x)  (3)
Var3 < Var2 < Var1 (Liu, Wong, Kong, 1994)
Rao-Blackwellised Gibbs: Cutset Sampling
• Select C ⊆ X (possibly a cycle-cutset), |C| = m
• Fix evidence E
• Initialize nodes with random values: for i = 1 to m, Ci = c_i^0
• For t = 1 to T, generate samples c^{t+1} = {C1 = c_1^{t+1}, ..., Cm = c_m^{t+1}}:
For i = 1 to m: Ci = c_i^{t+1} ← P(ci | c_1^{t+1}, ..., c_{i-1}^{t+1}, c_{i+1}^t, ..., c_m^t, e)
Cutset Sampling - generating samples
Generate sample c^{t+1} from c^t:
    C1 = c_1^{t+1} ← sampled from P(c1 | c_2^t, c_3^t, ..., c_m^t, e)
    C2 = c_2^{t+1} ← sampled from P(c2 | c_1^{t+1}, c_3^t, ..., c_m^t, e)
    ...
    Cm = c_m^{t+1} ← sampled from P(cm | c_1^{t+1}, c_2^{t+1}, ..., c_{m-1}^{t+1}, e)
In short: Ci = c_i^{t+1} ← sampled from P(ci | c^t \ ci, e)
Cutset Sampling• How to choose C?
Special case: C is a cycle-cutset, O(N)
General case: apply Bucket Tree Elimination (BTE), O(exp(w)), where w is the induced width of the network when the nodes in C are observed
Pick C wisely so as to minimize w => notion of w-cutset
w-cutset Sampling• C = w-cutset of the network: a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w
• Complexity of exact inference is then bounded by w!
• Cycle-cutset is a special case!
Cutset Sampling - Answering Queries
• Query: ci ∈ C, P(ci | e) = ? Same mixture estimator as Gibbs (special case of w-cutset):
    P̂(ci | e) = (1/T) Σ_{t=1..T} P(ci | c^t \ ci, e)   - computed while generating sample t
• Query: P(xi | e) = ?
    P̂(xi | e) = (1/T) Σ_{t=1..T} P(xi | c^t, e)   - computed after generating sample t (easy because C is a cutset)
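A high-level Python sketch of the cutset-sampling loop. The helper exact_conditional is a hypothetical stand-in for exact inference (BTE/join-tree elimination) over the non-cutset variables, not a real library call:

    import random

    def cutset_sample(cutset, T, exact_conditional, domains, evidence):
        # exact_conditional(Ci, others, evidence) is assumed to return the
        # exact distribution P(Ci | assignment of the other cutset vars, e),
        # e.g. computed by bucket/join-tree elimination over the rest.
        state = {c: random.choice(domains[c]) for c in cutset}   # random c^0
        marginals = {c: {v: 0.0 for v in domains[c]} for c in cutset}
        for t in range(T):
            for c in cutset:
                others = {k: v for k, v in state.items() if k != c}
                dist = exact_conditional(c, others, evidence)    # P(c | c^t \ c, e)
                for v in domains[c]:
                    marginals[c][v] += dist[v] / T   # mixture estimator
                r, cum = random.random(), 0.0        # draw the new value
                for v in domains[c]:
                    cum += dist[v]
                    if r < cum:
                        state[c] = v
                        break
        return marginals    # estimates of P(ci | e) for each cutset variable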
Cutset Sampling Example
[Figure: network over X1, ..., X9 with evidence E = {X9 = x9}; cutset C = {X2, X5}, initialized as c^0 = {x_2^0, x_5^0}]
Cutset Sampling Example
Sample a new value for X2 (given c^0 = {x_2^0, x_5^0}):
    x_2^1 ← P(x2 | x_5^0, x9) = BTE(x2', x_5^0, x9) / Σ_{x2''} BTE(x2'', x_5^0, x9)
where each BTE(·) term is computed exactly by Bucket Tree Elimination over the rest of the network
Cutset Sampling Example
Sample a new value for X5 (given x_2^1):
    x_5^1 ← P(x5 | x_2^1, x9) = BTE(x5', x_2^1, x9) / Σ_{x5''} BTE(x5'', x_2^1, x9)
giving the new sample c^1 = {x_2^1, x_5^1}
Cutset Sampling Example
Query P(x2 | e) for the sampled node X2, over three samples:
    Sample 1: x_2^1 ← P(x2 | x_5^0, x9)
    Sample 2: x_2^2 ← P(x2 | x_5^1, x9)
    Sample 3: x_2^3 ← P(x2 | x_5^2, x9)
    P̂(x2 | x9) = (1/3) [ P(x2 | x_5^0, x9) + P(x2 | x_5^1, x9) + P(x2 | x_5^2, x9) ]
Cutset Sampling Example
Query P(x3 | e) for the non-sampled node X3, using the samples c^1 = {x_2^1, x_5^1}, c^2 = {x_2^2, x_5^2}, c^3 = {x_2^3, x_5^3}:
    P̂(x3 | x9) = (1/3) [ P(x3 | x_2^1, x_5^1, x9) + P(x3 | x_2^2, x_5^2, x9) + P(x3 | x_2^3, x_5^3, x9) ]
CPCS179 Test Results
[Figure: MSE vs. #samples (left, 100-4000) and vs. time (right, 0-80 seconds) for Cutset and Gibbs sampling; non-ergodic network (1 deterministic CPT entry), |X| = 179, |C| = 8, 2 ≤ |D(Xi)| ≤ 4, |E| = 35; exact time = 122 sec using loop-cutset conditioning]
CPCS360b Test Results
[Figure: MSE vs. #samples (left, 0-1000) and vs. time (right, 1-60 seconds) for Cutset and Gibbs sampling; ergodic network, |X| = 360, |D(Xi)| = 2, |C| = 21, |E| = 36; exact time > 60 min using cutset conditioning; exact values obtained via Bucket Elimination]
Sampling algorithms
• Forward Sampling
• Gibbs Sampling (MCMC)
Blocking, Rao-Blackwellised
• Likelihood Weighting
• Importance Sampling
Likelihood Weighting
(Fung and Chang, 1990; Shachter and Peot, 1990)
• "Clamping" evidence + forward sampling + weighting samples by the evidence likelihood
Works well for likely evidence!
Likelihood Weighting
[Figure: a chain of nodes with evidence nodes e interspersed; sample in topological order over X!]
    x_i ← P(Xi | pa_i);  P(Xi | pa_i) is a look-up in the CPT!
Likelihood Weighting Outline
    w ← 1
    ForEach Xi (in topological order) Do
        If Xi ∈ E (evidence Xi = ei), set Xi = ei and w ← w · P(ei | pa_i)
        Else sample Xi = xi from P(Xi | pa_i)
    EndFor
Likelihood Weighting
Estimate posterior marginals P(Xi | e):
    P̂(xi | e) = P̂(xi, e) / P̂(e) = [ Σ_{t=1..T} w^(t) δ(xi, x^(t)) ] / [ Σ_{t=1..T} w^(t) ]
where δ(xi, x^(t)) = 1 if sample x^(t) contains xi, and 0 otherwise
Likelihood Weighting
• Converges to the exact posterior marginals• Generates samples fast• The sampling distribution is close to the prior (especially if E ⊆ leaf nodes)• Increased sampling variance
Convergence may be slow; many samples with P(x^(t)) = 0 are effectively rejected
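A runnable Python sketch of likelihood weighting on a tiny chain X1 → X2 with evidence X2 = 1 (CPT numbers invented; the normalized-weight ratio implements the posterior estimate above):

    import random

    P1  = [0.6, 0.4]                           # P(X1)
    P21 = {0: [0.7, 0.3], 1: [0.2, 0.8]}       # P(X2 | X1)
    e2 = 1                                     # evidence X2 = 1

    def draw(probs):
        return 0 if random.random() < probs[0] else 1

    num = [0.0, 0.0]                           # weighted counts for X1 = 0, 1
    den = 0.0                                  # sum of weights, estimates P(e)
    for t in range(100000):
        x1 = draw(P1)                          # sample the non-evidence node
        w = P21[x1][e2]                        # clamp X2 = 1, weight by P(e2|x1)
        num[x1] += w
        den += w
    print([n / den for n in num])              # estimate of P(X1 | X2 = 1)
    # Exact check: P(X1=1 | X2=1) = 0.4*0.8 / (0.6*0.3 + 0.4*0.8) = 0.64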
Sampling algorithms
• Forward Sampling
• Gibbs Sampling (MCMC)
Blocking, Rao-Blackwellised
• Likelihood Weighting
• Importance Sampling
Importance Sampling Idea• In general, it is hard to sample from the target distribution P(X|E)
• Generate samples from a sampling (proposal) distribution Q(X)
• Weigh each sample against P(X|E):
    I = ∫ f(x) P(x) dx = ∫ f(x) [P(x) / Q(x)] Q(x) dx
Importance Sampling Theory
Let Z = X \ E. Then
    P(E=e) = Σ_{X\E} P(X\E, E=e) = Σ_{X\E} Π_{i=1..n} P(Xi | pa_i)|_{E=e}
which simplifies to
    P(E=e) = Σ_Z P(Z, e)
Importance Sampling Theory
• Given a proposal distribution Q such that P(Z=z, e) > 0 => Q(Z=z) > 0:
    P(E=e) = Σ_{z∈Z} P(Z=z, e) = Σ_{z∈Z} [ P(Z=z, e) / Q(Z=z) ] Q(Z=z)
By the definition of expected value, E_Q[g(Z)] = Σ_{z∈Z} g(z) Q(z), so
    P(E=e) = E_Q[ P(Z, e) / Q(Z) ] = E_Q[ w(Z) ]
where w(Z=z) = P(Z=z, e) / Q(Z=z) is called the importance weight
Importance Sampling Theory
Given a set of samples (z^1, ..., z^N) drawn from Q:
    P̂(E=e) = (1/N) Σ_{i=1..N} P(Z=z^i, e) / Q(Z=z^i) = (1/N) Σ_{i=1..N} w(z^i)
As N → ∞, P̂(E=e) → P(E=e)
Underlying principle: approximate an average over a set of numbers by an average over a set of sampled numbers
Importance Sampling (Informally)• Express the problem as computing the average over a set of real numbers• Sample a subset of the numbers• Approximate the true average by the sample average
True average:
• Average of (0.11, 0.24, 0.55, 0.77, 0.88, 0.99) = 0.59
Sample average over 2 samples:
• Average of (0.24, 0.77) = 0.505
How to generate samples from Q
• Express Q in product form: Q(Z) = Q(Z1) Q(Z2|Z1) ... Q(Zn|Z1,...,Zn-1)
• Sample along the order Z1, ..., Zn
• Example:
Q(Z1) = (0.2, 0.8)
Q(Z2|Z1) = (0.2, 0.8, 0.1, 0.9)
Q(Z3|Z1,Z2) = Q(Z3|Z1) = (0.5, 0.5, 0.3, 0.7)
Reminder: P̂(E=e) = (1/N) Σ_{i=1..N} P(Z=z^i, e) / Q(Z=z^i)
How to sample from Q?
• Each sample Z = z:
Sample Z1 = z1 from Q(Z1)
Sample Z2 = z2 from Q(Z2|Z1=z1)
Sample Z3 = z3 from Q(Z3|Z1=z1)
• Generate N such samples (z^1, ..., z^N) and estimate
    P̂(E=e) = (1/N) Σ_{i=1..N} P(Z=z^i, e) / Q(Z=z^i) = (1/N) Σ_{i=1..N} w(z^i)
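A small Python sketch of this recipe, using the Q tables from the example above. The function P_joint_with_evidence is a hypothetical stand-in for the CPT product P(Z=z, e) of a real network; its table entries are invented and sum to 0.50, so the estimate should converge to 0.50:

    import random

    Q1 = [0.2, 0.8]                                  # Q(Z1)
    Q2 = {0: [0.2, 0.8], 1: [0.1, 0.9]}              # Q(Z2 | Z1)
    Q3 = {0: [0.5, 0.5], 1: [0.3, 0.7]}              # Q(Z3 | Z1), indep. of Z2

    def draw(probs):
        return 0 if random.random() < probs[0] else 1

    def P_joint_with_evidence(z1, z2, z3):
        # Hypothetical stand-in for P(Z=z, e); in a real network this would
        # multiply the clamped CPT entries along the assignment z.
        table = {(0,0,0): .02, (0,0,1): .03, (0,1,0): .05, (0,1,1): .10,
                 (1,0,0): .04, (1,0,1): .06, (1,1,0): .08, (1,1,1): .12}
        return table[(z1, z2, z3)]

    N, acc = 100000, 0.0
    for _ in range(N):
        z1 = draw(Q1)
        z2 = draw(Q2[z1])
        z3 = draw(Q3[z1])
        q = Q1[z1] * Q2[z1][z2] * Q3[z1][z3]          # Q(Z = z)
        acc += P_joint_with_evidence(z1, z2, z3) / q  # importance weight w(z)
    print(acc / N)                                    # estimate of P(E = e)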
Likelihood weighting• Q = prior distribution = the CPTs of the Bayesian network
Likelihood weighting example
Network: Smoking (S) → {lung Cancer (C), Bronchitis (B)}; X-ray (X) with P(X|C,S); Dyspnoea (D) with P(D|C,B)
    P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
Query: P(X=1, B=0) = ?, where 1 = true and 0 = false:
    P(X=1, B=0) = Σ_{S,C,D} P(S) P(C|S) P(B=0|S) P(X=1|C,S) P(D|C,B=0)
Likelihood weighting example
Q = prior, restricted to the unobserved variables Z = {S, C, D}:
    Q(S, C, D) = Q(S) Q(C|S) Q(D|C,B=0) = P(S) P(C|S) P(D|C,B=0)
Sampling: sample S=s from P(S), then C=c from P(C|S=s), then D=d from P(D|C=c,B=0)
Importance weight:
    w(Z=z) = P(Z=z, e) / Q(Z=z)
           = [ P(s) P(c|s) P(B=0|s) P(X=1|c,s) P(d|c,B=0) ] / [ P(s) P(c|s) P(d|c,B=0) ]
           = P(B=0|s) · P(X=1|c,s)
How to solve belief updating?
    P(Xi=xi | e) = P(Xi=xi, e) / P(e)
Estimate the numerator (evidence is Xi=xi, e) and the denominator (evidence is e) by importance sampling:
    P̂(Xi=xi | e) = [ Σ_{j=1..N} δ(xi, z^j) w(z^j) ] / [ Σ_{j=1..N} w(z^j) ]
where δ(xi, z^j) = 1 iff sample z^j contains Xi = xi, and 0 otherwise
Summary