
Mixing Exact and Importance Sampling Propagation Algorithms in Dependence Graphs

Luis D. Hernández*
Dpto. de Informática y Sistemas, Facultad de Informática, Universidad de Murcia, 30071 Espinardo, Murcia, Spain

Serafín Moral†
Dpto. CC. de la Computación e I.A., E.T.S. de Informática, Avd. de Andalucía, 38, Universidad de Granada, 18071 Granada, Spain

In this article a new algorithm is presented for the propagation of probabilities in junction trees. It is based on a hybrid methodology. Given a junction tree, some of the nodes carry out an exact calculation, and the others an approximation by Monte Carlo methods. For the exact calculation we use the Shafer/Shenoy method, and for the Monte Carlo estimation a general class of importance sampling algorithms. We briefly study how to apply this sampler to the clusters in a junction tree. The basic algorithm and some of its variations are presented, depending on the family of functions to which we apply the importance sampler: the potentials and/or the messages in the tree. An experimental evaluation is carried out, comparing their performance with the well-known likelihood weighting approximate algorithm. This family of methods shows a very promising performance. © 1997 John Wiley & Sons, Inc.

I. INTRODUCTION

The calculation of probabilities with a great number of variables is a difficult task because of its complexity. In fact, the first expert systems used nonprobabilistic methods to represent uncertainty in order to avoid this problem. That is the case of MYCIN1 or PROSPECTOR.2

Things started to change with the so-called propagation algorithms.3–8 The basic idea of these algorithms is to decompose the global problem into several smaller subproblems which communicate among themselves in an appropriate

*e-mail: [email protected]
†e-mail: [email protected]

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, VOL. 12, 553–576 (1997) 1997 John Wiley & Sons, Inc. CCC 0884-8173/97/080553-24


way. Since each of these subproblems involves only a few variables, we can solve them in order to obtain a correct global solution.

Shachter et al.9 have demonstrated that, in essence, all the exact propagation algorithms are equivalent and are based on calculating a triangulation of the dependence graph. Taking this triangulation as a basis, the problem is decomposed into smaller subproblems. These algorithms receive the name of algorithms based on clusters (i.e., subproblems). Although these algorithms represent a big step forward in obtaining solutions to practical problems, they cannot be used in all situations. There are many cases in which they are completely inefficient. The main problem is that their performance depends strongly on the topology of the network: they are especially sensitive to the number of cycles in the graph. In fact, exact inference is an NP-hard problem.10

As a result of this situation, approximate methods have been developed. These methods are mainly based on Monte Carlo estimation techniques. Their advantage is that their cost grows polynomially with respect to the size of the problem, so they can be applied to a wider variety of situations. However, for a given degree of precision, the approximate inference problem is NP-hard too.11

There are two main classes of Monte Carlo methods: one is based on Markov chains and the other on importance sampling techniques. Both try to solve the problem of sampling from difficult distributions. The methods based on Markov chains obtain dependent samples in which each case depends on the previous one (a case is a configuration: a value assigned to each variable of the problem). Usually, Markov chain Monte Carlo proceeds by selecting a variable and then changing its value taking into account the values of the rest of the variables. In this way, each configuration is strongly dependent on the last one. However, it can be shown that if we use this sample to estimate the probabilities of the variables, under suitable conditions, we have convergence to the true values.12 Some problems of this convergence were studied by Chin and Cooper.13 Jensen, Kong, and Kjærulff14 have generalized this scheme, making it possible to generate samples with a higher degree of independence by modifying more than one variable in each step.

Importance sampling algorithms are based on independent samples obtained with a distribution different from the one we want to estimate. The samples have to be weighted to compensate for the difference with the true distribution. Usually the variables are simulated in the natural order of the graph, that is, going from parents to children. This simplifies the simulation process. The first approximate algorithm in this group was proposed by Henrion15 and is known as logic sampling. The algorithm works well when there is no evidence in the graph, but it is very inefficient when observations are present, especially when these observations have a small probability. An algorithm which tries to solve this inconvenience is likelihood weighting. It was developed, independently, by Fung and Chang16 and Shachter and Peot.17 This method attempts to solve the inconvenience of logic sampling and, in general, it offers very good performance. However, simple examples can be given in which this procedure has the same problems as logic sampling.18,19


Fung and Del Favero20 have given a new simulation procedure in which the natural order for the simulation of variables is not always followed and sampling starts at the observations, going backward to the root nodes. A more sophisticated approach has been presented by Cano et al.18 This procedure considers arbitrary orders of selection of the variables, with entropy being the main guide for choosing a variable to sample. Variables with less entropy, which we know better, are selected first in order to obtain a "good" sample.

Over the last years, a new family of methods has emerged: the so-called hybrid methods.21 They are based on the following two ideas: (1) break the problem down into a combination of subproblems as if we were to apply an exact method; (2) apply the most appropriate algorithm to each subproblem taking into account its characteristics. One obvious classification of the subproblems is by their size, applying exact algorithms to the small and easy problems and approximate procedures to the large and difficult ones. The idea is more ambitious, allowing the use of continuous variables, symbolic specification of the probabilities, and even different formalisms to represent each one of the subproblems. However, the first effective hybrid algorithms mix particular exact and Monte Carlo algorithms.22

This article presents a new class of hybrid exact and Monte Carlo algorithms. The main difference with Kjærulff's approach22 is that here the Monte Carlo algorithm is the importance sampler considered by Cano et al.18 instead of the Gibbs sampling methodology. Another difference, in this article, is the exact algorithm considered. The initial algorithms are based on the Shafer/Shenoy8 methodology instead of the Hugin algorithm.23,24 They are later modified to resemble the Hugin algorithm, having a similar computational cost. This will allow us to compare different versions of hybrid algorithms depending on what the objectives of the simulation are: to estimate the potentials, the messages, or the potentials including all the messages. The latter is the most similar to the Hugin algorithm. These different versions will have different performances in terms of cost and accuracy.

The article is structured in the following way. First we present the basic definitions and notation (Sec. II). These are perhaps complicated, but they are necessary for a precise specification of the problem and the algorithms. In Section III, we introduce the algorithms. First, we briefly recall the basic Shafer/Shenoy algorithm.8 Next, we give the importance sampling algorithm as a general Monte Carlo method (subsection III-B). Subsection III-C presents the foundations of hybrid algorithms according to Dawid et al.21 and three new hybrid algorithms. Finally, Section IV is devoted to an experimental evaluation of the algorithms and Section V to the conclusions.

II. NOTATION AND DEFINITIONS

Consider a set of random variables $\{X_i\}_{i\in N}$, with $N = \{1, 2, \ldots, n\}$, and assume that each variable $X_i$ takes values on a finite set $\Omega_i$. We will denote the $n$-dimensional variable $(X_1, X_2, \ldots, X_n)$ by $X$, which takes values on the Cartesian product $\Omega_N = \prod_{i\in N} \Omega_i$.


If $x_I \in \Omega_I = \prod_{i\in I} \Omega_i$ and $J \subset I$, we denote by $x_I^{\downarrow J}$ the element of $\Omega_J = \prod_{j\in J} \Omega_j$ obtained from $x_I$ by eliminating the coordinates which are not in $J$.

In general, in this article we shall consider functions defined on $\Omega_I$ ($I \subseteq N$) and taking values in $\mathbb{R}^+_0$, the set of non-negative real numbers.

If $h$ is a function defined on $\Omega_I$, and $x_J \in \Omega_J$, then $R_J(h, x_J)$ will denote the function defined on $\Omega_{I-J}$ by:

$$R_J(h, x_J)(y_{I-J}) = h(z) \qquad (1)$$

where $z \in \Omega_I$ is given by $z^{\downarrow J} = x_J$, $z^{\downarrow I-J} = y_{I-J}$. That is, $R_J(h, x_J)$ is the function obtained from $h$ by fixing the coordinates in $J$ to the value $x_J$ and leaving the others free. We call this function the reduction of $h$ by $x_J$.

Given a function $h: \Omega_I \to \mathbb{R}^+_0$, the set of indices on which $h$ is defined will be denoted by $s(h)$ [i.e., $s(h) = I$]. Given $h$, $q(h)$ will be the sum of all the values of $h$ [i.e., $q(h) = \sum_{x_I} h(x_I)$], and if $q(h) \neq 0$, then $Q(h)$ will be the function given by:

$$Q(h)(x) = \frac{h(x)}{q(h)} \qquad \forall x \in \Omega_I \qquad (2)$$

That is, $Q(h)$ is the normalization of $h$ so that it adds up to 1. It will be called the sampling probability associated with $h$.

We define the sampling probability associated with $h$ conditioned to a configuration $x_K$ of $\Omega_K$ as the function obtained by the reduction of $h$ by the configuration $x_K$, using (1), and its subsequent normalization using (2), that is:

$$Q(h \mid x_K) = Q(R_K(h, x_K)) = \frac{R_K(h, x_K)}{q(R_K(h, x_K))} \qquad (3)$$

When there is no doubt about the conditioning set, and in order to simplify the notation, $Q(h \mid x_K)$ will be denoted as $Q(h)$.
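To make these operations concrete, the following minimal Python sketch (our own illustration, not part of the paper) stores a potential $h$ as a dictionary keyed by value tuples; `reduce_by` plays the role of the reduction $R_J(h, x_J)$ of (1) and `sampling_prob` that of the normalization $Q(h)$ of (2)-(3). The variable names and table values are invented.

```python
# A potential h is stored as (vars, table):
#   vars  -- tuple of variable names, the set s(h)
#   table -- dict mapping a tuple of values (one per variable) to a non-negative real

def reduce_by(h, fixed):
    """R_J(h, x_J): fix the coordinates in `fixed` (a dict var -> value) and drop them."""
    vars_, table = h
    keep = tuple(v for v in vars_ if v not in fixed)
    out = {}
    for config, value in table.items():
        assignment = dict(zip(vars_, config))
        if all(assignment[v] == x for v, x in fixed.items() if v in assignment):
            out[tuple(assignment[v] for v in keep)] = value
    return keep, out

def sampling_prob(h):
    """Q(h): normalize h so that its values add up to 1 (requires q(h) != 0)."""
    vars_, table = h
    q = sum(table.values())
    return vars_, {config: value / q for config, value in table.items()}

# Hypothetical potential on two binary variables X and Y.
h = (("X", "Y"), {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.3})
print(sampling_prob(reduce_by(h, {"Y": 1})))   # Q(h | Y = 1), a distribution on X
```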

A dependence graph on the set of variables $\{X_i\}_{i\in N}$ is an acyclic directed graph which reflects in its topology the independence relationships among the variables by means of the d-separation criterion.3 Each node of the graph is represented by an index $i \in N$, which represents variable $X_i$. Taking into account the independence relationships of the graph, a global probability for all the variables can be given by a probability distribution for each one of the variables $X_i$ conditioned to its parents. That is, if $F(i)$ is the set of parents of node $i$, there is a function $f_i$, defined for the variables $X_{\{i\}\cup F(i)}$,

$$f_i(x^{\downarrow s(f_i)}) = f_i(x^{\downarrow i}, x^{\downarrow F(i)}) \qquad \forall x^{\downarrow s(f_i)} \in \Omega_{\{i\}\cup F(i)}$$

verifying, for each $x^{\downarrow F(i)} \in \Omega_{F(i)}$:

$$\sum_{x^{\downarrow i}} f_i(x^{\downarrow s(f_i)}) = 1$$

And the joint probability for the $n$-dimensional variable $X = (X_1, X_2, \ldots, X_n)$ can be calculated by means of the expression:

$$f(x) = \prod_{i\in N} f_i(x^{\downarrow s(f_i)}) = \prod_{i\in N} f_i(x^{\downarrow i}, x^{\downarrow F(i)}) \qquad \forall x \in \Omega_N \qquad (4)$$

A variable $X_i$ is said to be observed if we know the exact value $e_i \in \Omega_i$ of this variable. Each observed variable $X_i$ has an associated Dirac delta function $\delta_{e_i}$ defined on $\Omega_i$ by:

$$\delta_{e_i}(x_i) = \begin{cases} 1, & \text{if } x_i = e_i \\ 0, & \text{if } x_i \neq e_i \end{cases} \qquad \forall x_i \in \Omega_i \qquad (5)$$

The set of indices $E \subseteq N$ of the observed variables is called the set of observations, and the vector of values $e_E = (e_i)_{i\in E}$ taken by these variables, the evidence of the model.

The problem that probabilistic propagation algorithms try to solve is the following: given a set of observations $E$, evidence $e_E = (e_i)_{i\in E}$, and a variable $X_l$, to calculate the a posteriori distribution of $X_l$ given the evidence $x_E = e_E$, that is,

$$P(x_l \mid e_E) \propto P(x_l \cap e_E) = \sum_{x^{\downarrow l} = x_l} P(x \cap e_E), \qquad \forall x_l \in \Omega_l$$

The probability $P(x \cap e_E)$ is equal to the product of functions,

$$P(x \cap e_E) = \left( \prod_{i\in N} f_i(x^{\downarrow i}, x^{\downarrow F(i)}) \right) \cdot \left( \prod_{i\in E} \delta_{e_i}(x^{\downarrow i}) \right) \qquad (6)$$

So, to calculate $P(x_l \mid e_E)$ it suffices to calculate,

$$\sum_{x^{\downarrow l} = x_l} \left( \prod_{i\in N} f_i(x^{\downarrow i}, x^{\downarrow F(i)}) \right) \cdot \left( \prod_{i\in E} \delta_{e_i}(x^{\downarrow i}) \right), \qquad \forall x_l \in \Omega_l \qquad (7)$$

and to normalize these values so that the conditional probabilities add up to 1.
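As a point of reference, expressions (6)-(7) can be evaluated by brute force on a very small network. The sketch below is our own toy illustration (the three-variable chain and its tables are invented): it enumerates $\Omega_N$, multiplies the conditional tables and the Dirac deltas of the evidence, accumulates the sum for each value of the query variable, and normalizes. Its cost is exponential in $n$, which is precisely why propagation algorithms are needed.

```python
from itertools import product

# Toy chain A -> B -> C, all binary; conditional tables are invented.
f_A = {(0,): 0.6, (1,): 0.4}                                  # P(A)
f_B = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}    # P(B | A), keyed by (a, b)
f_C = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}    # P(C | B), keyed by (b, c)

evidence = {"C": 1}        # the observed value e_C
target = "B"               # the query variable X_l

def joint_with_evidence(x):
    """Expression (6): product of the conditionals times the Dirac deltas of the evidence."""
    value = f_A[(x["A"],)] * f_B[(x["A"], x["B"])] * f_C[(x["B"], x["C"])]
    return value if all(x[v] == e for v, e in evidence.items()) else 0.0

# Expression (7): sum over all configurations compatible with each value of the target.
scores = {}
for a, b, c in product((0, 1), repeat=3):
    x = {"A": a, "B": b, "C": c}
    scores[x[target]] = scores.get(x[target], 0.0) + joint_with_evidence(x)

total = sum(scores.values())
print({v: s / total for v, s in scores.items()})               # P(B | C = 1)
```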

III. THE ALGORITHM

A. The Exact Algorithm

Here we briefly describe Shafer and Shenoy's propagation algorithm.6,8 It has two main parts:

(1) Decompose the model.
(2) Communicate the subproblems.

The first part is carried out by triangulation of the moral graph associated with the directed acyclic graph and the subsequent construction of a junction tree. This is a tree of clusters, or groups of variables, which verifies the following conditions:


Figure 1. (a) A causal network. (b) The moral graph of (a). (c) A triangulation of (b) and (d) the junction tree associated to (c).

(1) If a variable is in two clusters, then this common variable is in all the clusters on the path joining them.

(2) For each function $f_i$ in the belief network (and for each observation $\delta_i$) there is at least one cluster $C$ which verifies $s(f_i) \subseteq C$ ($s(\delta_i) \subseteq C$).

Each function $f_i$ and observation $\delta_i$ is assigned to one of the clusters verifying condition 2. In this way, each cluster has an associated set of functions $L_C$. This allows us to define a function $\psi_C$, called the potential of $C$, which is equal to the product of the functions in $L_C$. Thus, the topological decomposition of the variables (by triangulation) allows us to decompose the function defining the joint probability of the model into a set of functions $\psi_C$, one for each cluster, verifying:

$$P(x \cap e_E) = \left( \prod_{i\in N} f_i(x^{\downarrow i}, x^{\downarrow F(i)}) \right) \cdot \left( \prod_{i\in E} \delta_{e_i}(x^{\downarrow i}) \right) = \prod_C \psi_C(x^{\downarrow C}) \qquad (8)$$

Then our objective, according to expression (7), will be to calculate:

$$\sum_{x^{\downarrow l} = x_l} \prod_C \psi_C(x^{\downarrow C}) \qquad (9)$$

If a cluster $C$ has an empty set of associated functions, then the potential $\psi_C$ is equal to 1. An example may be seen in Figure 1.

The second part of the algorithm is to send the information among the clusters. For each pair of adjacent clusters $C$ and $D$, a new cluster $S = C \cap D$ is defined on the edge which joins them. This cluster is called the separator. Notice that the new tree with separators is a junction tree, too. Two new functions are defined for each separator: $M^{C\leftarrow}_S$ and $M^{C\rightarrow}_S$ (initially equal to 1). $M^{C\leftarrow}_S$ represents the information that $C$ receives from $D$ through $S$; and the second one, $M^{C\rightarrow}_S$, represents the information that $C$ sends to $D$ through $S$ (see Refs. 8 and 9 for more details). $M^{C\leftarrow}_S$ and $M^{C\rightarrow}_S$ are also denoted as $M^{D\rightarrow}_S$ and $M^{D\leftarrow}_S$, respectively.

Thus, if $S$ is the separator between $C$ and $D$, Shafer and Shenoy's method considers the following potentials:

$$\psi_C(x^{\downarrow C}) = \prod_{f_i \in L_C} f_i(x^{\downarrow s(f_i)}) \qquad (10)$$

$$M^{C\leftarrow}_S(x^{\downarrow S}) = M^{D\rightarrow}_S(x^{\downarrow S}) = \sum_{x^{\downarrow D-S}} \psi_D(x^{\downarrow D}) \prod_{S' \in Sep'(D)} M^{D\leftarrow}_{S'}(x^{\downarrow S'}) \qquad (11)$$

$$M^{C\rightarrow}_S(x^{\downarrow S}) = M^{D\leftarrow}_S(x^{\downarrow S}) = \sum_{x^{\downarrow C-S}} \psi_C(x^{\downarrow C}) \prod_{S' \in Sep'(C)} M^{C\leftarrow}_{S'}(x^{\downarrow S'}) \qquad (12)$$

for every value $x \in \Omega$. The sets $Sep'(C)$ and $Sep'(D)$ are $Sep(C) - \{S\}$ and $Sep(D) - \{S\}$, respectively, where $Sep(C)$ is the set of separators of the cluster $C$ and $Sep(D)$ is the set of separators of the cluster $D$.

To send a message from $C$ to $D$ is to calculate $M^{C\rightarrow}_S(x^{\downarrow S})$. An alternative way of doing this is to calculate the potential on $C$ including the input messages:

$$\Psi_C(x^{\downarrow C}) = \psi_C(x^{\downarrow C}) \prod_{S' \in Sep(C)} M^{C\leftarrow}_{S'}(x^{\downarrow S'}) \qquad (13)$$

and then,

$$M^{C\rightarrow}_S(x^{\downarrow S}) = \frac{\sum_{x^{\downarrow C-S}} \Psi_C(x^{\downarrow C})}{M^{C\leftarrow}_S(x^{\downarrow S})} \qquad (14)$$

where $0/0 = 0$.

The information received by cluster $C$ from its separator $S$, $M^{C\leftarrow}_S$ [see (11)], can be seen as the processed information contained in the part of the tree connected to $C$ through $S$ (see Fig. 2).

In order to modify these potentials, the following basic operations are used:*

Absorption. Given a cluster $C$, it absorbs the information from its neighboring clusters if its potential $\psi_C$ is modified by the expression:

$$\psi^{\text{new}}_C = \psi^{\text{old}}_C \times \prod_{S \in Sep(C)} M^{C\leftarrow}_S \qquad (15)$$

where $Sep(C)$ is the set of separators of the cluster $C$.

Petition to Collect Evidence (CE). If a cluster $C$ receives a request to collect evidence from $C_p$, then $C$ sends a CE petition to all its neighbors except $C_p$; once all the neighbors of $C$ have completed the task, $C$ sends the message $M^{C\rightarrow}_{C\cap C_p}$ to $C_p$.

Petition to Distribute Evidence (DE). If the cluster $C$ receives a DE request from the cluster $C_p$, then it sends the messages $M^{C\rightarrow}_{C\cap C_q}$ to all the neighbors $C_q$ of $C$ except $C_p$, and then sends a DE petition to them.

*Note that we use the same terminology as in the Hugin shell.

Figure 2. Propagation of messages between adjacent clusters.

With these basic operations the propagation algorithm can be expressed as:

Exact Basic Propagation

1. Choose a cluster C as the pivot cluster.
2. Send a petition to collect evidence to all the neighbor clusters of C.
3. Send a petition to distribute evidence to all the neighbor clusters of C.
4. Do the absorption of information for each cluster.

Note the following observations:

(1) The potentials of the clusters are updated only at the end of the algorithm. Only the potentials of the separators are updated during the running of the algorithm.

(2) At the end of the algorithm, if variable $X_l$ is in cluster $C$ and $\Psi_C$ is the updated potential of this cluster (including the incoming messages), then the value of expression (9) can be obtained by calculating,

$$\sum_{\substack{x \in \Omega_C \\ x^{\downarrow l} = x_l}} \Psi_C(x) \qquad (16)$$

(3) The messages that a cluster receives and sends through a separator are independent: no potential is used in the calculation of both messages. This does not happen for other algorithms. For example, the Hugin algorithm23 has no independent messages because it considers only one message for each edge and, so, the potential of each cluster is updated during the running of the algorithm. In fact, this independence of the messages is the main difference from the Hugin algorithm. When we use formula (13) to calculate the messages, then the computations can be arranged in a way similar to Hugin, achieving the same degree of efficiency. We do not enter into the details of how this can be achieved.
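The following short Python sketch is our own illustration of how Eqs. (10)-(13) and (16) operate on a junction tree with just two clusters and one separator (the cluster potentials are invented, and with a single separator the products over $Sep'$ are empty):

```python
# Two clusters C1 = {A, B} and C2 = {B, C} joined by the separator S = {B}.
# Cluster potentials (invented), stored as dicts keyed by value tuples.
psi1 = {(0, 0): 0.42, (0, 1): 0.18, (1, 0): 0.08, (1, 1): 0.32}   # psi_C1(a, b)
psi2 = {(0, 0): 0.90, (0, 1): 0.10, (1, 0): 0.50, (1, 1): 0.50}   # psi_C2(b, c)

def message(psi, sep_index):
    """Eqs. (11)-(12) with no other separators: sum the cluster potential onto S."""
    out = {}
    for config, value in psi.items():
        key = (config[sep_index],)
        out[key] = out.get(key, 0.0) + value
    return out

m_2_to_1 = message(psi2, sep_index=0)      # message sent by C2 through S (sums out C)
m_1_to_2 = message(psi1, sep_index=1)      # message sent by C1 through S (sums out A)

# Updated potential of C1 including its incoming message, Eq. (13),
# and the (unnormalized) marginal of B read off it as in Eq. (16).
Psi1 = {(a, b): v * m_2_to_1[(b,)] for (a, b), v in psi1.items()}
marginal_B = {}
for (a, b), v in Psi1.items():
    marginal_B[b] = marginal_B.get(b, 0.0) + v
total = sum(marginal_B.values())
print({b: v / total for b, v in marginal_B.items()})
```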

B. The Approximate Algorithm: Importance Sampling for Clusters

Here we are going to give a very simple description of the general class of importance sampling algorithms. Further details can be found in Refs. 18 and 25.

Assume that $h$ is a function from a set $\Omega$ into the non-negative reals and that we want to calculate $t = \sum_{x\in\Omega} h(x)$. This methodology consists in selecting a sample $(x^{(1)}, \ldots, x^{(k)})$ according to a probability distribution $P^*$ on $\Omega$ verifying that $P^*(x) > 0$ if $h(x) > 0$.

As we can express,

$$\sum_{x\in\Omega} h(x) = \sum_{x\in\Omega} \frac{h(x)}{P^*(x)}\, P^*(x) = E_{P^*}\!\left[\frac{h}{P^*}\right] \qquad (17)$$

then an unbiased estimation of $t$ can be obtained by calculating the average of the quantities $h(x^{(i)})/P^*(x^{(i)})$,

$$\hat{t} = \frac{1}{k} \sum_{i=1}^{k} \frac{h(x^{(i)})}{P^*(x^{(i)})} \qquad (18)$$

The quantity $h(x^{(i)})/P^*(x^{(i)})$ will be denoted as $\lambda_i$ and will be called the weight of $x^{(i)}$.

It can be shown that the variance of the estimation is minimum when $P^*$ is proportional to $h$; more specifically, when $P^* = h/t$. However, in general this is difficult and we should try to select a probability $P^*$ as close as possible to $h$. In fact, the variance of the estimation25 is given by the expression:

$$\mathrm{Var}(\hat{t}) = \sum_{x\in\Omega} \left( \frac{h(x)}{P^*(x)} - t \right)^2 P^*(x) = \left[ \sum_{x\in\Omega} \frac{h^2(x)}{P^*(x)} \right] - t^2 \qquad (19)$$

We can see importance sampling as a transformation of the mapping $h$ into the mapping $\hat{h}$ given by:

$$\hat{h}(x) = \frac{1}{k} \sum_{i:\, x^{(i)} = x} \frac{h(x^{(i)})}{P^*(x^{(i)})} \qquad (20)$$

Then the estimation of $t = \sum_{x\in\Omega} h(x)$ is given by $\hat{t} = \sum_{x\in\Omega} \hat{h}(x)$.

The advantage of $\hat{h}$ is that the number of elements $x \in \Omega$ for which $\hat{h}(x) \neq 0$ is at most $k$; then, if the number of elements of $\Omega$ is too great, we can always use a sparse representation of $\hat{h}$, representing only the nonzero elements, the size of the representation being of the same order as the size of the sample we are using.
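A minimal sketch of Eqs. (18) and (20), with invented numbers: we draw the sample from $P^*$, accumulate the weighted points into a sparse dictionary playing the role of $\hat{h}$, and read the estimator $\hat{t}$ off it as the sum of its values.

```python
import random

# Invented target function h on a small domain, and a sampling distribution P*
# satisfying P*(x) > 0 wherever h(x) > 0.
h      = {"a": 4.0, "b": 1.0, "c": 0.0, "d": 5.0}
p_star = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.2}

def importance_estimate(k, rng=random.Random(0)):
    domain, probs = zip(*p_star.items())
    h_hat = {}                                    # sparse representation of Eq. (20)
    for _ in range(k):
        x = rng.choices(domain, weights=probs)[0]
        weight = h[x] / p_star[x]                 # the weight lambda_i of Eq. (18)
        h_hat[x] = h_hat.get(x, 0.0) + weight / k
    return h_hat, sum(h_hat.values())             # (h_hat, t_hat)

h_hat, t_hat = importance_estimate(k=10000)
print(t_hat, "vs the exact value", sum(h.values()))
```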

Let $(x^{(1)}, \ldots, x^{(k)})$ be a sample used to estimate $t = \sum_{x\in\Omega} h(x)$ with weights $\lambda_i = h(x^{(i)})/P^*(x^{(i)})$ ($i = 1, \ldots, k$). If we want to estimate $f = \sum_{x\in\Omega} h(x)\cdot g(x)$, then this sample can also be used as a weighted estimator of $f$. We only have to change the weights from $\lambda_i$ to $\lambda_i \cdot g(x^{(i)})$. The problem is that $P^*$ can be very appropriate for estimating $t$ but not very good for estimating $f$. That is, $P^*$ can be almost proportional to $h$ but not to $h\cdot g$.

A special case is when $h$ is a function on $\Omega$, we have a partition of the set $\Omega$ into subsets $D_1, \ldots, D_l$, and we want to calculate the values,

$$t_j = \sum_{x\in D_j} h(x), \qquad j = 1, \ldots, l \qquad (21)$$

If $(x^{(1)}, \ldots, x^{(k)})$ is a sample that we have used to estimate $t = \sum_{x\in\Omega} h(x)$, then we can use this sample to estimate each one of the values $t_j$, following the procedure above with $g$ equal to the characteristic function of $D_j$ (with value 1 in this set and 0 outside). If $I_{D_j}$ is this characteristic function, then the estimation can be expressed as,

$$\hat{t}_j = \frac{1}{k} \sum_{i=1}^{k} \frac{h(x^{(i)})\cdot I_{D_j}(x^{(i)})}{P^*(x^{(i)})} = \frac{1}{k} \sum_{x^{(i)}\in D_j} \frac{h(x^{(i)})}{P^*(x^{(i)})} \qquad (22)$$

The variance of the estimator of $t_j$ is,

$$\mathrm{Var}(\hat{t}_j) = \left[ \sum_{x\in D_j} \frac{h^2(x)}{P^*(x)} \right] - t_j^2 \qquad (23)$$

Since $D_1, \ldots, D_l$ is a partition of $\Omega$, it is very easy to prove, from this expression, that:

$$\sum_{j=1}^{l} \mathrm{Var}(\hat{t}_j) = \mathrm{Var}(\hat{t}) + t^2 - \sum_{j=1}^{l} t_j^2 \qquad (24)$$

As $t = \sum_{j=1}^{l} t_j$, it is clear that $\sum_{j=1}^{l} \mathrm{Var}(\hat{t}_j) \geq \mathrm{Var}(\hat{t})$. The most important conclusion from Eq. (24) is that, if we want to use the same sample for the estimation of all the $t_j$, and we consider the sum of the variances as a measure of the goodness of the estimation, then this measure is the variance of the estimator $\hat{t}$ plus a constant value, independent of the sample. So, by the previous result, looking for good estimators for all the $t_j$ using a sample $(x^{(1)}, \ldots, x^{(k)})$ [see (22)] is equivalent to looking for good estimators for $t = \sum_{j=1}^{l} t_j$ using the sample $(x^{(1)}, \ldots, x^{(k)})$.

In our case, this procedure will be applied to carry out the calculations on a cluster in a propagation algorithm. Specifically, we will have a vector of variables, $X_N = (X_1, \ldots, X_n)$, and $C$ a subset of $N = \{1, \ldots, n\}$, $X_C$ being the variable† associated with $C$. We will have a set of potentials assigned to the subset $C$, $L_C = \{h_i, i = 1, \ldots, m\}$, in such a way that $\bigcup_{i=1}^{m} s(h_i) = C$. If we make $h = \prod_{i=1}^{m} h_i$, our objective will be to calculate a function $f$ defined for a set of variables $C' \subseteq C$ and given by,

$$f(x^{\downarrow C'}) = \sum_{x^{\downarrow C-C'}} h(x^{\downarrow C}) \qquad (25)$$

†Sometimes, when it is clear from the context, we shall not distinguish between a set of indices $C$ and its set of variables $X_C$.

The difficulty of this calculation arises from the fact that sometimes $C$ is too big for a direct calculation of $h$ to be possible. In the following we propose an importance sampling algorithm designed to estimate $f$.

If $\Omega_C$ is the space on which $h$ is defined, we can think about the calculation of $f$ as the calculation of different values $t_1, \ldots, t_l$, where each $t_j$ is the sum of $h$ on a subset, $D_j$, of a partition of $\Omega_C$. That is, $t_j$ is given by

$$t_j = \sum_{x\in D_j} h(x^{\downarrow C}) \qquad (26)$$

If we consider that each one of the elements $x_{C'} \in \Omega_{C'}$ is in a set $D_j$, then $D_j$ is given by the set of all the $x$ in $\Omega_C$ such that $x^{\downarrow C'} = x_{C'}$. In fact, if $\Omega_{C'}$ has exactly $l$ values, $D_j$ is given by $D_j = \{x^j_{C'}\} \times \Omega_{C-C'}$, where $x^j_{C'}$ is the $j$th value of $X_{C'}$ and $j = 1, \ldots, l$. That is, (26) can be rewritten as:

$$t_j = \sum_{\substack{x_C \in \Omega_C \\ x_C^{\downarrow C'} = x^j_{C'}}} h(x_C) \qquad (27)$$

In this way, an estimator for $f$ can be obtained by means of a sample $(x^{(1)}, \ldots, x^{(k)})$ on $\Omega_C$ following a distribution $P^*$ and transforming $h$ into $\hat{h}$ according to Eq. (20). $\hat{h}$ is a more manageable mapping because it has a limited number of elements different from 0.

The sampler we are going to describe was introduced by Cano et al.18 It obtains a sample on $\Omega_C$ based on the fact that $h$ can be decomposed as a product of the functions in $L_C$. The idea is the following:

● It takes the functions $h_i \in L_C$. Each time it takes a function, it obtains a value for the variables $s(h_i)$, $x_{s(h_i)}$, for which this function is defined. This is carried out by normalizing $h_i$, that is, by considering $Q(h_i)$. After this, the values obtained for these variables are introduced in the remaining functions, $h_j$, in $L_C$, by calculating $R_{s(h_i)}(h_j, x_{s(h_i)})$. This selection continues until we have a value for each variable in $C$.

● The functions that have not been selected for simulation are evaluated to calculate the weight. This weight also depends on the normalization factors, $q(h_i)$, of each function $h_i$ used to simulate a value.

The first function we choose is used to simulate. The last one is used only to calculate the weight. One part of the intermediate functions is used to obtain values for the variables. The other part, whose variables have already been simulated (and which therefore reduce to a single value), is used to calculate the weight. The intuitive idea is as follows: if we want to sample with respect to a probability which is proportional to a product of simple functions, and the product is very complicated to manage (it is defined on a very large Cartesian product), then we use only some of the functions we are multiplying to make the simulation, with the idea of using a function similar to the complete product. As we explain later, to achieve this, we should choose the most informative functions for the simulation. The other functions should be used to weight the sample and compensate for the differences between the desired distribution and the one actually used. These functions should be the least informative ones. Further details of the algorithm and a proof of the fact that the weights are correct can be found in Cano et al.18

In algorithmic terminology, this sampler can be expressed in the following way:

Importance Sampler

1. For $j = 1$ to $k$ do:
   (a) Set $\lambda^{(j)}_{C'} = 1.0$
   (b) While there is a variable in $C$ which has not been simulated, do:
       i. Choose a function $h_i$ from the set $L_C$
       ii. Set $L_C := L_C - \{h_i\}$
       iii. Set $\lambda^{(j)}_{C'} := \lambda^{(j)}_{C'} \cdot q(h_i)$
       iv. Simulate a value $x_{s(h_i)}$ according to
           $$Q(h_i)(x^{\downarrow s(h_i)}) = \frac{h_i(x^{\downarrow s(h_i)})}{q(h_i)}$$
       v. Reduce all the remaining functions $h_j \in L_C$ by the value $x_{s(h_i)}$, that is, redefine $L_C$ by:
           $$L_C := \{R_{s(h_i)}(h_j, x_{s(h_i)}) \mid h_j \in L_C\}$$
   (c) While $L_C \neq \emptyset$ do:
       i. Choose the next function $h_i$ from the set $L_C$
       ii. Set $L_C := L_C - \{h_i\}$
       iii. Set $\lambda^{(j)}_{C'} := \lambda^{(j)}_{C'} \cdot h_i(x^{\downarrow \emptyset})$

2. Define

$$\hat{h}(x^{\downarrow C'}) = \sum_{i:\, x^{(i)}_{C'} = x^{\downarrow C'}} \frac{\lambda^{(i)}_{C'}}{k}, \qquad \forall x^{\downarrow C'} \in \Omega_{C'}$$

This sampling procedure will be denoted as $\phi$. It accepts as input the sets $C'$ and $C$, the family of functions $L_C$, and the size of the sample $k$. The output is the function $\hat{h} = \phi(C', C, L_C, k)$.

This sampler is very general. Different procedures, including some of the best-known global simulation procedures, can be obtained with different ways of selecting the next function $h_i$ to simulate in step (b).i of the algorithm.19

The functions chosen at the beginning are used to simulate and the functions at the end are used to weight. If we have a function which fixes the values of its variables to a specific value, then it will have a lot of zeros and it is not a good function to weight: we would obtain a value of zero for many weights, and those samples would be useless for the estimation. On the contrary, an uninformative function, assigning the same value to all the cases, can be left for last without any problem: it does not introduce variation in the weights. Taking this idea as a basis, we have introduced18,19 the following selection criterion: choose to simulate the functions with less entropy (i.e., the functions with more information). Thus, in step (b).i, the next selected function $h_i$ is the one verifying the condition:


$$E(h_i) = \min\{E(h_j) \mid h_j \in L_C\}$$

where $E(h_j) = -\sum_{x\in\Omega_{s(h_j)}} Q(h_j)(x) \ln Q(h_j)(x)$.

This criterion was shown18,19 to present a very good performance.
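The following Python sketch is our own compact rendering of the sampler $\phi(C', C, L_C, k)$ above, with the minimum-entropy selection rule; the dictionary representation of potentials and the final two-function example are invented for illustration.

```python
import math
import random

# A potential is a pair (vars, table): vars is a tuple of variable names s(h),
# and table maps a tuple of values (one per variable) to a non-negative real.

def q(h):
    """q(h): the sum of all the values of h."""
    return sum(h[1].values())

def entropy(h):
    """Entropy of Q(h), used by the minimum-entropy selection criterion."""
    total = q(h)
    return -sum((v / total) * math.log(v / total) for v in h[1].values() if v > 0)

def reduce_potential(h, assignment):
    """Fix every already-simulated variable appearing in h to its value in `assignment`."""
    vars_, table = h
    keep = tuple(v for v in vars_ if v not in assignment)
    out = {}
    for config, value in table.items():
        full = dict(zip(vars_, config))
        if all(full[v] == assignment[v] for v in vars_ if v in assignment):
            key = tuple(full[v] for v in keep)
            out[key] = out.get(key, 0.0) + value
    return keep, out

def sampler(c_prime, functions, k, rng=random.Random(0)):
    """phi(C', C, L_C, k): sparse estimation of f(x_{C'}) = sum over C - C' of prod(L_C)."""
    h_hat = {}
    for _ in range(k):
        pending = [(vars_, dict(table)) for vars_, table in functions]
        assignment, weight = {}, 1.0
        while pending:
            pending.sort(key=entropy)                 # minimum-entropy function first
            vars_, table = pending.pop(0)
            if vars_:                                 # free variables left: simulate from Q(h_i)
                total = q((vars_, table))
                weight *= total                       # accumulate q(h_i) into the weight
                configs = list(table)
                chosen = rng.choices(configs, weights=[table[c] for c in configs])[0]
                assignment.update(zip(vars_, chosen))
                pending = [reduce_potential(p, assignment) for p in pending]
            else:                                     # already a constant: only weights the sample
                weight *= table[()]
        key = tuple(assignment[v] for v in c_prime)
        h_hat[key] = h_hat.get(key, 0.0) + weight / k
    return h_hat

# Invented example: two potentials on C = {X, Y}, target set C' = {X}.
h1 = (("X",), {(0,): 0.3, (1,): 0.7})
h2 = (("X", "Y"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6})
print(sampler(("X",), [h1, h2], k=20000))   # close to the exact values {(0,): 0.3, (1,): 0.7}
```

Replacing the `pending.sort` step by other selection rules recovers different global simulation schemes, as noted above.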

C. The Hybridization: ED Clusters and N-mc Clusters

Here we introduce a hybrid algorithm to propagate probabilities. It is based on the methodology of Dawid et al.21 As was said in the introduction, the hybridization of the two methods is based on the application of an exact methodology for small clusters (the Shafer/Shenoy exact method, with modifications in some cases) and an approximate methodology for big clusters (the importance sampler).

After calculating a junction tree for the model, each cluster $C$ has a vector of variables $X_C$ and a potential $\psi_C$ which is defined by the product of a set of functions $L_C = \{f_1, f_2, \ldots, f_l\}$. The following situations can be considered:‡

(1) If the size of $\Omega_C$ is moderate, then we carry out an exact calculation

$$\psi_C(x_C) = \prod_{f_i\in L_C} f_i(x_C^{\downarrow s(f_i)})$$

We will say that the cluster works exactly and that the type of the cluster is Exact Discrete (ED).

(2) If the size of $\Omega_C$ is not moderate, then the different messages are calculated by using an importance sampler. The sampler can be applied directly to the family $L_C$ or to this family including some or all of the incoming messages. Depending on the case, we will obtain different versions of the algorithm. In this situation we shall say that the cluster works approximately and that the type of the cluster is Numeric Monte Carlo (N-mc).

In what follows we describe the different algorithms depending on the family to which we apply the importance sampler. In all of them, if a separator is not of a moderate size, then the clusters it joins are not either, and we take the necessary sparse representation of the potentials in such clusters in order to obtain a sparse representation in the separator.

1. Estimating Potentials

This algorithm considers for each cluster $C$ the set $L_C$ equal to all the functions $f_i$ which are associated with $C$. Then we carry out an importance sampling estimation of the cluster potential, $\hat{\psi}_C = \phi(C, C, L_C, k)$. This potential replaces the original potential $\psi_C$ and all the subsequent calculations are done as if this cluster were Exact Discrete. This sampling is carried out before any propagation.

The algorithm may be written as:

‡Note that we follow the terminology of Dawid et al.21


Estimating Potentials

1. For each N-mc cluster, $C$, do:
   (a) Run the importance sampler $\hat{\psi}_C = \phi(C, C, L_C, k)$, where $L_C$ is the set of functions associated with $C$.
   (b) Define $\psi_C := \hat{\psi}_C$.
2. Do the basic propagation.

The absorption in step 2 is done as in (15). Note that the absorption according to (15) is equivalent to running a sampler with $L_C$ equal to all the functions $f_i$ in the cluster plus all the input messages for the considered cluster, but where the sampler first uses the functions $f_i$ in the original $L_C$ in order to sample and then the messages to calculate the weights.

This algorithm carries out only one sampling for each N-mc cluster, which is its main advantage. But to obtain a sample in a cluster it only uses the a priori functions of the model in this cluster, which is its main disadvantage. We want to estimate the a posteriori probability of a variable, which is proportional to the product of all the potentials, as expressed in (6).

In fact, we can obtain configurations which end up with a final weight of zero when we include the input messages. The ideal thing would be for the algorithm to focus only on possible configurations, but this is impossible without doing some kind of propagation.

2. Estimating Messages

These algorithms do not replace the initial potential of the N-mc cluster by an estimated potential. They operate at the message level. More specifically, suppose an N-mc cluster $C$ which is active to calculate the output message $M^{C\rightarrow}_S$. Then this is carried out by means of an importance sampling algorithm which is applied to the family of functions $L'_C$ composed of the functions in $L_C$ plus all the input messages from the separators other than $S$. Then, instead of sending $M^{C\rightarrow}_S$, we send $\hat{M}^{C\rightarrow}_S = \phi(S, C, L'_C, k)$.

In detail, the algorithm is as follows:

Estimating Messages

Run the basic algorithm, but if an N-mc cluster $C$ sends a message $M^{C\rightarrow}_S$ through $S$, do:

1. Let $L_C$ be the set of functions associated with cluster $C$.
2. Define
$$L'_C := L_C \cup \left[ \bigcup_{S'\in Sep'(C)} M^{C\leftarrow}_{S'} \right]$$
3. Apply the importance sampler $\hat{M}^{C\rightarrow}_S = \phi(S, C, L'_C, k)$.
4. Define
$$M^{C\rightarrow}_S(x_S) := \hat{M}^{C\rightarrow}_S(x_S)$$


In this case, the absorption in $C$ is done by running a sampler on $C$ with $L'_C$ equal to $L_C$ plus all the input messages $M^{C\leftarrow}_S$.

In this algorithm, the estimation of an output message considers not only the a priori information in the cluster but also the processed information which reaches the cluster from other parts of the junction tree through the separators. This represents a clear advantage over the former algorithm, because we are considering more information in order to obtain final constant weights. Nevertheless, it needs more computational time in order to run a greater number of samplings (one for each output message), which is its main disadvantage.

3. Estimating Potentials and Messages

The second hybrid algorithm is very appropriate for estimating the messages, but this is not our final objective. Our aim is to estimate the product of all the initial potentials marginalized on a given variable [see expression (7)]. So we should try to obtain constant weights involving all the potentials in the problem. The following example gives us an idea of the problem.

Example 3.1. Let $C$ and $D$ be two clusters with a common separator $S = \{X, Y\}$, with $\Omega_1 = \{x_1, x_2\}$, $\Omega_2 = \{y_1, y_2\}$, and assume that we want to send a message from $C$ to $D$. If $C$ is N-mc, by the estimating messages algorithm, an estimation of $M^{C\rightarrow}_S$ will have the general expression:

$$\hat{M}^{C\rightarrow}_S(x, y) = \begin{cases} \varepsilon & \text{if } (x, y) = (x_1, y_1) \\ \delta & \text{if } (x, y) = (x_1, y_2) \\ 1 - \varepsilon & \text{if } (x, y) = (x_2, y_1) \\ 1 - \delta & \text{if } (x, y) = (x_2, y_2) \end{cases}$$

But if the message from $D$ to $C$ is

$$M^{C\leftarrow}_S(x, y) = \begin{cases} 1, & \text{if } (x, y) = (x_1, y_1) \\ 1, & \text{if } (x, y) = (x_1, y_2) \\ 0, & \text{if } (x, y) = (x_2, y_1) \\ 0, & \text{if } (x, y) = (x_2, y_2) \end{cases}$$

and $\varepsilon$ and $\delta$ are small, we have a problem. Most of the effort (most of the weight) of $\hat{M}^{C\rightarrow}_S(x, y)$ has been devoted to the configurations for which $X = x_2$, and these configurations are impossible. They are going to receive a weight of 0 when $M^{C\leftarrow}_S$ is considered. It would have been better if from the beginning we had concentrated on the estimation of the values $X = x_1$, which in the end are the only possible ones.

The problem with the example above can be solved if we carry out an importance sampling with a new family of functions $L''_C$ including the original functions in $L_C$ and all the input messages. Thus, we can consider all the input messages of $C$ in order to estimate $\Psi_C$ according to expression (13), $\hat{\Psi}_C = \phi(C, C, L''_C, k)$. Later we may use expression (14) to estimate the output messages, where $\Psi_C$ is replaced by the estimated potential $\hat{\Psi}_C$. Furthermore, this is a more efficient method for calculating the output messages, because we only need to run one importance sampler to estimate $\Psi_C$, and with this estimation we can calculate all the output messages.

Figure 3. An example of fusion of clusters.

Note that in expression (14) the division by 0 is not a problem because $0/0 = 0$. A configuration $x$ such that $M^{C\leftarrow}_S(x^{\downarrow S}) = 0$ will always obtain a weight equal to 0, because this message is included in the list of functions of the importance sampler, and then $\hat{\Psi}_C(x) = 0$.
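A small sketch, with invented numbers, of how the output messages are read off the estimated potential via expressions (13)-(14) and the $0/0 = 0$ convention; in the real algorithm the sparse table $\hat{\Psi}_C$ would come from the importance sampler with the incoming message already included in $L''_C$:

```python
# Sparse estimated potential on cluster C = {A, B}, keyed by (a, b); configurations
# never sampled (or killed by the incoming message) are simply absent.
psi_hat_C = {(0, 1): 0.31, (1, 1): 0.44}          # invented values
m_in      = {(0,): 0.0, (1,): 0.5}                # incoming message on S = {B}; B = 0 impossible

def outgoing_message(psi_hat, incoming, sep_index):
    """Eq. (14): sum the estimated potential onto S and divide by the incoming message."""
    summed = {}
    for config, value in psi_hat.items():
        key = (config[sep_index],)
        summed[key] = summed.get(key, 0.0) + value
    out = {}
    for s, m in incoming.items():
        num = summed.get(s, 0.0)
        # 0/0 = 0; num > 0 with m = 0 cannot happen because the incoming message
        # was part of the sampler's function list, so such configurations got weight 0.
        out[s] = 0.0 if num == 0.0 else num / m
    return out

print(outgoing_message(psi_hat_C, m_in, sep_index=1))   # {(0,): 0.0, (1,): 1.5}
```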

This algorithm considers all the input messages in order to obtain more appropriate output messages. One problem is that, in the beginning, we do not have very good input messages, so the output messages cannot take advantage of this. A possibility which is not considered in this article is to repeat the basic propagation algorithm twice. In the second repetition, we always have estimations of the input messages and we can improve the estimation of the output ones. However, it is not clear to us whether, for a fixed computation time, it is always better to split the effort into two repetitions. The main difficulty is that, once we have obtained an estimation of 0 for a configuration of a potential or message, it is impossible to recover an original value which was different from 0. So, if we make an erroneous 0 estimation with a smaller sample, then this 0 never changes, regardless of the effort we make in the future. Thus, repeating the basic algorithm twice (or more times) has a tendency to introduce more 0 values than concentrating the effort on one repetition. The optimal strategy is outside the scope of this article and will be considered in future studies.

There is some information which should be added to the N-mc clusters without additional computational cost. This information consists of the evidence or observations we have for a particular case of the problem. These observations are represented by means of Dirac delta functions, which are idempotent: we can introduce the information of an observed variable in all the clusters in which this variable is included.

The exact algorithms based on ED clusters (including the Shafer/Shenoy algorithm) determine that each potential in the problem is assigned to only one cluster, and it is propagated by the messages through the junction tree. But if we introduce the potential associated with an observation about a variable in one cluster, and this variable appears in another cluster which is simulated earlier, then in this simulation we can obtain configurations which do not correspond to the observed values, so that they finally obtain a weight of 0 and are useless. This problem is avoided if the observations are introduced in all the clusters in which the variable appears. This can be done because Dirac delta functions are idempotent. If we follow the entropy criterion, Dirac delta functions have minimum entropy and are chosen to simulate before other functions, fixing the value of the corresponding variable to the observed value.

D. Joining Adjacent N-mc Clusters

Suppose $C$, $D$ are two adjacent N-mc clusters (that is, both have a big dimension). Imagine a chain of message passing reaching $C$ and then going to $D$. We do a simulation in $C$ and estimate the message to $D$ (this message will be, in most cases, too big and we will need a sparse representation of it). This message is received by $D$ in order to make an estimation of its outgoing messages. We could think of this process as a global simulation procedure in which we are forced to simulate, in the first place, with the potentials in $C$ and, later, with the potentials in $D$. Would it not be a better selection procedure if we allowed the potentials to be selected in $C$ and $D$ without any priority? This is achieved by joining clusters $C$ and $D$ and assigning to this new cluster the potentials in $C$ and the potentials in $D$.

There is no problem in carrying out this union of clusters because we obtain a new junction tree:

PROPOSITION 3.1. Let $\Delta$ be a junction tree with two adjacent clusters, $C_1$ and $C_2$, with nonempty intersection; then the tree $\Delta'$, constructed according to the following rules, is also a junction tree.

1. The nodes of the new tree are the same as in $\Delta$, except that clusters $C_1$ and $C_2$ are replaced by a cluster $C$ which has, as its set of variables, the union of the variables in $C_1$ and $C_2$.

2. In the new tree, the set of neighboring clusters of $C$ is the union of the neighbors of $C_1$ and $C_2$. All other links not involving $C$ are as in $\Delta$.

3. In the new tree, the set of functions associated with $C$ is the union of the functions associated with $C_1$ and $C_2$. All other nodes have the same set of associated functions.

Proof. The proof is based on the fact that the graph so constructed is a tree and on checking the two conditions defining a junction tree:

(1) Any variable $X$ in two clusters $U_1$ and $U_2$ is also in the clusters of the path joining them.

(2) Each potential is assigned to one cluster.

Condition (2) is evident from the construction we have done. In order to show condition (1), we consider a variable $X$ which is included in the clusters $U_1$ and $U_2$ of the new tree $\Delta'$. Then we can have two situations:

(1) $C$ is not on the path going from $U_1$ to $U_2$.
(2) $C$ is on the path.

In the first case, the same path exists in the old tree and $X$ must belong to all the clusters on the path. In the second case, if $[U_1, \ldots, C, \ldots, U_2]$ is the path, we can have two situations:

(a) The path going through $C$ uses only links corresponding to one of the clusters, $C_1$ (or $C_2$). In this case, substituting $C_1$ (or $C_2$) for $C$ we obtain a path $[U_1, \ldots, C_1, \ldots, U_2]$ in the old tree, and $X$ belongs to all the clusters on this path, and therefore to all the clusters on $[U_1, \ldots, C, \ldots, U_2]$.

(b) The path going through $C$ uses a link corresponding to $C_1$ and another corresponding to $C_2$. For example, it arrives with one from $C_1$ and it leaves with one from $C_2$; then changing $C$ for $C_1, C_2$ gives a path in the old tree: $[U_1, \ldots, C_1, C_2, \ldots, U_2]$. Thus $X$ belongs to all the clusters on this path, and therefore to all the clusters on $[U_1, \ldots, C, \ldots, U_2]$. ∎

Joining two clusters will allow us to apply a criterion such as minimum entropy globally to all the functions included in both clusters.
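A minimal sketch of the fusion operation of Proposition 3.1 on a junction tree stored as plain dictionaries; the representation and the tiny three-cluster example are our own, not taken from the paper:

```python
# A junction tree as dictionaries: variables, neighbors, and assigned functions per cluster.
variables = {"C1": {"A", "B"}, "C2": {"B", "C"}, "C3": {"C", "D"}}
neighbors = {"C1": {"C2"}, "C2": {"C1", "C3"}, "C3": {"C2"}}
functions = {"C1": ["f_A", "f_B"], "C2": [], "C3": ["f_C", "f_D"]}

def fuse(c1, c2, new_name):
    """Proposition 3.1: replace the adjacent clusters c1 and c2 by their union."""
    assert c2 in neighbors[c1] and variables[c1] & variables[c2]
    variables[new_name] = variables.pop(c1) | variables.pop(c2)      # rule 1
    merged = (neighbors.pop(c1) | neighbors.pop(c2)) - {c1, c2}      # rule 2
    neighbors[new_name] = merged
    for other in merged:
        neighbors[other] = (neighbors[other] - {c1, c2}) | {new_name}
    functions[new_name] = functions.pop(c1) + functions.pop(c2)      # rule 3

fuse("C2", "C3", "C23")   # e.g., absorb the cluster with no associated functions
print(variables, neighbors, functions, sep="\n")
```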

Another situation in which it makes sense to fuse clusters is the following. It is possible that some clusters have no associated potential. To make the calculations, such a cluster is considered to have a potential equal to 1 for all the configurations. Then, the fusion of this cluster with some adjacent cluster will not increase the dimension of the clusters excessively (at least not the size of the total potential).


In this way we can minimize the use of these artificially added uniform potentials. For example, the junction tree in Figure 3(b) has the potentials of the following table:

Cluster   Variables      Associated Functions
C1        B2, C1, D1     f_{D1}
C2        B1, B2, C1     f_{C1}
C3        A, B1, B2      f_A, f_{B1}, f_{B2}
C4        B2, D1, D2     nothing
C5        B2, C2, D2     f_{C2}, f_{D2}
C6        D1, D2, E      f_E

where $f_X$ is the conditional distribution of the variable $X$ in the belief network [Fig. 3(a)]. But if we join the clusters C4 and C5, then all the clusters will have relevant information [Fig. 3(c)].

IV. EXPERIMENTAL EVALUATION

In this section, we test the performance of the new family of hybrid methods. We have evaluated the three main algorithms, with and without fusion, comparing their behavior with that of the likelihood weighting algorithm, one of the known techniques giving the best results, and the one appearing in most of the reported experiments.17,18

The graph we use is semirandomly generated. We do not claim that the results of the experiment are valid for every situation. For that we should carry out more extensive experiments. However, this test can give us an idea of the performance of the algorithms. In general, given the complexity of the problem we are addressing, we should not expect a single algorithm to be the best one in all cases. The performance of an algorithm will depend on the structure of the problem we are solving. So with this test we only show that the algorithms presented in this article are competitive in the solution of some difficult cases. The experiment will also be valuable for comparing the three algorithms introduced.

In the results of the experiment, we give the run time of the algorithms. This time has to be considered as indicative only. The implementation of the algorithms can be optimized, significantly reducing the time.

A. Experiments and Selection of the Graph

We have constructed a network with 58 variables. All the variables are discrete. Fifty variables have been randomly generated and 8 variables have been inserted by hand. We intended to build a graph with a heterogeneous structure, that is, with different aspects in different regions of the graph. For this purpose, a graph with 50 variables is built as the union of two networks which were randomly generated and joined by a link between two randomly selected nodes (one in each network). The first subgraph has 29 variables and has been generated as follows. The number of cases of each variable is chosen according to a Poisson distribution with mean equal to 2.5 (taking at least two cases per variable). The structure of the first network is determined by incorporating the variables into the network in a given order. When a variable is added, the number of parents is selected according to a Poisson distribution with mean equal to 3.0. Then, the parents are selected randomly among the variables incorporated previously. The result is a network with a complicated structure. The conditional probability distribution of each variable (given a configuration of the parents) is defined by distributing a total mass of 1 among the cases of the variable, proportionally to a uniform random number assigned to each one of the cases. The second graph is constructed in exactly the same way, but the number of parents is selected according to a Poisson distribution with mean equal to 1.0. The result is a network with a structure similar to a polytree. The 8 remaining variables are those of the network in Figure 3. This subgraph has been linked to the aforementioned networks by means of a single link, and in it all the variables have two cases. The resulting graph can be seen in Figure 4.

Figure 4. The graph used in the experiments.

We have carried out two experiments. In the first one, no observed variable has been considered; in the second one, we have selected 10 variables and an observation on each one of them was made. Both experiments have been repeated 100 times under the same conditions in order to have a more precise estimation of the error.


The algorithms we have considered are the following:

lw: Likelihood weighting algorithm.
Ep: Estimating potentials algorithm.
Em: Estimating messages algorithm.
Epm: Estimating potentials and messages. This is the algorithm in which the outgoing messages are estimated taking into account all the incoming messages.

A cluster is considered to be N-mc if the number of variables of the cluster is greater than the average number of variables of the clusters in the junction tree, and it is considered to be ED otherwise. The size of the sample in N-mc clusters is 3000.

All the algorithms have been implemented in the C language on a SUN workstation.

B. Measuring the Error

For each run of each algorithm, and for each experiment, we have calculated the average time in seconds and the error in the estimation of the probabilities. For each variable $X_i$, the considered error is:18,26

$$G(X_i) = \sqrt{ \frac{1}{|\Omega_i|} \sum_{x_i\in\Omega_i} \frac{(\hat{p}(x_i \mid x_E) - p(x_i \mid x_E))^2}{p(x_i \mid x_E)\,(1 - p(x_i \mid x_E))} } \qquad (28)$$

where $p(x_i \mid x_E)$ is the true probability and $\hat{p}(x_i \mid x_E)$ is the estimated one.

For the set of variables $X_I = (X_i)_{i\in I}$, the error of the estimation is:

$$G(X_I) = \sqrt{ \sum_{i\in I} G(X_i)^2 } \qquad (29)$$

Table I shows the average error and time for each algorithm in the first experiment. Table II shows the same data for the second experiment.

C. Results Evaluation

In short, we can emphasize the following aspects. In general, fusion is not recommended because it increases both the time and the errors. We do not have an explanation for this fact: we expected the time to increase, but we do not know why the error is worse. Perhaps the use of fusion is valid under certain conditions not present in this particular case.

Table I. Average error and time in the first experiment.

Hybrid            No Fusion              Fusion
Algorithm     Error       Time       Error       Time
Ep            0.138330    31 s       0.168897    34 s
Em            0.122363    89 s       0.158248    111 s
Epm           0.126511    58 s       0.166122    71 s

              Error       Time
lw            0.171401    22 s


Table II. Average error and time in the second experiment.

Hybrid            No Fusion              Fusion
Algorithm     Error       Time       Error       Time
Ep            0.043158    33 s       0.097190    37 s
Em            0.024503    94 s       0.069180    116 s
Epm           0.009781    62 s       0.066171    72 s

              Error       Time
lw            0.117951    24 s

The estimating potentials algorithm is the best when there is no evidence in the network (i.e., the first experiment). In fact, the errors are similar for all the hybrid algorithms, but this algorithm is the fastest one.

In general, the hybrid algorithms are better than the likelihood weighting algorithm. The differences in error are bigger when we have observations.

The estimating messages algorithm is slightly better than estimating potentials and messages when there is no evidence. The reverse is true in the second experiment, in which we have observations. As a general rule, estimating potentials and messages is a good choice of algorithm.

The differences in time are not very important. In the worst case, estimating potentials and messages takes only three times the time of likelihood weighting. This difference can depend on the graphs we are using: in a graph in which only a small part is very complicated, hybrid algorithms can be faster than likelihood weighting (exact calculation is faster than Monte Carlo when there is a small number of variables).

V. CONCLUSIONS

In this article we have introduced new algorithms based on the hybrid methodology of Dawid et al.21 Until now, only one hybrid method had been developed: HUGS. In that scheme, Kjærulff uses the exact Hugin method and the Gibbs sampler.22 Here we use the exact method of Shafer/Shenoy, with some modifications, and the importance sampler developed by Cano, Hernández, and Moral.18,19

We have proposed several variants of the algorithm and they have been contrasted with the likelihood weighting simulation scheme. In general, the so-called estimating potentials and messages algorithm shows a very promising performance. Comparisons with Kjærulff's hybrid methodology remain to be done. In general, we should expect our technique to be more efficient than simple Gibbs algorithms (likelihood weighting is better in general); however, Kjærulff uses a modification of Gibbs sampling called blocking Gibbs. The performance of the resulting hybrid algorithm depends on the selection of the appropriate blocking strategy.

In the future, we think that the use of hybrid algorithms will not be limited to one exact algorithm and one Monte Carlo algorithm. Different algorithms can carry out the propagation of the information in different parts of the graph. The most interesting topic, and the most difficult one at the same time, will be the study of rules allowing us to select the most appropriate type of algorithm for each cluster in the tree, taking into account the characteristics of the potentials assigned to that group of variables.

More extensive experiments, involving more strategies (such as splitting the effort into several repetitions of the algorithm) and a greater variety of graphs, are also necessary to be able to assess the performance of the algorithms.

This work has been supported by the Commission of the European Communities under ESPRIT III BRA 6156: DRUMS 2. We are very grateful to Jose E. Cano for his help in the implementation of the algorithms presented in this article and to the referees for their valuable and useful suggestions.

References

1. E.H. Shortliffe, Computer Based Medical Consultation: MYCIN, Elsevier, New York, 1976.

2. R.O. Duda, P.E. Hart, and N.J. Nilsson, "Subjective Bayesian methods for rule based inference systems," Proceedings of the National Computer Conference (AFIPS) 45, 1976, pp. 1075–1082.

3. J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan & Kaufmann, San Mateo, 1988.

4. S.L. Lauritzen and D.J. Spiegelhalter, "Local computations with probabilities on graphical structures and their application to expert systems," J. Royal Statistical Society, Series B, 50, 157–224 (1988).

5. R.D. Shachter, "Probabilistic inference and influence diagrams," Operations Research, 36, 589–604 (1988).

6. P.P. Shenoy, "A valuation-based language for expert systems," Int. J. Approx. Reasoning, 3, 383–411 (1989).

7. B. D'Ambrosio, Symbolic Probabilistic Inference in Belief Nets, Technical Report, Oregon State University, 1989.

8. G. Shafer and P.P. Shenoy, "Probability propagation," Annals Math. Artif. Intell., 2, 337–351 (1990).

9. R.D. Shachter, S.K. Andersen, and P. Szolovits, The Equivalence of Exact Methods for Probabilistic Inference on Belief Networks, Technical Report, Department of Engineering-Economic Systems, Stanford University, 1991.

10. G.F. Cooper, "The computational complexity of probabilistic inference using Bayesian belief networks," Artif. Intell., 42, 393–405 (1990).

11. P. Dagum and M. Luby, "Approximating probabilistic inference in Bayesian belief networks is NP-hard," Artif. Intell., 60, 141–153 (1993).

12. J. Pearl, "Evidential reasoning using stochastic simulation of causal models," Artif. Intell., 32, 247–257 (1987).

13. H.L. Chin and G.F. Cooper, "Bayesian belief network inference using simulation," in Uncertainty in Artificial Intelligence 3, L.N. Kanal, T.S. Levitt, and J.F. Lemmer, Eds., North-Holland, Amsterdam, 1989, pp. 129–148.

14. C.S. Jensen, A. Kong, and U. Kjærulff, "Blocking Gibbs sampling in very large probabilistic expert systems," Report R-93-2031, Institute for Electronic Systems, Aalborg University, October 1993.

15. M. Henrion, "Propagation of uncertainty by probabilistic logic sampling in Bayes networks," in Uncertainty in Artificial Intelligence 2, J. Lemmer and L.N. Kanal, Eds., North-Holland, Amsterdam, 1988, pp. 149–164.


16. R. Fung and K. Chang, "Weighing and integrating evidence for stochastic simulation in Bayesian networks," in Uncertainty in Artificial Intelligence 5, M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer, Eds., North-Holland, Amsterdam, 1990, pp. 209–219.

17. R.D. Shachter and M.A. Peot, "Simulation approaches to general probabilistic inference on belief networks," in Uncertainty in Artificial Intelligence 5, M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer, Eds., North-Holland, Amsterdam, 1990, pp. 221–231.

18. J.E. Cano, L.D. Hernández, and S. Moral, "Importance sampling algorithms for belief networks," Int. J. Approx. Reasoning, 15, 77–92 (1996).

19. L.D. Hernández, "Diseño y validación de nuevos algoritmos para el tratamiento de grafos de dependencias," Ph.D. Thesis, Dpto. de C.C. e I.A., Facultad de Ciencias, Universidad de Granada, 1995 (in Spanish).

20. R. Fung and B. Del Favero, "Backward simulation in Bayesian networks," Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, R. López de Mántaras and D. Poole, Eds., Morgan & Kaufmann, San Mateo, 1994, pp. 227–234.

21. A.P. Dawid, U. Kjærulff, and S.L. Lauritzen, "Hybrid propagation in junction trees," in Advances in Intelligent Computing, B.B. Bouchon, R. Yager, and L.A. Zadeh, Eds., Springer-Verlag, Berlin, 1995, pp. 87–97.

22. U. Kjærulff, "HUGS: Combining exact inference and Gibbs sampling in junction trees," Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Ph. Besnard and S. Hanks, Eds., Morgan & Kaufmann, San Mateo, 1995, pp. 368–375.

23. F. Jensen, Implementation Aspects of Various Propagation Algorithms in Hugin, Technical Report R-94-2014, Department of Mathematics and Computer Science, Institute for Electronic Systems, Aalborg University, March 1994.

24. F.V. Jensen, S.L. Lauritzen, and K.G. Olesen, "Bayesian updating in causal probabilistic networks by local computations," Computational Statistics Quarterly, 4, 269–282 (1990).

25. R.Y. Rubinstein, Simulation and the Monte Carlo Method, Wiley, New York, 1981.

26. K.W. Fertig and N.R. Mann, "An accurate approximation to the sampling distribution of the studentized extreme-valued statistic," Technometrics, 22, 83–90 (1980).