Updating with incomplete observations (UAI-2003)

Marco Zaffalon
“Dalle Molle” Institute for Artificial Intelligence (IDSIA), SWITZERLAND
http://www.idsia.ch/~zaffalon
[email protected]

Gert de Cooman
SYSTeMS research group, BELGIUM
http://ippserv.ugent.be/~gert
[email protected]

What are incomplete observations?

A simple example
  C (class) and A (attribute) are Boolean random variables
  C = 1 is the presence of a disease
  A = 1 is the positive result of a medical test

Let us do diagnosis
  Good point: you know that
    p(C = 0, A = 0) = 0.99
    p(C = 1, A = 1) = 0.01
    Whence p(C = 0 | A = a) allows you to make a sure diagnosis
  Bad point: the test result can be missing
    This is an incomplete, or set-valued, observation {0,1} for A

What is p(C = 0 | A is missing)?

Example (ctd)

Kolmogorov’s definition of conditional probability seems to say
  p(C = 0 | A ∈ {0,1}) = p(C = 0) = 0.99
  i.e., with high probability the patient is healthy

Is this right?
In general, it is not
Why?

Why?

Because A can be selectively reported
  e.g., the medical test machine is broken; it produces an output ⇔ the test is negative (A = 0)
  In this case p(C = 0 | A is missing) = p(C = 0 | A = 1) = 0
  The patient is definitely ill!
  Compare this with the former naive application of Kolmogorov’s updating (or naive updating, for short)

Modeling it the right way

Observations-generating model
  o is a generic value for O, another random variable
  o can be 0, 1, or * (i.e., missing value for A)
IM = p(O | C,A) should not be neglected!
The correct overall model we need is p(C,A)p(O | C,A)

[Diagram: p(C,A), the distribution generating pairs for (C,A) → (c,a), the complete pair (not observed) → IM, the incompleteness mechanism → o, the actual observation about A]
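The overall model p(C,A)p(O | C,A) can be made concrete with the numbers from the diagnosis example. The sketch below is an illustration, not something taken from the talk: the function name, the dictionary layout, and the missingness probability 0.3 used in the first IM are assumptions chosen for the example.

```python
# Prior p(C, A) from the diagnosis example (pairs not listed have probability 0).
prior = {(0, 0): 0.99, (1, 1): 0.01}


def posterior_c0_given_missing(prior, p_missing):
    """p(C = 0 | O = *) under the overall model p(C,A) p(O | C,A).

    p_missing[(c, a)] is the (hypothetical) IM entry p(O = * | C = c, A = a).
    """
    # Joint probability of each complete pair together with the observation O = *.
    joint_missing = {ca: p * p_missing[ca] for ca, p in prior.items()}
    total = sum(joint_missing.values())
    if total == 0:
        raise ValueError("O = * has probability zero under this IM")
    return sum(p for (c, _), p in joint_missing.items() if c == 0) / total


# IM 1 (assumed): the result goes missing with the same probability for every pair (MAR).
mar_im = {(0, 0): 0.3, (1, 1): 0.3}
# IM 2: the broken machine, which produces an output only when the test is negative.
broken_im = {(0, 0): 0.0, (1, 1): 1.0}

p_mar = posterior_c0_given_missing(prior, mar_im)
p_broken = posterior_c0_given_missing(prior, broken_im)
print(p_mar, p_broken)
```

Under the first IM the result agrees with naive updating (about 0.99); under the broken-machine IM the same missing observation gives 0, as on the previous slide.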

What about Bayesian nets (BNs)?

Asia net
Let us predict C on the basis of the observation (L,S,T) = (y,y,n)
BN updating instructs us to use p(C | L = y,S = y,T = n) to predict C

[Figure: Asia net with nodes (V)isit to Asia, (S)moking = y, (T)uberculosis = n, Lung (C)ancer?, Bronc(H)itis, Abnorma(L) X-rays = y, (D)yspnea]

Asia (ctd)

Should we really use p(C | L = y,S = y,T = n) to predict C?
  (V,H,D) is missing
  (L,S,T,V,H,D) = (y,y,n,*,*,*) is an incomplete observation
p(C | L = y,S = y,T = n) is just the naive updating
By using the naive updating, we are neglecting the IM!
Wrong inference in general

New problem?

Problems with naive updating have been clear since at least 1985 (Shafer)
Practical consequences were not so clear
  How often does naive updating cause problems?
  Perhaps it is not a problem in practice?

Grünwald & Halpern (UAI-2002) on naive updating

Three points made strongly
1) naive updating works ⇔ CAR holds
   i.e., neglecting the IM is correct ⇔ CAR holds
   With missing data: CAR (coarsening at random) = MAR (missing at random) = p(A is missing | c,a) is the same for all pairs (c,a)
2) CAR holds rather infrequently
3) The IM, p(O | C,A), can be difficult to model

2 & 3 = serious theoretical & practical problem
How should we do updating given 2 & 3?

What this paper is about

Have a conservative (i.e., robust) point of view
  Deliberately worst case, as opposed to the MAR best case
Assume little knowledge about the IM
  You are not allowed to assume MAR
  You are not able/willing to model the IM explicitly
Derive an updating rule for this important case
  Conservative updating rule

1st step: plug ignorance into your model

Fact: the IM is unknown
  p(O ∈ {0,1,*} | C,A) = 1 is a constraint on p(O | C,A), i.e., any distribution p(O | C,A) is possible
  This is too conservative; to draw useful conclusions we need a little less ignorance
Consider the set of all p(O | C,A) s.t. p(O | C,A) = p(O | A)
  i.e., all the IMs which do not depend on what you want to predict
Use this set of IMs jointly with prior information p(C,A)

[Diagram: p(C,A), the known prior distribution → (c,a), the complete pair (not observed) → IM, the unknown incompleteness mechanism → o, the actual observation about A]

2nd step: derive the conservative updating

Let E = evidence = observed variables, in state e
Let R = remaining unobserved variables (except C)

Formal derivation yields:
1) All the values for R should be considered
2) In particular, updating becomes:

Conservative Updating Rule (CUR)
  min_{r ∈ R} p(c | E = e, R = r) ≤ p(c | o) ≤ max_{r ∈ R} p(c | E = e, R = r)
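The rule can be tried out on a small table. The sketch below is only an illustration under stated assumptions: the joint p(C,A) is made up (it is not the diagnosis example, which has zero-probability pairs), E is empty, the single attribute A plays the role of R, and all function names are hypothetical. It computes the CUR bounds by enumeration and then samples IMs of the form p(O = * | C,A) = p(O = * | A) to check that the resulting p(c | O = *) stays inside them.

```python
import random

# Hypothetical joint p(C, A) with full support (illustrative numbers only).
prior = {(0, 0): 0.60, (0, 1): 0.10, (1, 0): 0.05, (1, 1): 0.25}


def conditional(prior, c, a):
    """p(C = c | A = a)."""
    return prior[(c, a)] / sum(p for (_, ai), p in prior.items() if ai == a)


def cur_bounds(prior, c, a_values=(0, 1)):
    """CUR: min and max over the values a of p(c | A = a)."""
    probs = [conditional(prior, c, a) for a in a_values]
    return min(probs), max(probs)


def posterior_given_missing(prior, c, p_missing_given_a):
    """p(c | O = *) for an IM with p(O = * | C, A) = p(O = * | A)."""
    joint = {ca: p * p_missing_given_a[ca[1]] for ca, p in prior.items()}
    return sum(p for (ci, _), p in joint.items() if ci == c) / sum(joint.values())


lower, upper = cur_bounds(prior, c=1)

# Sample IMs that do not depend on C and record the posterior each one yields.
rng = random.Random(0)
sampled = []
for _ in range(1000):
    im = {0: rng.uniform(0.01, 1.0), 1: rng.uniform(0.01, 1.0)}
    sampled.append(posterior_given_missing(prior, 1, im))

print(lower, upper, min(sampled), max(sampled))
```

Each sampled posterior is a weighted average of p(c | A = 0) and p(c | A = 1), which is why it cannot leave the interval spanned by the minimum and the maximum.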

CUR & Bayesian nets

Evidence: (L,S,T) = (y,y,n)
What is your posterior confidence on C = y?
  Consider all the joint values of nodes in R
  Take min & max of p(C = y | L = y,S = y,T = n,v,h,d)
  Posterior confidence ∈ [0.42,0.71]
Computational note: only Markov blanket matters!

[Figure: Asia net, with (T)uberculosis = n and (D)yspnea labelled]

A few remarks

The CUR…
  is based only on p(C,A), like the naive updating
  produces lower & upper probabilities
  can produce indecision

CUR & decision-making

Decisions
  c' dominates c'' (c',c'' ∈ C) if for all r ∈ R,
    p(c' | E = e, R = r) > p(c'' | E = e, R = r)

Indecision?
  It may happen that ∃ r',r'' ∈ R so that:
    p(c' | E = e, R = r') > p(c'' | E = e, R = r')
  and
    p(c' | E = e, R = r'') < p(c'' | E = e, R = r'')
  There is no evidence that you should prefer c' to c'' and vice versa (= keep both)
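The dominance criterion and the resulting indecision can be sketched by plain enumeration over the values of R. The function names and the two conditional tables below are hypothetical, chosen only to show one indecisive and one decisive case; this is not the algorithm provided in the paper.

```python
def dominates(cond, c1, c2, r_values):
    """True if p(c1 | e, r) > p(c2 | e, r) for every r in r_values.

    cond[(c, r)] holds p(C = c | E = e, R = r) for the fixed evidence e.
    """
    return all(cond[(c1, r)] > cond[(c2, r)] for r in r_values)


def undominated(cond, classes, r_values):
    """Classes that no other class dominates; more than one means indecision."""
    return [c for c in classes
            if not any(dominates(cond, other, c, r_values)
                       for other in classes if other != c)]


# Hypothetical conditional tables (illustrative numbers only).
r_values = ["r1", "r2"]
indecisive = {("y", "r1"): 0.7, ("n", "r1"): 0.3,
              ("y", "r2"): 0.4, ("n", "r2"): 0.6}
decisive = {("y", "r1"): 0.2, ("n", "r1"): 0.8,
            ("y", "r2"): 0.4, ("n", "r2"): 0.6}

keep_both = undominated(indecisive, ["y", "n"], r_values)
keep_one = undominated(decisive, ["y", "n"], r_values)
print(keep_both, keep_one)
```

In the first table the preferred class flips between r1 and r2, so both classes are kept; in the second, n beats y for every r, so only n survives.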

Decision-making example

Evidence: E = (L,S,T) = (y,y,n) = e
What is your diagnosis for C?
  p(C = y | E = e, H = n, D = y) > p(C = n | E = e, H = n, D = y)
  p(C = y | E = e, H = n, D = n) < p(C = n | E = e, H = n, D = n)
  Both C = y and C = n are plausible

Evidence: E = (L,S,T) = (y,y,y) = e
  C = n dominates C = y: “cancer” is ruled out

[Figure: Asia net, with (T)uberculosis and (D)yspnea labelled]

Algorithmic facts

CUR ⇒ restrict attention to Markov blanket
State enumeration still prohibitive in some cases
  e.g., naive Bayes
Dominance test based on dynamic programming
  Linear in the number of children of class node C
However: decision-making possible in linear time, by provided algorithm, even on some multiply connected nets!
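To see why enumeration can be avoided in the naive Bayes case, note that when the attributes are conditionally independent given C, the ratio p(c' | e, r) / p(c'' | e, r) factorizes over the attributes, so its minimum over r can be taken one missing attribute at a time. The sketch below illustrates that special case only; it is not the dynamic-programming algorithm provided in the paper, and the function names, data layout, and probability tables are assumptions made for the example. It also assumes all conditional probabilities are strictly positive.

```python
from itertools import product


def joint(p_c, p_attr, c, assignment):
    """p(c, attributes) for a naive Bayes model: p(c) * prod_i p(a_i | c)."""
    value = p_c[c]
    for name, v in assignment.items():
        value *= p_attr[name][c][v]
    return value


def dominates_enum(p_c, p_attr, c1, c2, evidence, missing, values):
    """Brute force: p(c1, e, r) > p(c2, e, r) for every joint value r (exponential)."""
    for r in product(values, repeat=len(missing)):
        full = {**evidence, **dict(zip(missing, r))}
        if not joint(p_c, p_attr, c1, full) > joint(p_c, p_attr, c2, full):
            return False
    return True


def dominates_linear(p_c, p_attr, c1, c2, evidence, missing, values):
    """Same test, minimising the likelihood ratio one missing attribute at a time."""
    ratio = p_c[c1] / p_c[c2]
    for name, v in evidence.items():
        ratio *= p_attr[name][c1][v] / p_attr[name][c2][v]
    for name in missing:
        ratio *= min(p_attr[name][c1][v] / p_attr[name][c2][v] for v in values)
    return ratio > 1


# Hypothetical naive Bayes model: class C in {y, n}, three binary attributes.
p_c = {"y": 0.5, "n": 0.5}
p_attr = {
    "a1": {"y": {0: 0.1, 1: 0.9}, "n": {0: 0.8, 1: 0.2}},
    "a2": {"y": {0: 0.4, 1: 0.6}, "n": {0: 0.5, 1: 0.5}},
    "a3": {"y": {0: 0.45, 1: 0.55}, "n": {0: 0.5, 1: 0.5}},
}
args = ({"a1": 1}, ["a2", "a3"], (0, 1))  # a1 observed; a2 and a3 missing

print(dominates_enum(p_c, p_attr, "y", "n", *args),
      dominates_linear(p_c, p_attr, "y", "n", *args))
```

Comparing joint probabilities is equivalent to comparing the conditionals here, because both share the denominator p(e, r). The second function visits each attribute once, which is consistent with a cost linear in the number of children of C.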

On the application side

Important characteristics of present approach
  Robust approach, easy to implement
  Does not require changes in pre-existing BN knowledge bases
    based on p(C,A) only!
  Markov blanket ⇒ favors low computational complexity
  If you can write down the IM explicitly, your decisions/inferences will be contained in ours

By-product for large networks
  Even when naive updating is OK, CUR can serve as a useful preprocessing phase
  Restricting attention to Markov blanket may produce strong enough inferences and decisions

What we did in the paper

Theory of coherent lower previsions (imprecise probabilities)
  Coherence
  Equivalent to a large extent to sets of probability distributions
  Weaker assumptions
CUR derived in quite a general framework

Concluding notes

There are cases when:
  IM is unknown/difficult to model
  MAR does not hold
  Serious theoretical and practical problem
CUR applies
  Robust to the unknown IM
  Computationally easy decision-making with BNs
CUR works with credal nets, too
  Same complexity
Future: how to make stronger inferences and decisions
  Hybrid MAR/non-MAR modeling?
