Updating with incomplete observations (UAI-2003)

Marco Zaffalon
“Dalle Molle” Institute for Artificial Intelligence (IDSIA), SWITZERLAND
http://www.idsia.ch/~zaffalon
[email protected]

Gert de Cooman
SYSTeMS research group, BELGIUM
http://ippserv.ugent.be/~gert
[email protected]

What are incomplete observations?
A simple example
  C (class) and A (attribute) are Boolean random variables
  C = 1 is the presence of a disease
  A = 1 is the positive result of a medical test
Let us do diagnosis
  Good point: you know that
    p(C = 0, A = 0) = 0.99
    p(C = 1, A = 1) = 0.01
    Whence p(C = 0 | A = a) is either 0 or 1, so observing A allows you to make a sure diagnosis
  Bad point: the test result can be missing
    This is an incomplete, or set-valued, observation {0,1} for A
What is p(C = 0 | A is missing)?

Example ctd
Kolmogorov’s definition of conditional probability seems to say
  p(C = 0 | A ∈ {0,1}) = p(C = 0) = 0.99
  i.e., with high probability the patient is healthy
Is this right?
  In general, it is not
  Why?

Why?
Because A can be selectively reported
  e.g., the medical test machine is broken; it produces an output if and only if the test is negative (A = 0)
  In this case p(C = 0 | A is missing) = p(C = 0 | A = 1) = 0
  The patient is definitely ill!
  Compare this with the former naive application of Kolmogorov’s updating (or naive updating, for short)
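The gap between naive updating and the answer under the broken-machine scenario can be checked numerically. The sketch below is illustrative only: the dictionary layout and the function names `naive_update` and `update_with_im` are assumptions, and the broken machine is encoded as p(O = * | A = 1) = 1 and p(O = * | A = 0) = 0.

```python
# Joint prior p(C, A) from the example; pairs not listed have probability 0.
joint = {(0, 0): 0.99, (1, 1): 0.01}

def naive_update(joint, c):
    """Naive updating: condition on A in {0, 1}, i.e. marginalize A out."""
    total = sum(joint.values())
    return sum(p for (ci, _), p in joint.items() if ci == c) / total

def update_with_im(joint, p_missing, c):
    """p(C = c | A is missing) under the overall model p(C, A) p(O | C, A).

    p_missing[(c, a)] is p(O = * | C = c, A = a), the incompleteness mechanism.
    """
    num = sum(p * p_missing[ca] for ca, p in joint.items() if ca[0] == c)
    den = sum(p * p_missing[ca] for ca, p in joint.items())
    return num / den

# Broken machine (assumed encoding): an output appears only when A = 0,
# so the result is missing exactly when A = 1.
broken = {(0, 0): 0.0, (1, 1): 1.0}

p_naive = naive_update(joint, c=0)          # close to 0.99
p_true = update_with_im(joint, broken, c=0)  # 0.0: the patient is ill
print(p_naive, p_true)
```

The two numbers disagree as much as they possibly can, which is the point of the slide: the same missing value supports opposite diagnoses depending on how it came to be missing.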

Modeling it the right way
Observations-generating model
  o is a generic value for O, another random variable
  o can be 0, 1, or * (i.e., missing value for A)
IM = p(O | C,A) should not be neglected!
The correct overall model we need is p(C,A)p(O | C,A)
[Diagram: p(C,A), the distribution generating pairs for (C,A) → (c,a), the complete pair (not observed) → IM, the Incompleteness Mechanism → o, the actual observation about A]
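Under the overall model p(C,A)p(O | C,A), the posterior given an actual observation o follows from Bayes' rule. The worked equation below is a sketch in the slides' notation; the summation indices a and c' (ranging over the values of A and C) are introduced here for illustration.

```latex
p(c \mid o) \;=\; \frac{\sum_{a} p(c,a)\, p(o \mid c,a)}
                       {\sum_{c'} \sum_{a} p(c',a)\, p(o \mid c',a)}
```

Naive updating amounts to dropping the factors p(o | c,a), which is harmless when they take the same value for every pair (c,a) compatible with o, since they then cancel between numerator and denominator.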

What about Bayesian nets (BNs)?
Asia net
Let us predict C on the basis of the observation (L,S,T) = (y,y,n)
BN updating instructs us to use p(C | L = y,S = y,T = n) to predict C
[Diagram: Asia net with nodes (V)isit to Asia, (S)moking = y, (T)uberculosis = n, Lung (C)ancer?, Bronc(H)itis, Abnorma(L) X-rays = y, (D)yspnea]

Asia ctd
Should we really use p(C | L = y,S = y,T = n) to predict C?
(V,H,D) is missing
  (L,S,T,V,H,D) = (y,y,n,*,*,*) is an incomplete observation
p(C | L = y,S = y,T = n) is just the naive updating
By using the naive updating, we are neglecting the IM!
  Wrong inference in general

New problem?
Problems with naive updating have been clear since at least 1985 (Shafer)
Practical consequences were not so clear
  How often does naive updating cause problems?
  Perhaps it is not a problem in practice?

Grünwald & Halpern (UAI-2002) on naive updating
Three points made strongly
1) naive updating works ⇔ CAR holds
   i.e., neglecting the IM is correct ⇔ CAR holds
   With missing data: CAR (coarsening at random) = MAR (missing at random) = p(A is missing | c,a) is the same for all pairs (c,a)
2) CAR holds rather infrequently
3) The IM, p(O | C,A), can be difficult to model
2 & 3 = serious theoretical & practical problem
How should we do updating given 2 & 3?
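The MAR condition stated above, and its link to naive updating, can be illustrated on a toy model. This is a minimal sketch: the joint prior, the two incompleteness mechanisms, and the helper names `is_mar`, `posterior_given_missing` and `naive` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical joint prior p(C, A); the numbers are illustrative only.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def is_mar(p_missing, tol=1e-12):
    """MAR as on the slide: p(A is missing | c, a) is the same for all (c, a)."""
    values = list(p_missing.values())
    return all(math.isclose(v, values[0], abs_tol=tol) for v in values)

def posterior_given_missing(joint, p_missing, c):
    """p(C = c | A is missing) computed from p(C, A) p(O = * | C, A)."""
    num = sum(p * p_missing[ca] for ca, p in joint.items() if ca[0] == c)
    den = sum(p * p_missing[ca] for ca, p in joint.items())
    return num / den

def naive(joint, c):
    """Naive updating: ignore the IM and return the marginal p(C = c)."""
    return sum(p for (ci, _), p in joint.items() if ci == c)

# Constant missingness probability: MAR holds.
mar_im = {ca: 0.3 for ca in joint}
# Missingness probability depends on the value of A: MAR fails.
non_mar_im = {(c, a): (0.1 if a == 0 else 0.9) for (c, a) in joint}

for name, im in [("MAR", mar_im), ("non-MAR", non_mar_im)]:
    print(name, is_mar(im), posterior_given_missing(joint, im, 0), naive(joint, 0))
```

With the constant mechanism the two answers coincide; with the A-dependent mechanism they differ, even though naive updating never looks at the mechanism at all.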

What this paper is about
Have a conservative (i.e., robust) point of view
  Deliberately worst case, as opposed to the MAR best case
Assume little knowledge about the IM
  You are not allowed to assume MAR
  You are not able/willing to model the IM explicitly
Derive an updating rule for this important case
  Conservative updating rule

1st step: plug ignorance into your model
Fact: the IM is unknown
  p(O ∈ {0,1,*} | C,A) = 1 is the only constraint on p(O | C,A)
  i.e., any distribution p(O | C,A) is possible
This is too conservative; to draw useful conclusions we need a little less ignorance
  Consider the set of all p(O | C,A) s.t. p(O | C,A) = p(O | A)
  i.e., all the IMs which do not depend on what you want to predict
Use this set of IMs jointly with prior information p(C,A)
[Diagram: p(C,A), the known prior distribution → (c,a), the complete pair (not observed) → IM, the unknown Incompleteness Mechanism → o, the actual observation about A]
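To see what using a whole set of IMs means in practice, one can sweep over mechanisms that depend on A only and record the resulting posteriors. The sketch below assumes a hypothetical joint prior and writes alpha[a] for p(O = * | A = a); both the numbers and the names are illustrative, and the grid only samples the set of IMs rather than covering it.

```python
import itertools

# Hypothetical joint prior p(C, A); the numbers are illustrative only.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def posterior_missing(joint, alpha, c):
    """p(C = c | O = *) when the IM depends on A only: alpha[a] = p(O = * | A = a)."""
    num = sum(p * alpha[a] for (ci, a), p in joint.items() if ci == c)
    den = sum(p * alpha[a] for (_, a), p in joint.items())
    return num / den

def conditional(joint, c, a):
    """p(C = c | A = a) computed from the joint."""
    return joint[(c, a)] / sum(p for (_, ai), p in joint.items() if ai == a)

grid = [i / 10 for i in range(11)]
posteriors = [
    posterior_missing(joint, {0: a0, 1: a1}, c=0)
    for a0, a1 in itertools.product(grid, grid)
    if a0 > 0 or a1 > 0  # O = * must have positive probability
]

lower = min(conditional(joint, 0, a) for a in (0, 1))
upper = max(conditional(joint, 0, a) for a in (0, 1))
print(min(posteriors), max(posteriors), (lower, upper))
```

Every posterior in the sweep is a weighted average of p(C = 0 | A = 0) and p(C = 0 | A = 1), so it stays between the smallest and the largest of these two conditionals; the extremes are reached when one value of A is never reported as missing.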

2nd step: derive the conservative updating
Let E = evidence = observed variables, in state e
Let R = remaining unobserved variables (except C)
Formal derivation yields:
1) All the values for R should be considered
2) In particular, updating becomes:
Conservative Updating Rule (CUR)
  min_{r ∈ R} p(c | E = e, R = r) ≤ p(c | o) ≤ max_{r ∈ R} p(c | E = e, R = r)
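The rule can be applied by brute-force enumeration of the states of R when a full joint table is available. The following is a minimal sketch under stated assumptions: the joint table over (C, E, R) is hypothetical, every p(e, r) is assumed strictly positive, and `cur_bounds` is an illustrative name rather than anything defined in the slides.

```python
# Hypothetical joint table p(C = c, E = e, R = r), keyed by (c, e, r).
joint = {
    (0, 0, 0): 0.15, (0, 0, 1): 0.10, (1, 0, 0): 0.05, (1, 0, 1): 0.05,
    (0, 1, 0): 0.20, (0, 1, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}
C_VALUES = (0, 1)
R_VALUES = (0, 1)

def cur_bounds(joint, c, e, c_values, r_values):
    """Lower and upper probability of C = c given E = e under the CUR.

    Assumes p(E = e, R = r) > 0 for every state r of the unobserved variables.
    """
    conditionals = []
    for r in r_values:
        p_er = sum(joint[(ci, e, r)] for ci in c_values)
        conditionals.append(joint[(c, e, r)] / p_er)  # p(c | e, r)
    return min(conditionals), max(conditionals)

def naive_posterior(joint, c, e, c_values, r_values):
    """Naive updating: p(c | e), with R marginalized out."""
    num = sum(joint[(c, e, r)] for r in r_values)
    den = sum(joint[(ci, e, r)] for ci in c_values for r in r_values)
    return num / den

lo, hi = cur_bounds(joint, c=1, e=1, c_values=C_VALUES, r_values=R_VALUES)
print(lo, hi, naive_posterior(joint, 1, 1, C_VALUES, R_VALUES))
```

In this toy table the naive posterior falls inside the CUR interval, as expected, since p(c | e) is itself a weighted average of the p(c | e, r).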

CUR & Bayesian nets
Evidence: (L,S,T) = (y,y,n)
What is your posterior confidence on C = y?
  Consider all the joint values of nodes in R
  Take min & max of p(C = y | L = y,S = y,T = n,v,h,d)
  Posterior confidence ∈ [0.42,0.71]
Computational note: only Markov blanket matters!
[Diagram: Asia net with nodes (V)isit to Asia, (S)moking = y, (T)uberculosis = n, Lung (C)ancer?, Bronc(H)itis, Abnorma(L) X-rays = y, (D)yspnea]

A few remarks
The CUR…
  is based only on p(C,A), like the naive updating
  produces lower & upper probabilities
  can produce indecision

CUR & decision-making
Decisions
  c’ dominates c’’ (c’,c’’ ∈ C) if for all r ∈ R,
    p(c’ | E = e, R = r) > p(c’’ | E = e, R = r)
Indecision?
  It may happen that ∃ r’,r’’ ∈ R so that:
    p(c’ | E = e, R = r’) > p(c’’ | E = e, R = r’)
    and
    p(c’ | E = e, R = r’’) < p(c’’ | E = e, R = r’’)
  There is no evidence that you should prefer c’ to c’’ and vice versa (= keep both)
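The dominance criterion and the resulting indecision can be sketched by enumeration. This is an illustrative example only: the joint table is hypothetical, p(e, r) is assumed strictly positive, and the function names `dominates` and `undominated` are not taken from the slides.

```python
# Hypothetical joint table p(C = c, E = e, R = r), keyed by (c, e, r).
joint = {
    (0, 0, 0): 0.15, (0, 0, 1): 0.10, (1, 0, 0): 0.05, (1, 0, 1): 0.05,
    (0, 1, 0): 0.20, (0, 1, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}
C_VALUES = (0, 1)
R_VALUES = (0, 1)

def dominates(joint, c1, c2, e, r_values):
    """True if p(c1 | e, r) > p(c2 | e, r) for every r.

    Both sides share the denominator p(e, r), assumed positive, so it is
    enough to compare the joint values p(c1, e, r) and p(c2, e, r).
    """
    return all(joint[(c1, e, r)] > joint[(c2, e, r)] for r in r_values)

def undominated(joint, c_values, e, r_values):
    """Classes that no other class dominates; more than one means indecision."""
    return [
        c for c in c_values
        if not any(dominates(joint, other, c, e, r_values)
                   for other in c_values if other != c)
    ]

print(undominated(joint, C_VALUES, e=0, r_values=R_VALUES))  # a single class
print(undominated(joint, C_VALUES, e=1, r_values=R_VALUES))  # both classes kept
```

For e = 0 class 0 wins for every r, so it is the only undominated class; for e = 1 the preferred class flips with r, so both are kept.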

Decision-making example
Evidence: E = (L,S,T) = (y,y,n) = e
  What is your diagnosis for C?
    p(C = y | E = e, H = n, D = y) > p(C = n | E = e, H = n, D = y)
    p(C = y | E = e, H = n, D = n) < p(C = n | E = e, H = n, D = n)
    Both C = y and C = n are plausible
Evidence: E = (L,S,T) = (y,y,y) = e
  C = n dominates C = y: “cancer” is ruled out
[Diagram: Asia net with nodes (V)isit to Asia, (S)moking = y, (T)uberculosis, Lung (C)ancer?, Bronc(H)itis, Abnorma(L) X-rays = y, (D)yspnea]

Algorithmic facts
CUR ⇒ restrict attention to the Markov blanket
State enumeration still prohibitive in some cases
  e.g., naive Bayes
Dominance test based on dynamic programming
  Linear in the number of children of class node C
However: decision-making is possible in linear time, by the provided algorithm, even on some multiply connected nets!
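For the naive Bayes case mentioned above, a linear-time dominance test can be sketched without enumerating the joint states of the missing children. This is not the dynamic-programming algorithm of the paper, which the slides do not spell out; it is a simpler special-case sketch that assumes strictly positive conditional probabilities, and the model, the data layout and the function names are hypothetical. Because the joint factorizes over children, the worst-case ratio p(c1, e, r) / p(c2, e, r) is the product of per-child worst-case ratios.

```python
import itertools
import math

# Hypothetical naive Bayes model: prior[c] = p(C = c),
# cpts[j][c][v] = p(child j = v | C = c). All entries are strictly positive.
prior = {0: 0.6, 1: 0.4}
cpts = [
    {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}},
    {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}},
    {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}},
]

def dominates_linear(prior, cpts, evidence, c1, c2):
    """One pass over the children; evidence[j] is a value, or None if missing."""
    ratio = prior[c1] / prior[c2]
    for cpt, obs in zip(cpts, evidence):
        if obs is None:
            # Missing child: take the value least favourable to c1.
            ratio *= min(cpt[c1][v] / cpt[c2][v] for v in cpt[c1])
        else:
            ratio *= cpt[c1][obs] / cpt[c2][obs]
    return ratio > 1.0

def dominates_brute_force(prior, cpts, evidence, c1, c2):
    """Reference check: enumerate every joint state of the missing children."""
    missing = [j for j, obs in enumerate(evidence) if obs is None]
    domains = [list(cpts[j][c1]) for j in missing]
    for values in itertools.product(*domains):
        full = list(evidence)
        for j, v in zip(missing, values):
            full[j] = v
        p1 = prior[c1] * math.prod(cpts[j][c1][v] for j, v in enumerate(full))
        p2 = prior[c2] * math.prod(cpts[j][c2][v] for j, v in enumerate(full))
        if not p1 > p2:
            return False
    return True

for evidence in ([0, None, None], [0, None, 0]):
    print(evidence,
          dominates_linear(prior, cpts, evidence, 0, 1),
          dominates_brute_force(prior, cpts, evidence, 0, 1))
```

The brute-force version is exponential in the number of missing children and is included only to check the linear pass on this small model.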

On the application side
Important characteristics of present approach
  Robust approach, easy to implement
  Does not require changes in pre-existing BN knowledge bases
    based on p(C,A) only!
  Markov blanket ⇒ favors low computational complexity
  If you can write down the IM explicitly, your decisions/inferences will be contained in ours
By-product for large networks
  Even when naive updating is OK, CUR can serve as a useful preprocessing phase
  Restricting attention to Markov blanket may produce strong enough inferences and decisions

What we did in the paper
Theory of coherent lower previsions (imprecise probabilities)
  Coherence
  Equivalent to a large extent to sets of probability distributions
  Weaker assumptions
CUR derived in quite a general framework

Concluding notes
There are cases when:
  IM is unknown/difficult to model
  MAR does not hold
  Serious theoretical and practical problem
CUR applies
  Robust to the unknown IM
  Computationally easy decision-making with BNs
CUR works with credal nets, too
  Same complexity
Future: how to make stronger inferences and decisions
  Hybrid MAR/non-MAR modeling?