Introduction to Bayesian Belief Nets
Russ Greiner, Dep’t of Computing Science
Alberta Ingenuity Centre for Machine Learning, University of Alberta
http://www.cs.ualberta.ca/~greiner/bn.html
Motivation
Gates says [LATimes, 28/Oct/96]: one of Microsoft’s competitive advantages is its expertise in “Bayesian networks”
Current products: Microsoft Pregnancy and Child Care (MSN); Answer Wizard (Office, …); Print Troubleshooter; Excel Workbook Troubleshooter; Office 95 Setup Media Troubleshooter; Windows NT 4.0 Video Troubleshooter; Word Mail Merge Troubleshooter
Motivation (II)
US Army: SAIP (battalion detection from SAR, IR, … Gulf War)
NASA: Vista (DSS for Space Shuttle)
GE: Gems (real-time monitor for utility generators)
Intel: infer possible processing problems from end-of-line tests on semiconductor chips
KIC:
medical: sleep disorders, pathology, trauma care, hand and wrist evaluations, dermatology, home-based health evaluations
DSS for capital equipment: locomotives, gas-turbine engines, office equipment
Motivation (III)
Lymph-node pathology diagnosis; manufacturing control; software diagnosis; information retrieval
Types of tasks: Classification/Regression; Sensor Fusion; Prediction/Forecasting
Outline
Existing uses of Belief Nets (BNs)
How to reason with BNs
Specific examples of BNs
Contrast with Rules, Neural Nets, …
Possible applications of BNs
Challenges: how to reason efficiently; how to learn BNs
[Doctor–patient cartoon]
Symptoms: chief complaint, history, …
Signs: physical exam, test results, …
Diagnosis → Plan: treatment, …
Objectives: Decision Support System
Determine which tests to perform, which repair to suggest, based on costs, sensitivity/specificity, …
Use all sources of information: symbolic (discrete observations, history, …) and signal (from sensors)
Handle partial information
Adapt to track fault distribution
Underlying Task
Situation: Given observations {O1=v1, …, Ok=vk} (symptoms, history, test results, …), what is the best diagnosis Dxi for the patient?
Approach 1: Use a set of “obs1 & … & obsm → Dxi” rules.
Seldom completely certain, but… need a rule for each situation:
for each diagnosis Dxr
for each set of possible values vj for Oj
for each subset of observations {Ox1, Ox2, …} ⊆ {Oj}
E.g., can’t use “If Temp>100 & BP=High & Cough=Yes → DiseaseX” if we only know Temp and BP.
Underlying Task, II
Situation: Given observations {O1=v1, …, Ok=vk} (symptoms, history, test results, …), what is the best diagnosis Dxi for the patient?
Approach 2: Compute probabilities of Dxi given the observations { obsj }:
P( Dx = u | O1= v1, …, Ok= vk )
Challenge: How to express probabilities?
How to Deal with Probabilities
• But… even with a binary Dx and 20 binary observations, that is >2,097,000 numbers!
P( Dx=T, O1=T, O2=T, …, ON=T ) = 0.03
P( Dx=T, O1=T, O2=T, …, ON=F ) = 0.4
…
P( Dx=T, O1=F, O2=F, …, ON=T ) = 0
…
P( Dx=F, O1=F, O2=F, …, ON=F ) = 0.01
• Then:
Marginalize: P( Dx=u, O1=v1, …, Ok=vk ) = Σ_{v_{k+1},…,v_N} P( Dx=u, O1=v1, …, Ok=vk, …, ON=vN )
Conditionalize: P( Dx=u | O1=v1, …, Ok=vk ) = P( Dx=u, O1=v1, …, Ok=vk ) / P( O1=v1, …, Ok=vk )
Sufficient: “atomic events” P( Dx=u, O1=v1, …, ON=vN ), for all 2^{1+N} values u ∈ {T, F}, vj ∈ {T, F}
Problems with “Atomic Events”
Representation is not intuitive. Should make “connections” explicit; use “local information”.
Too many numbers – O(2^N):
Hard to store
Hard to use [must add 2^r values to marginalize over r variables]
Hard to learn [takes O(2^N) samples to learn 2^N parameters]
Include only the necessary “connections” ⇒ Belief Nets:
P(Jaundice | Hepatitis), P(LightDim | BadBattery), …
? Hepatitis?
? Hepatitis, not Jaundiced, but +BloodTest?
Jaundiced
BloodTest
Hepatitis Example
• (Boolean) variables: H Hepatitis, J Jaundice, B (positive) Blood test
• Want P( H=1 | J=0, B=1 ), …, P( H=1 | B=1, J=1 ), P( H=1 | B=0, J=0 ), …
• Alternatively…
Option 1: store the full joint table:
J B H  P(J, B, H)
0 0 0  0.03395
0 0 1  0.0095
0 1 0  0.0003
0 1 1  0.1805
1 0 0  0.01455
1 0 1  0.038
1 1 0  0.00045
1 1 1  0.722
… Marginalize/conditionalize to get P( H=1 | J=0, B=1 ), …
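Option 1 can be sketched in a few lines of code (a brute-force illustration; the helper `prob` and its keyword arguments are my naming, not the deck’s):

```python
# Joint table P(J, B, H) from the slide, keyed by (j, b, h).
joint = {
    (0, 0, 0): 0.03395, (0, 0, 1): 0.0095,
    (0, 1, 0): 0.0003,  (0, 1, 1): 0.1805,
    (1, 0, 0): 0.01455, (1, 0, 1): 0.038,
    (1, 1, 0): 0.00045, (1, 1, 1): 0.722,
}

def prob(j=None, b=None, h=None):
    """Marginal probability: sum the rows that match the given values (None = sum out)."""
    return sum(p for (jj, bb, hh), p in joint.items()
               if (j is None or jj == j)
               and (b is None or bb == b)
               and (h is None or hh == h))

# Conditionalize: P(H=1 | J=0, B=1) = P(J=0, B=1, H=1) / P(J=0, B=1)
posterior = prob(j=0, b=1, h=1) / prob(j=0, b=1)
```

Every query costs a pass over all 2^3 rows here, and over all 2^{1+N} rows in general, which is exactly the problem the next slide raises.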
Encoding Causal Links
Simple Belief Net: H → B, H → J, B → J

P(H=0)  P(H=1)
0.95    0.05

h  P(B=0 | H=h)  P(B=1 | H=h)
0  0.97          0.03
1  0.05          0.95

h b  P(J=0 | h, b)  P(J=1 | h, b)
0 0  0.7            0.3
0 1  0.7            0.3
1 0  0.2            0.8
1 1  0.2            0.8

Node ~ Variable; Link ~ “causal dependency”; “CPTable” ~ P(child | parents)
Encoding Causal Links
[Network: H → B, H → J, B → J]
P(J | H, B=0) = P(J | H, B=1) for all J, H  ⇒  P( J | H, B ) = P( J | H )
J is INDEPENDENT of B, once we know H ⇒ don’t need the B → J arc!

P(H=1)
0.05

h  P(B=1 | H=h)
1  0.95
0  0.03

h b  P(J=1 | h, b)
1 1  0.8
1 0  0.8
0 1  0.3
0 0  0.3
Encoding Causal Links
[Network: H → B, H → J]
P(J | H, B=0) = P(J | H, B=1) for all J, H  ⇒  P( J | H, B ) = P( J | H )
J is INDEPENDENT of B, once we know H ⇒ don’t need the B → J arc!

P(H=1)
0.05

h  P(B=1 | H=h)
1  0.95
0  0.03

h  P(J=1 | h)
1  0.8
0  0.3
Sufficient Belief Net
[Network: H → B, H → J]

P(H=1)
0.05

h  P(B=1 | H=h)
1  0.95
0  0.03

h  P(J=1 | h)
1  0.8
0  0.3

Requires: P(H=1), P(B=1 | H=h), P(J=1 | H=h) known
(Only 5 parameters, not 7)
Hence: P( H=1 | B=1, J=0 ) ∝ P(H=1) · P(B=1 | H=1) · P(J=0 | B=1, H=1)
     = P(H=1) · P(B=1 | H=1) · P(J=0 | H=1)
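The computation above can be sketched as code, using only the five slide parameters (the function and variable names are mine, not the deck’s):

```python
# Five parameters of the net H -> B, H -> J, from the slides.
P_H1 = 0.05
P_B1 = {1: 0.95, 0: 0.03}   # P(B=1 | H=h)
P_J1 = {1: 0.80, 0: 0.30}   # P(J=1 | H=h)

def joint(h, b, j):
    """P(H=h, B=b, J=j) = P(h) P(b|h) P(j|h), since J is independent of B given H."""
    ph = P_H1 if h else 1.0 - P_H1
    pb = P_B1[h] if b else 1.0 - P_B1[h]
    pj = P_J1[h] if j else 1.0 - P_J1[h]
    return ph * pb * pj

# P(H=1 | B=1, J=0): normalize the joint over the two values of H
num = joint(1, 1, 0)
posterior = num / (num + joint(0, 1, 0))   # a positive blood test without
                                           # jaundice still raises P(H) well above 0.05
```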
“Factoring”
B does depend on J: if J=1, then it is likely that H=1, and hence that B=1.
but… ONLY THROUGH H: if we know H=1, then B=1 is likely … and it doesn’t matter whether J=1 or J=0!
P( J=0 | B=1, H=1 ) = P( J=0 | H=1 )
N.b., B and J ARE correlated a priori: P( J | B ) ≠ P( J ).
GIVEN H, they become uncorrelated: P( J | B, H ) = P( J | H ).
[Network: H → B, H → J]
Factored Distribution
Symptoms independent, given Disease:
ReadingAbility and ShoeSize are dependent, P( ReadAbility | ShoeSize ) ≠ P( ReadAbility ),
but become independent, given Age: P( ReadAbility | ShoeSize, Age ) = P( ReadAbility | Age )
H Hepatitis, J Jaundice, B (positive) Blood test:
P( B | J ) ≠ P( B ), but P( B | J, H ) = P( B | H )
[Network: Age → ShoeSize, Age → Reading]
“Naïve Bayes”
Classification task: Given { O1 = v1, …, On = vn }, find the hi that maximizes P( H = hi | O1 = v1, …, On = vn ).
Given: P( H = hi ) and P( Oj = vj | H = hi )
Independent: P( Oj | H, Ok, … ) = P( Oj | H )
Find argmax_{hi} of
P( H = hi | O1 = v1, …, On = vn ) = (1/α) · P( H = hi ) · Π_j P( Oj = vj | H = hi )
[Network: H → O1, O2, …, On]
Naïve Bayes (con’t)
P( H = hi | O1 = v1, …, On = vn ) = P( H = hi ) Π_j P( Oj = vj | H = hi ) / P( O1 = v1, …, On = vn )
Normalizing term: P( O1 = v1, …, On = vn ) = Σ_i P( H = hi ) Π_j P( Oj = vj | H = hi )
(No need to compute it, as it is the same for all hi)
Easy to use for classification
Can use even if some vj’s are not specified
If k Dx’s and n Oi’s, requires only k priors and n·k pairwise conditionals
(Not 2^{n+k} … relatively easy to learn)

n    1 + 2n    2^{n+1} − 1
10   21        2,047
30   61        2,147,483,647

[Network: H → O1, O2, …, On]
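A sketch of such a classifier (the class names, observation names, and numbers are invented for illustration, not taken from the deck):

```python
# Naive Bayes: k priors P(H=hi) plus n*k conditionals P(Oj=1 | H=hi).
prior = {'flu': 0.1, 'cold': 0.3, 'well': 0.6}              # P(H = hi)
p_obs = {'fever': {'flu': 0.9, 'cold': 0.2, 'well': 0.01},
         'cough': {'flu': 0.7, 'cold': 0.8, 'well': 0.05}}  # P(Oj=1 | H = hi)

def score(h, obs):
    """Unnormalized P(H=h | obs) = P(h) * prod_j P(Oj=vj | h).
    Unspecified observations are simply skipped (partial information is fine)."""
    s = prior[h]
    for o, v in obs.items():
        s *= p_obs[o][h] if v else 1.0 - p_obs[o][h]
    return s

def classify(obs):
    # The normalizer is the same for every hi, so the argmax never needs it.
    return max(prior, key=lambda h: score(h, obs))

best = classify({'fever': True, 'cough': True})   # 'flu'
```

With only the cough reported, `classify({'cough': True})` returns `'cold'`, illustrating the “some vj’s not specified” point.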
Bigger Networks
Intuition: Show CAUSAL connections:
GeneticPH CAUSES Hepatitis; Hepatitis CAUSES Jaundice
If GeneticPH, then expect Jaundice: GeneticPH → Hepatitis → Jaundice
But only via Hepatitis: GeneticPH without Hepatitis ⇒ no extra Jaundice:
P( J | G ) ≠ P( J ), but P( J | G, H ) = P( J | H )
[Network: GeneticPH, LiverTrauma → Hepatitis → Jaundice, Bloodtest]

h  P(J=1 | h)
1  0.8
0  0.3

h  P(B=1 | h)
1  0.98
0  0.01

g lt  P(H=1 | g, lt)
1 1  0.82
1 0  0.10
0 1  0.45
0 0  0.04

P(I=1) = 0.20   P(H=1) = 0.32
Belief Nets
DAG structure
Each node ~ a variable v; v depends (only) on its parents
+ conditional probability: P( vi | parentsi = 0, 1, … )
v is INDEPENDENT of its non-descendants, given assignments to its parents
Given H = 1: D has no influence on J; J has no influence on B; etc.
[Network: D, I → H → J, B]
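The factorization this defines can be sketched as code. The structure follows the slide’s D, I → H → J, B example, but every CPT number below is an invented placeholder, not from the deck:

```python
from itertools import product

# A net as {node: (parents, table of P(node=1 | parent values))}.
net = {
    'D': ([], {(): 0.10}),
    'I': ([], {(): 0.20}),
    'H': (['D', 'I'], {(0, 0): 0.05, (0, 1): 0.40, (1, 0): 0.30, (1, 1): 0.80}),
    'J': (['H'], {(0,): 0.30, (1,): 0.80}),
    'B': (['H'], {(0,): 0.03, (1,): 0.95}),
}

def joint(assign):
    """P(full assignment) = product over nodes of P(value | parents' values)."""
    p = 1.0
    for var, (parents, cpt) in net.items():
        p1 = cpt[tuple(assign[q] for q in parents)]
        p *= p1 if assign[var] == 1 else 1.0 - p1
    return p

# Only 1+1+4+2+2 = 10 parameters instead of 2^5 - 1 = 31,
# yet the factored joint still sums to 1 over all 32 assignments:
total = sum(joint(dict(zip(net, bits))) for bits in product([0, 1], repeat=len(net)))
```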
Less Trivial Situations
• N.b., obs1 is not always independent of obs2 given H
• E.g., FamilyHistoryDepression ‘causes’ MotherSuicide and Depression;
MotherSuicide causes Depression (with or without F.H.Depression)
• Here, P( D | MS, FHD ) ≠ P( D | FHD )! Can still be done using a Belief Network, but need to specify:
P( FHD ): 1 number
P( MS | FHD ): 2 numbers
P( D | MS, FHD ): 4 numbers
[Network: FHD → MS, FHD → D, MS → D]

P(FHD=1)
0.001

f  P(MS=1 | FHD=f)
1  0.10
0  0.03

f m  P(D=1 | FHD=f, MS=m)
1 1  0.97
1 0  0.90
0 1  0.08
0 0  0.04
Example: Car Diagnosis
MammoNet
ALARM
A Logical Alarm Reduction Mechanism: 8 diagnoses, 16 findings, …
Troop Detection
ARCO1: Forecasting Oil Prices
ARCO1: Forecasting Oil Prices
Forecasting Potato Production
Warning System
Extensions
Find best values (posterior distributions) for SEVERAL (>1) “output” variables
Partial specification of “input” values: only a subset of the variables; only a “distribution” over each input variable
General variables: discrete, but domain > 2; continuous (Gaussian: x = Σi bi yi for parents {Yi})
Decision theory → Decision Nets (influence diagrams): making decisions, not just assigning probabilities
Storing P( v | p1, p2, …, pk ): general “CP tables” are O(2^k); Noisy-Or, Noisy-And, Noisy-Max; “decision trees”
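For instance, a Noisy-Or node needs only one parameter per parent instead of 2^k table rows. A sketch (the cause names and numbers are invented):

```python
# q[i] = P(effect stays off | only cause i is on): one "inhibition" per parent.
q = {'Hepatitis': 0.2, 'Gallstones': 0.5}

def p_on(active_causes):
    """P(child=1 | active causes): the child stays off only if every
    active cause is independently inhibited."""
    p_off = 1.0
    for cause in active_causes:
        p_off *= q[cause]
    return 1.0 - p_off
```

From two numbers this reproduces a whole 2-parent CPT: 0.0 with no cause on, 0.8 or 0.5 with a single cause, 0.9 with both.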
Outline
Existing uses of Belief Nets (BNs)
How to reason with BNs
Specific examples of BNs
Contrast with Rules, Neural Nets, …
Possible applications of BNs
Challenges: how to reason efficiently; how to learn BNs
Belief Nets vs Rules
Both have “locality”: specific clusters (rules / connected nodes)
Often the same nodes (representing propositions), but:
BN: Cause → Effect (“Hep → Jaundice”, P( J | H ))
Rule: Effect → Cause (“Jaundice → Hep”)
WHY?: It is easier for people to reason CAUSALLY, even if the use is DIAGNOSTIC
BNs provide an OPTIMAL way to deal with
+ Uncertainty
+ Vagueness (variable not given, or only a distribution)
+ Error
… signals meeting symbols …
BNs permit different “directions” of inference
Belief Nets vs Neural Nets
Both have “graph structure”, but:
BN: nodes have SEMANTICS; combination rules: sound probability
NN: nodes are arbitrary; combination rules are arbitrary
So it is harder to initialize a NN, and harder to explain a NN
(But perhaps easier to learn a NN from examples only?)
BNs can deal with partial information, and with different “directions” of inference
Belief Nets vs Markov Nets
Each uses “graph structure” to FACTOR a distribution … explicitly specifying dependencies, implicitly independencies …
but with subtle differences:
BNs capture “causality”, “hierarchies”; MNs capture “temporality”
Technical: BNs use DIRECTED arcs, and allow “induced dependencies” (A → C ← B):
I(A, {}, B): “A independent of B, given {}”
¬I(A, C, B): “A dependent on B, given C”
MNs use UNDIRECTED arcs, and allow other independencies (diamond A–B, A–C, B–D, C–D):
I(A, BC, D): A independent of D, given B, C
I(B, AD, C): B independent of C, given A, D
Uses of Belief Nets
#1 Medical Diagnosis: “assist/critique” the MD
identify diseases not ruled out
specify additional tests to perform
suggest appropriate/cost-effective treatments
react to the MD’s proposed treatment
Decision Support: find/repair faults in complex machines [device, manufacturing plant, …] … based on sensors, recorded info, history, …
Preventative Maintenance: anticipate problems in complex machines [device, manufacturing plant, …] … based on sensors, statistics, recorded info, device history, …
Uses (con’t)
Logistics Support: stock warehouses appropriately … based on (estimated) frequency of needs, costs, …
Diagnose Software: find the most probable bugs, given program behavior, core dump, source code, …
Part Inspection/Classification: … based on multiple sensors, background, model of production, …
Information Retrieval: combine information from various sources, based on info from various “agents”, …
General: partial info, sensor fusion; classification, interpretation, prediction, …
Challenge #1: Computational Efficiency
For a given BN, the general problem is:
Given O1 = v1, …, On = vn, compute P( H | O1 = v1, …, On = vn )
+ If the BN is a “polytree”: efficient algorithm
− If the BN is a general DAG (>1 path from X to Y): NP-hard in theory, slow in practice
Tricks: get an approximate answer (quickly); use an abstraction of the BN; use an “abstraction” of the query (range)
[Network: D, I → H → J, B]
#2a: Obtaining an Accurate BN
A BN encodes a distribution over n variables:
not O(2^n) values, but “only” Σi 2^{ki} (node ni binary, with ki parents). Still lots of values!
Qualitative information (structure: “what depends on what?”):
• Easy for people (background knowledge)
• But NP-hard to learn from samples…
Quantitative information (the actual CP-tables):
• Easy to learn, given lots of examples
• But people have a hard time…
Knowledge acquisition: from human experts
Simple learning algorithm
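The quantitative part really is simple to learn from complete data: each CP-table entry is a conditional frequency. A sketch for the two-node net H → J (the samples are invented):

```python
# Complete samples as (h, j) pairs.
samples = [(1, 1), (1, 1), (1, 0), (0, 0),
           (0, 0), (0, 1), (0, 0), (1, 1)]

def estimate(samples):
    """Return (P(H=1), P(J=1 | H=1), P(J=1 | H=0)) by counting."""
    n = len(samples)
    n_h1 = sum(h for h, _ in samples)
    p_h1 = n_h1 / n
    p_j1_h1 = sum(j for h, j in samples if h == 1) / n_h1
    p_j1_h0 = sum(j for h, j in samples if h == 0) / (n - n_h1)
    return p_h1, p_j1_h1, p_j1_h0

p_h1, p_j1_h1, p_j1_h0 = estimate(samples)
```

Structure learning is the hard part; filling in tables for a fixed structure is just bookkeeping like this (plus smoothing for small counts).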
Notes on Learning
Mixed sources: a person provides the structure; an algorithm fills in the numbers.
Pure learning: algorithms that learn both structure and values from samples.
Pure human expert: people produce the CP-tables as well as the structure.
Relatively few values are really required, esp. if NoisyOr, NoisyAnd, NaiveBayes, …
Actual values are not that important … sensitivity studies.
My Current Work
Learning Belief Nets
Model selection: challenging the myth that MDL is an appropriate criterion
Learning a “performance system”, not a model
Validating Belief Nets: “error bars” around answers
Adaptive user interfaces; efficient vision systems; foundations of learnability
Learning active classifiers; sequential learners
Condition-based maintenance, bio-signal interpretation, …
#2b: Maintaining an Accurate BN
The world changes. The information in a BN may be
perfect at time t
sub-optimal at time t + 20
worthless at time t + 200
Need to MAINTAIN a BN over time:
using an on-going human consultant
adaptive BNs: Dirichlet distributions (per variable); priors over BNs
Conclusions
Belief Nets are PROVEN TECHNOLOGY:
Medical diagnosis; DSS for complex machines; forecasting, modeling, information retrieval, …
They provide an effective way to represent complicated, inter-related events, and to reason about such situations:
• Diagnosis, explanation, value of information
• Explaining conclusions
• Mixing symbolic and numeric observations
Challenges: efficient ways to use BNs; ways to create BNs; ways to maintain BNs; reasoning about time
Extra Slides
AI Seminar: Friday, noon, CSC 3-33. Free PIZZA! http://www.cs.ualberta.ca/~ai/ai-seminar.html
References: http://www.cs.ualberta.ca/~greiner/bn.html
Extra slides: Crusher Controller; Formal Framework; Decision Nets; Developing the Model; Why Reasoning is Hard; Learning Accurate Belief Nets
References
• http://www.cs.ualberta.ca/~greiner/bn.html
• Overview textbooks:
Judea Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995. (See esp. Ch. 14, 15, 19.)
• General info re Bayes nets: http://www.afit.af.mil:80/Schools/EN/ENG/LABS/AI/BayesianNetworks
Proceedings: http://www.sis.pitt.edu/~dsl/uai.html
Assoc. for Uncertainty in AI: http://www.auai.org/
• Learning: David Heckerman, A tutorial on learning with Bayesian networks, 1995, http://www.research.microsoft.com/research/dtg/heckerma/TR-95-06.htm
• Software:
General: http://bayes.stat.washington.edu/almond/belief.html
JavaBayes: http://www.cs.cmu.edu/~fgcozman/Research/JavaBayes
Norsys: http://www.norsys.com/
Decision Net: Test/Buy a Car
Utility: Decision Nets
Given c(action, state) ∈ ℝ (cost function):
Cp(a) = Es[ c(a, s) ] = Σ_{s∈S} p(s | obs) · c(a, s)
Best (immediate) action: a* = argmin_{a∈A} { Cp(a) }
Decision Net (like a Belief Net) but with 3 types of nodes: chance (as in a belief net); action (repair, sensing); cost/utility
Links for “dependency”; given observations obs, compute the best action a*
Sequence of actions: MDPs, POMDPs, …
Go Back
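The argmin above can be sketched in a few lines (the states, actions, probabilities, and costs are all invented for illustration):

```python
# a* = argmin_a sum_s p(s | obs) * c(a, s)
p_state = {'ok': 0.70, 'worn': 0.25, 'broken': 0.05}   # p(s | obs)
cost = {('continue', 'ok'): 0.0,  ('continue', 'worn'): 40.0, ('continue', 'broken'): 500.0,
        ('repair',   'ok'): 60.0, ('repair',   'worn'): 60.0, ('repair',   'broken'): 80.0}

def expected_cost(a):
    """Cp(a): cost of action a, averaged over the state distribution."""
    return sum(p * cost[(a, s)] for s, p in p_state.items())

actions = sorted({a for a, _ in cost})
best = min(actions, key=expected_cost)   # 'continue': 35.0 vs 61.0 for 'repair'
```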
Decision Net: Drill for Oil?
Go Back
Formal Framework
Always true:
P(x1, …, xn) = P(x1) · P(x2 | x1) · P(x3 | x2, x1) … P(xn | xn−1, …, x1)
Given the independencies, P(xk | x1, …, xk−1) = P(xk | pak) for some pak ⊆ {x1, …, xk−1}.
Hence
P(x1, …, xn) = Πi P(xi | pai)
So just connect each y ∈ pai to xi … DAG structure.
Note:
- The size of the BN is O( Σi 2^{|pai|} ), so better to use small pai.
- pai = {x1, …, xi−1} is never incorrect … but seldom minimal … (so hard to store, learn, reason with, …)
- The order of the variables can make a HUGE difference: can have |pai| = 1 for one ordering, |pai| = i − 1 for another.
Go Back
Developing the Model
Sources of information:
+ (Human) expert(s)
+ Data from earlier runs
+ Simulator
Typical process:
1. Develop / refine initial prototype
2. Test prototype ↦ accurate system
3. Deploy system
4. Update / maintain system
Develop/Refine Prototype
Requires an expert; useful to have data.
Initial interview(s): to establish “what relates to what”. Expert time: ≈ ½ day.
Iterative process (gradual refinement): to refine qualitative connections; to establish correct operations.
Expert presents “good performance”; KE implements the expert’s claims; KE tests on examples (real data or expert), and reports back to the expert.
Expert time: ≈ 1–2 hours / week for ?? weeks (depends on the complexity of the device and the accuracy of the model).
Go Back
Why Reasoning is Hard
BN reasoning may look easy: just “propagate” information from node to node.
Challenge [Z → A, Z → B; A, B → C]: What is P(C=t)?
A = Z, B = ¬Z, C = A ∧ B; P(A=t) = P(B=f) = ½.
So…? P(C=t) = P(A=t, B=t) = P(A=t) · P(B=t) = ½ · ½ = ¼
Wrong: P(C=t) = 0!
Need to maintain dependencies! P(A=t, B=t) = P(A=t) · P(B=t | A=t)

P(Z=t)
0.5

z  P(A=t | Z=z)
t  1.0
f  0.0

z  P(B=t | Z=z)
t  0.0
f  1.0

a b  P(C=t | a, b)
t t  1.0
t f  0.0
f t  0.0
f f  0.0
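The trap is small enough to check exhaustively (a sketch; variable names are mine):

```python
# Z ~ Bernoulli(1/2), A = Z, B = not Z, C = A AND B.
def p_c_true():
    total = 0.0
    for z in (0, 1):                          # Z is the only source of randomness
        a, b = z, 1 - z
        c = 1 if (a == 1 and b == 1) else 0   # always 0: A, B are never both true
        total += 0.5 * c
    return total

naive = 0.5 * 0.5    # treats A and B as independent: 1/4, wrong
exact = p_c_true()   # 0.0
```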
Go Back
Crusher Controller
Given observations (history, sensor readings, schedule, …), specify the best action for the crusher:
“stop immediately”, “increase roller speed by …”, …
Best == minimize expected cost …
Initially: just a recommendation to the human operator; later: directly implement (some) actions
? Request values of other sensors?
Approach
1. For each state s (“good flow”, “tooth about to enter”, …) and each action a (“stop immediately”, “change p7 += 0.32”, …), determine the utility of performing a in s (cost of lost production if stopped; … of reduced production efficiency if we continue; …).
2. Use the observations to estimate a (distribution over) current states, and infer the EXPECTED UTILITY of each action, based on that distribution.
3. Return the action with the highest expected utility.
Details
Inputs:
Sensor readings (history): camera, microphone, power-draw
Parameter settings
Log files, maintenance records
Schedule (maintenance, anticipated load, …)
Outputs:
Continue as is
Adjust parameters: GapSize, ApronFeederSpeed, 1J_ConveyorSpeed
Shut down immediately
Stop adding new material
Tell operator to look
State: “CrusherEnvironment”: #UncrushableThingsNowInCrusher, #TeethMissing, NextUncrushableEntry, control parameters
Benefits
Increase crusher effectiveness: find the best settings for parameters, to maximize production of well-sized chunks
Reduce down time: know when maintenance/repair is critical
Reduce damage to the crusher
Usable model of the crusher: easy to modify when needed; training; design of the next generation
Prototype for the design of {control, diagnostician} for other machines
Go Back
My Background
PhD, Stanford (Computer Science): representational issues, analogical inference … everything in Logic
PostDoc at UofToronto (CS): foundations of learnability, logical inference, DB, control theory, … … everything in Logic
Industrial research (Siemens Corporate Research): need to solve REAL problems; theory revision, navigational systems, … … logic is not the be-all-and-end-all!
Prof at UofAlberta (CS): industrial problems (Siemens, BioTools, Syncrude); foundations of learnability, probabilistic inference, …