1
Exploiting Parameter Domain Knowledge for
Learning in Bayesian Networks
Thesis Committee:
Tom Mitchell (Chair)
John Lafferty
Andrew Moore
Bharat Rao (Siemens Medical Solutions)
~ Thesis Defense ~
Stefan Niculescu
Carnegie Mellon University, July 2005
2
Domain Knowledge
• In the real world, data is often too sparse to build an accurate model
• Domain knowledge can help alleviate this problem
• Several types of domain knowledge:
  – Relevance of variables (feature selection)
  – Conditional independences among variables
  – Parameter Domain Knowledge
3
Parameter Domain Knowledge
• In a Bayesian Network for a real-world domain:
  – there can be a huge number of parameters
  – there is not enough data to estimate them accurately
• Parameter Domain Knowledge constraints:
  – reduce the space of feasible parameters
  – reduce the variance of parameter estimates
4
Parameter Domain Knowledge
Examples:
• DK: “If a person has a Family history of Heart Attack, Race and Pollution are not significant factors for the probability of getting a Heart Attack.”
• DK: “Two voxels in the brain may exhibit the same activation patterns during a cognitive task, but with different amplitudes.”
• DK: “Two countries may have different Heart Disease rates, but the relative proportion of Heart Attack to CHF is the same.”
• DK: “The aggregate probability of Adverbs in English is less than the aggregate probability of Verbs.”
5
Thesis
Standard methods for performing parameter estimation in Bayesian Networks can be naturally extended to take advantage of parameter domain knowledge provided by a domain expert. These new learning algorithms perform better (in terms of probability density estimation) than existing ones.
6
Outline
• Motivation
Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
7
Parameter Domain Knowledge Framework ~ Domain Knowledge Constraints ~
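In generic form (a reconstruction from the rest of the talk, with $g_i$ and $h_j$ standing for expert-supplied constraint functions), parameter domain knowledge restricts the feasible parameters $\theta$ of the network to:

$$g_i(\theta) = 0, \;\; i = 1, \dots, m \qquad\qquad h_j(\theta) \le 0, \;\; j = 1, \dots, k$$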
8
Parameter Domain Knowledge Framework ~ Frequentist Approach, Complete Data ~
9
Parameter Domain Knowledge Framework ~ Frequentist Approach, Complete Data ~
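For discrete variables with complete data the log-likelihood decomposes into observed counts, so learning becomes (a reconstruction consistent with the estimators on later slides; $N_i$ is the observed count for the event governed by $\theta_i$):

$$\hat\theta = \arg\max_\theta \sum_i N_i \log \theta_i \quad \text{s.t.} \quad g(\theta) = 0, \; h(\theta) \le 0$$

With equality constraints alone, this can often be solved in closed form via Lagrange multipliers.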
10
Parameter Domain Knowledge Framework ~ Frequentist Approach, Incomplete Data ~
EM Algorithm. Repeat until convergence:
11
Parameter Domain Knowledge Framework ~ Frequentist Approach, Incomplete Data ~
~ Discrete Variables ~
EM Algorithm. Repeat until convergence:
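In sketch form (a reconstruction from the framework above; $\tilde N_i$ denotes expected counts), each iteration is standard EM with the M-step constrained by the domain knowledge:

$$\text{E-step:} \quad \tilde N_i = E\big[N_i \mid D, \theta^{(t)}\big] \qquad \text{M-step:} \quad \theta^{(t+1)} = \arg\max_\theta \sum_i \tilde N_i \log \theta_i \;\; \text{s.t.} \;\; g(\theta) = 0,\; h(\theta) \le 0$$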
12
Parameter Domain Knowledge Framework ~ Bayesian Approach ~
13
Parameter Domain Knowledge Framework ~ Bayesian Approach ~
14
Parameter Domain Knowledge Framework ~ Computing the Normalization Constant ~
15
Parameter Domain Knowledge Framework ~ Computing the Normalization Constant ~
16
Outline
• Motivation
• Parameter Domain Knowledge Framework
Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
17
Simple Parameter Sharing ~ Maximum Likelihood Estimators ~
Theorem. Suppose each shared parameter θ_i appears in k_i places of the distribution, let N_i be the aggregate count observed over those k_i places, and let N = Σ_j N_j be the total number of observations. Then the Maximum Likelihood parameters are given by:

$$\hat\theta_i = \frac{N_i}{k_i \, N}$$
Example: a cubical die cut symmetrically at each corner has k_1 = 6 square faces sharing one parameter and k_2 = 8 triangular corner faces sharing another.
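As a sketch of how this estimator is computed (hypothetical helper; `groups` maps each shared parameter to the outcome indices that share it):

```python
import numpy as np

def shared_mle(counts, groups):
    """ML estimates under simple parameter sharing: theta_i = N_i / (k_i * N).

    counts: observed count for each outcome of the distribution.
    groups: list of index lists; all outcomes in one group share one parameter.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    return [counts[g].sum() / (len(g) * N) for g in groups]

# Cut-corner die: 6 square faces share theta_1, 8 corner faces share theta_2.
counts = [50, 48, 52, 49, 51, 50,  20, 21, 19, 20, 20, 21, 19, 20]
groups = [list(range(6)), list(range(6, 14))]
print(shared_mle(counts, groups))  # [~0.109, ~0.043]; 6*0.109 + 8*0.043 = 1
```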
18
Simple Parameter Sharing ~ Dependent Dirichlet Priors ~
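A conjugate prior for this sharing constraint keeps the Dirichlet form but lives on the constrained simplex (a reconstruction; α_i are hyperparameters):

$$P(\theta) \propto \prod_i \theta_i^{\alpha_i - 1} \quad \text{on} \quad \Big\{\theta : \textstyle\sum_i k_i \theta_i = 1,\; \theta_i \ge 0 \Big\}$$

Multiplying by the likelihood $\prod_i \theta_i^{N_i}$ yields the same family with $\alpha_i + N_i$, so the prior is conjugate even though the parameters are dependent.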
19
Simple Parameter Sharing ~ Variance Reduction in Parameter Estimates ~
20
Simple Parameter Sharing ~ Experiments – Learning a Probability Distribution ~
• Synthetic dataset:
  – Probability distribution over 50 values
  – 50 randomly generated parameters:
    • 6 shared parameters (each appearing between 2 and 5 times), together accounting for about half of the values
    • The rest “not shared” (appearing exactly once)
  – 1000 examples sampled from this distribution
  – Purpose:
    • Domain knowledge readily available
    • Study the effect of training set size (up to 1000 examples)
    • Compare our estimated distribution to the true distribution (via KL divergence; helper sketched below)
• Models:
  – STBN (Standard Bayesian Network)
  – PDKBN (Bayesian Network with PDK)
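Estimated distributions are scored against the true one by KL divergence (the metric used on the next slide); a minimal helper, with hypothetical names:

```python
import numpy as np

def kl_divergence(p_true, q_est, eps=1e-12):
    """KL(p || q) in nats; eps guards against zero estimates in q."""
    p = np.asarray(p_true, dtype=float)
    q = np.clip(np.asarray(q_est, dtype=float), eps, None)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```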
21
Experimental Results
• PDKBN consistently outperforms STBN; the gap shrinks as the training set grows, but PDKBN is much better when training data is scarce
  – Largest difference in KL divergence: 0.05 (at 30 examples)
• On average, STBN needs 1.86 times more examples to catch up in KL divergence!
  – 40 examples (PDKBN) ≈ 103 (STBN)
  – 200 examples (PDKBN) ≈ 516 (STBN)
  – 650 examples (PDKBN) ≈ >1000 (STBN)
22
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
23
Hidden Process Models
• One observation (trial)
• N different trials
• All trials and all processes have equal length T
24
Parameter Sharing in HPMs
• similar shape of activity across voxels
• different amplitudes

The observation of voxel v at time t is a noisy, amplitude-scaled superposition of the two process signatures (t'_1, t'_2 are the onset times of the two processes):

$$X_{vt} \sim \mathcal{N}\big( c_{v1}\, P_1(t - t'_1) + c_{v2}\, P_2(t - t'_2),\ \sigma^2 \big)$$

where P_1, P_2 are the shared process shapes and c_{v1}, c_{v2} are voxel-specific amplitude constants.
25
Parameter Sharing in HPMs ~ Maximum Likelihood Estimation ~
l’(P, C) is quadratic in (P, C) jointly, but:
• for fixed C, maximizing over P is a linear least-squares problem
• for fixed P, maximizing over C is a linear least-squares problem
so the likelihood can be maximized by alternating between the two closed-form updates (sketched below).
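A minimal sketch of the alternating scheme, assuming for simplicity that both processes start at t = 0, so the model reduces to X ≈ C·P with C the V×2 amplitude matrix and P the 2×T process shapes (all names hypothetical):

```python
import numpy as np

def fit_hpm_als(X, n_iter=50, seed=0):
    """Alternating least squares for X (V voxels x T time) ~ C @ P.

    Each subproblem is linear: solve for P with C fixed, then for C with P fixed.
    """
    rng = np.random.default_rng(seed)
    V, T = X.shape
    C = rng.standard_normal((V, 2))
    for _ in range(n_iter):
        P, *_ = np.linalg.lstsq(C, X, rcond=None)       # fix C, solve for P (2 x T)
        C = np.linalg.lstsq(P.T, X.T, rcond=None)[0].T  # fix P, solve for C (V x 2)
    return C, P
```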
26
Parameter Sharing in HPMs ~ Maximum Likelihood Estimation ~
27
Starplus Dataset
• Trial:
– read sentence
– view picture
– answer whether sentence describes picture
• 40 trials, each 32 time slices long (images acquired at 2/sec)
  – picture presented first in half of the trials
  – sentence presented first in the other half
• Three possible objects: star, dollar, plus
• Collected by Just et al.
• IDEA: model the data using HPMs with two processes:
  – “Sentence” and “Picture”
  – We assume a process starts when its stimulus is presented
  – We will use shared HPMs where possible
28
Example sentence stimulus: “It is true that the star is above the plus?”
29
30
[Picture stimulus: a “+” shown above a “*”]
31
32
Parameter Sharing in HPMs ~ Hierarchical Partitioning Algorithm ~
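A top-down sketch of the idea (hypothetical callables: `score` returns the data likelihood of one HPM shared by a voxel set, `split_spatially` divides a voxel set into two spatially contiguous halves):

```python
def hierarchical_partition(voxels, score, split_spatially, min_size=2):
    """Top-down search for voxel clusters that share one HPM.

    Keep a cluster whole if a single shared HPM scores at least as well
    as the best recursive partition of its two spatial halves.
    Returns a list of voxel clusters.
    """
    if len(voxels) < min_size:
        return [voxels]
    whole_score = score(voxels)
    left, right = split_spatially(voxels)
    left_parts = hierarchical_partition(left, score, split_spatially, min_size)
    right_parts = hierarchical_partition(right, score, split_spatially, min_size)
    split_score = sum(score(p) for p in left_parts + right_parts)
    if whole_score >= split_score:
        return [voxels]
    return left_parts + right_parts
```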
33
Parameter Sharing in HPMs ~ Experiments ~
• We compare three models, based on average (per-trial) likelihood:
  – StHPM – standard, per-voxel HPM
  – ShHPM – one HPM shared by all voxels in an ROI (24 ROIs total)
  – HieHPM – hierarchical HPM
• Effect of training set size (6 to 40 examples) in CALC:
  – ShHPM is biased here:
    • better than StHPM at small sample sizes
    • worse at 40 examples
  – HieHPM is the best:
    • it can represent both of the other models
    • e^106 times better data likelihood than StHPM at 40 examples
    • StHPM needs 2.9 times more examples to catch up
34
Parameter Sharing in HPMs ~ Experiments ~
Performance over the whole brain (40 examples):
• HieHPM – the best:
  – e^1792 times better data likelihood than StHPM
  – better than StHPM in 23/24 ROIs
  – better than ShHPM in 12/24 ROIs, equal in 11/24
• ShHPM – second best:
  – e^464 times better data likelihood than StHPM
  – better than StHPM in 18/24 ROIs
  – it is biased, but it makes sense to share whole ROIs that are not involved in the cognitive task
35
Learned Voxel Clusters
• In the whole brain: ~300 clusters, ~15 voxels/cluster
• In CALC: ~60 clusters, ~5 voxels/cluster
36
Sentence Process in CALC
37
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
38
Parameter Domain Knowledge Types
• DISCRETE:
  – Known Parameter Values
  – Parameter Sharing and Proportionality Constants – One Distribution
  – Sum Sharing and Ratio Sharing – One Distribution
  – Parameter Sharing and Hierarchical Sharing – Multiple Distributions
  – Sum Sharing and Ratio Sharing – Multiple Distributions
• CONTINUOUS (Gaussian Distributions):
  – Parameter Sharing and Proportionality Constants – One Distribution
  – Parameter Sharing in Hidden Process Models
• INEQUALITY CONSTRAINTS:
  – Between Sums of Parameters – One Distribution
  – Upper Bounds on Sums of Parameters – One Distribution
39
Probability Ratio Sharing
• Want to model P(Word | Language)
• Two languages: English, Spanish
• Different sets of words
• Domain Knowledge:
  – Word groups, e.g.:
    • T1 – computer words: computer, keyboard, monitor, etc.
    • T2 – business words
  – The relative frequency of “computer” to “keyboard” is the same in both languages
  – The aggregate mass of each group can differ between languages

[Figure: the two distributions c_1 = P(Word | English) and c_2 = P(Word | Spanish), with the shared word groups highlighted]
40
Probability Ratio Sharing
Consider several discrete distributions over the same k values, arranged as a table of parameters p_ij (row i = distribution, column j = value):

p_11  p_12  ...  p_1k
p_21  p_22  ...  p_2k
 ...
p_51  p_52  ...  p_5k

DK: Parameters of a given color (group) preserve their relative ratios across all distributions; for example, for distributions 1 and 2:

$$\frac{p_{11}}{p_{21}} = \frac{p_{12}}{p_{22}} = \cdots = \frac{p_{1k}}{p_{2k}}$$
41
Proportionality Constants for Gaussians
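In the spirit of the HPM amplitude constants above, this constraint ties several Gaussian means together through known proportionality constants (a reconstructed sketch; $c_i$ are the expert-supplied constants):

$$X_i \sim \mathcal{N}(c_i\,\mu,\ \sigma^2), \qquad i = 1, \dots, n,$$

so all of the variables' data contributes to estimating the single free mean $\mu$.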
42
Inequalities between Sums of Parameters
In spoken language:
• Each Adverb comes along with a Verb
• Each Adjective comes with a Noun or a Pronoun

Therefore it is reasonable to expect that:
• The frequency of Adverbs is less than that of Verbs
• The frequency of Adjectives is less than that of Nouns and Pronouns

Equivalently: P(Adverb) ≤ P(Verb) and P(Adjective) ≤ P(Noun) + P(Pronoun).

In general, within the same distribution, for disjoint sets of values A and B:

$$\sum_{i \in A} \theta_i \ \le \ \sum_{j \in B} \theta_j$$

(a sketch of the resulting estimator follows)
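A sketch of the constrained ML estimate for this case, under the assumption that groups A and B together cover all outcomes: if the unconstrained estimates violate the inequality, the KKT conditions make it active, giving each group half the mass, divided within each group proportionally to the counts (hypothetical names):

```python
import numpy as np

def mle_with_sum_inequality(counts, A, B):
    """ML estimates of a discrete distribution s.t. sum(theta[A]) <= sum(theta[B]).

    Assumes A and B together cover all outcomes. If the unconstrained
    MLE theta_i = N_i / N already satisfies the constraint, keep it;
    otherwise the constraint is active: each group gets mass 1/2,
    divided within the group proportionally to the counts.
    """
    counts = np.asarray(counts, dtype=float)
    theta = counts / counts.sum()
    if theta[A].sum() <= theta[B].sum():
        return theta
    theta = np.empty_like(counts)
    theta[A] = 0.5 * counts[A] / counts[A].sum()
    theta[B] = 0.5 * counts[B] / counts[B].sum()
    return theta
```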
43
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
Related Work
• Summary / Future Work
44
Dirichlet Priors in a Bayes Net
[Figure: a Dirichlet prior over the parameters; its mean encodes the prior belief, its variance the spread]

• The domain expert specifies an assignment of the parameters
  – the prior leaves room for some error (variance)
• Several types:
  – Standard Dirichlet
  – Dirichlet Tree Priors
  – Dependent Dirichlet
45
Markov Models
[Figure: a Markov chain ... → W_{t-1} → W_t → W_{t+1} → ...; the transition distribution P(W_t | W_{t-1}) is shared across all time steps]
46
Module Networks

In a Module:
• Same parents
• Same CPTs
Image from “Learning Module Networks” by Eran Segal and Daphne Koller
47
Context Specific Independence
[Figure: an example network with nodes Alarm, Set, and Burglary, illustrating context-specific independence]
48
Limitations of Current Models
• Dirichlet priors:
  – When the number of parameters is huge, specifying a useful prior is difficult
  – Unable to enforce even simple constraints: additional hyperparameters are needed to enforce basic parameter sharing, and then no closed-form MAP estimates can be computed!
  – Dependent Dirichlet Priors are not conjugate priors
    • Our priors are both dependent and conjugate!
• Markov Models, Module Networks and CSI:
  – Particular cases of our Parameter Sharing domain knowledge
  – Do not allow sharing at the granularity of individual parameters
49
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
Summary / Future Work
50
Summary
• Parameter-related domain knowledge is needed when data is scarce:
  – it reduces the number of free parameters
  – it reduces the variance of parameter estimates (illustrated on Simple Parameter Sharing)
• Developed a unified Parameter Domain Knowledge framework
  – from both a frequentist and a Bayesian point of view
  – for both complete and incomplete data
• Developed efficient learning algorithms for several types of PDK:
  – closed-form solutions for most of these types
  – for both discrete and continuous variables
  – for both equality and inequality constraints
  – Markov Models, Module Networks and Context-Specific Independence are particular cases of our parameter sharing framework
• Developed a method for automatically learning the domain knowledge (illustrated on HPMs)
• Experiments show the superiority of models using PDK
51
Future Work
• Interactions among different types of Parameter Domain Knowledge
• Incorporate Parameter Domain Knowledge in Structure Learning
• Hard vs. Soft constraints
• Parameter Domain Knowledge for learning Undirected Graphical Models
52
Questions?
53
THE END