1
Exploiting Parameter Domain Knowledge for
Learning in Bayesian Networks
Thesis Committee:
Tom Mitchell (Chair)
John Lafferty
Andrew Moore
Bharat Rao (Siemens Medical Solutions)
~ Thesis Defense ~
Stefan Niculescu
Carnegie Mellon University, July 2005
2
Domain Knowledge
• In the real world, data is often too sparse to build an accurate model
• Domain knowledge can help alleviate this problem
• Several types of domain knowledge:
  – Relevance of variables (feature selection)
  – Conditional independences among variables
  – Parameter Domain Knowledge
3
Parameter Domain Knowledge
• In a Bayesian Network for a real-world domain:
  – there can be a huge number of parameters
  – there is not enough data to estimate them accurately
• Parameter Domain Knowledge constraints:
  – reduce the space of feasible parameters
  – reduce the variance of parameter estimates
4
Parameter Domain Knowledge
Examples:
• DK: “If a person has a Family history of Heart Attack, Race and Pollution are not significant factors for the probability of getting a Heart Attack.”
• DK: “Two voxels in the brain may exhibit the same activation patterns during a cognitive task, but with different amplitudes.”
• DK: “Two countries may have different Heart Disease rates, but the relative proportion of Heart Attack to CHF is the same.”
• DK: “The aggregate probability of Adverbs in English is less than the aggregate probability of Verbs.”
5
Thesis
Standard methods for performing parameter estimation in Bayesian Networks can be naturally extended to take advantage of parameter domain knowledge provided by a domain expert. These new learning algorithms perform better (in terms of probability density estimation) than existing ones.
6
Outline
• Motivation
Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
7
Parameter Domain Knowledge Framework ~ Domain Knowledge Constraints ~
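In generic form (a reconstruction from the rest of the talk, with $g_i$ and $h_j$ standing for expert-supplied constraint functions), parameter domain knowledge restricts the feasible parameters $\theta$ of the network to:

$$g_i(\theta) = 0, \;\; i = 1, \dots, m \qquad\qquad h_j(\theta) \le 0, \;\; j = 1, \dots, k$$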
8
Parameter Domain Knowledge Framework ~ Frequentist Approach, Complete Data ~
9
Parameter Domain Knowledge Framework ~ Frequentist Approach, Complete Data ~
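For discrete variables with complete data the log-likelihood decomposes into observed counts, so learning becomes (a reconstruction consistent with the estimators on later slides; $N_i$ is the observed count for the event governed by $\theta_i$):

$$\hat\theta = \arg\max_\theta \sum_i N_i \log \theta_i \quad \text{s.t.} \quad g(\theta) = 0, \; h(\theta) \le 0$$

With equality constraints alone, this can often be solved in closed form via Lagrange multipliers.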
10
Parameter Domain Knowledge Framework ~ Frequentist Approach, Incomplete Data ~
EM Algorithm. Repeat until convergence:
11
Parameter Domain Knowledge Framework ~ Frequentist Approach, Incomplete Data ~
~ Discrete Variables ~
EM Algorithm. Repeat until convergence:
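In sketch form (a reconstruction from the framework above; $\tilde N_i$ denotes expected counts), each iteration is standard EM with the M-step constrained by the domain knowledge:

$$\text{E-step:} \quad \tilde N_i = E\big[N_i \mid D, \theta^{(t)}\big] \qquad \text{M-step:} \quad \theta^{(t+1)} = \arg\max_\theta \sum_i \tilde N_i \log \theta_i \;\; \text{s.t.} \;\; g(\theta) = 0,\; h(\theta) \le 0$$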
12
Parameter Domain Knowledge Framework ~ Bayesian Approach ~
13
Parameter Domain Knowledge Framework ~ Bayesian Approach ~
14
Parameter Domain Knowledge Framework ~ Computing the Normalization Constant ~
15
Parameter Domain Knowledge Framework ~ Computing the Normalization Constant ~
16
Outline
• Motivation
• Parameter Domain Knowledge Framework
Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
17
Simple Parameter Sharing ~ Maximum Likelihood Estimators ~
Theorem. Suppose each shared parameter θ_i appears in k_i places of the distribution, let N_i be the aggregate count observed over those k_i places, and let N = Σ_j N_j be the total number of observations. Then the Maximum Likelihood parameters are given by:

$$\hat\theta_i = \frac{N_i}{k_i \, N}$$
Example: a cubical die cut symmetrically at each corner has k_1 = 6 square faces sharing one parameter and k_2 = 8 triangular corner faces sharing another.
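As a sketch of how this estimator is computed (hypothetical helper; `groups` maps each shared parameter to the outcome indices that share it):

```python
import numpy as np

def shared_mle(counts, groups):
    """ML estimates under simple parameter sharing: theta_i = N_i / (k_i * N).

    counts: observed count for each outcome of the distribution.
    groups: list of index lists; all outcomes in one group share one parameter.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    return [counts[g].sum() / (len(g) * N) for g in groups]

# Cut-corner die: 6 square faces share theta_1, 8 corner faces share theta_2.
counts = [50, 48, 52, 49, 51, 50,  20, 21, 19, 20, 20, 21, 19, 20]
groups = [list(range(6)), list(range(6, 14))]
print(shared_mle(counts, groups))  # [~0.109, ~0.043]; 6*0.109 + 8*0.043 = 1
```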
18
Simple Parameter Sharing ~ Dependent Dirichlet Priors ~
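A conjugate prior for this sharing constraint keeps the Dirichlet form but lives on the constrained simplex (a reconstruction; α_i are hyperparameters):

$$P(\theta) \propto \prod_i \theta_i^{\alpha_i - 1} \quad \text{on} \quad \Big\{\theta : \textstyle\sum_i k_i \theta_i = 1,\; \theta_i \ge 0 \Big\}$$

Multiplying by the likelihood $\prod_i \theta_i^{N_i}$ yields the same family with $\alpha_i + N_i$, so the prior is conjugate even though the parameters are dependent.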
19
Simple Parameter Sharing ~ Variance Reduction in Parameter Estimates ~
20
Simple Parameter Sharing ~ Experiments – Learning a Probability Distribution ~
• Synthetic dataset:
  – Probability distribution over 50 values
  – 50 randomly generated parameters:
    • 6 shared parameters (each appearing between 2 and 5 times), together accounting for about half of the values
    • The rest “not shared” (appearing exactly once)
  – 1000 examples sampled from this distribution
  – Purpose:
    • Domain knowledge readily available
    • Study the effect of training set size (up to 1000 examples)
    • Compare our estimated distribution to the true distribution (via KL divergence; helper sketched below)
• Models:
  – STBN (Standard Bayesian Network)
  – PDKBN (Bayesian Network with PDK)
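Estimated distributions are scored against the true one by KL divergence (the metric used on the next slide); a minimal helper, with hypothetical names:

```python
import numpy as np

def kl_divergence(p_true, q_est, eps=1e-12):
    """KL(p || q) in nats; eps guards against zero estimates in q."""
    p = np.asarray(p_true, dtype=float)
    q = np.clip(np.asarray(q_est, dtype=float), eps, None)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```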
21
Experimental Results
• PDKBN consistently outperforms STBN; the gap shrinks as the training set grows, but PDKBN is much better when training data is scarce
  – Largest difference in KL divergence: 0.05 (at 30 examples)
• On average, STBN needs 1.86 times more examples to catch up in KL divergence!
  – 40 examples (PDKBN) ≈ 103 (STBN)
  – 200 examples (PDKBN) ≈ 516 (STBN)
  – 650 examples (PDKBN) ≈ >1000 (STBN)
22
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
23
Hidden Process Models
• One observation (trial)
• N different trials
• All trials and all processes have equal length T
24
Parameter Sharing in HPMs
• similar shape of activity across voxels
• different amplitudes

The observation of voxel v at time t is a noisy, amplitude-scaled superposition of the two process signatures (t'_1, t'_2 are the onset times of the two processes):

$$X_{vt} \sim \mathcal{N}\big( c_{v1}\, P_1(t - t'_1) + c_{v2}\, P_2(t - t'_2),\ \sigma^2 \big)$$

where P_1, P_2 are the shared process shapes and c_{v1}, c_{v2} are voxel-specific amplitude constants.
25
Parameter Sharing in HPMs ~ Maximum Likelihood Estimation ~
l’(P, C) is quadratic in (P, C) jointly, but:
• for fixed C, maximizing over P is a linear least-squares problem
• for fixed P, maximizing over C is a linear least-squares problem
so the likelihood can be maximized by alternating between the two closed-form updates (sketched below).
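A minimal sketch of the alternating scheme, assuming for simplicity that both processes start at t = 0, so the model reduces to X ≈ C·P with C the V×2 amplitude matrix and P the 2×T process shapes (all names hypothetical):

```python
import numpy as np

def fit_hpm_als(X, n_iter=50, seed=0):
    """Alternating least squares for X (V voxels x T time) ~ C @ P.

    Each subproblem is linear: solve for P with C fixed, then for C with P fixed.
    """
    rng = np.random.default_rng(seed)
    V, T = X.shape
    C = rng.standard_normal((V, 2))
    for _ in range(n_iter):
        P, *_ = np.linalg.lstsq(C, X, rcond=None)       # fix C, solve for P (2 x T)
        C = np.linalg.lstsq(P.T, X.T, rcond=None)[0].T  # fix P, solve for C (V x 2)
    return C, P
```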
26
Parameter Sharing in HPMs ~ Maximum Likelihood Estimation ~
27
Starplus Dataset
• Trial:
– read sentence
– view picture
– answer whether sentence describes picture
• 40 trials, each 32 time slices long (images acquired at 2/sec)
  – picture presented first in half of the trials
  – sentence presented first in the other half
• Three possible objects: star, dollar, plus
• Collected by Just et al.
• IDEA: model the data using HPMs with two processes:
  – “Sentence” and “Picture”
  – We assume a process starts when its stimulus is presented
  – We will use shared HPMs where possible
28
Example sentence stimulus: “It is true that the star is above the plus?”
29
30
[Picture stimulus: a “+” shown above a “*”]
31
32
Parameter Sharing in HPMs ~ Hierarchical Partitioning Algorithm ~
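A top-down sketch of the idea (hypothetical callables: `score` returns the data likelihood of one HPM shared by a voxel set, `split_spatially` divides a voxel set into two spatially contiguous halves):

```python
def hierarchical_partition(voxels, score, split_spatially, min_size=2):
    """Top-down search for voxel clusters that share one HPM.

    Keep a cluster whole if a single shared HPM scores at least as well
    as the best recursive partition of its two spatial halves.
    Returns a list of voxel clusters.
    """
    if len(voxels) < min_size:
        return [voxels]
    whole_score = score(voxels)
    left, right = split_spatially(voxels)
    left_parts = hierarchical_partition(left, score, split_spatially, min_size)
    right_parts = hierarchical_partition(right, score, split_spatially, min_size)
    split_score = sum(score(p) for p in left_parts + right_parts)
    if whole_score >= split_score:
        return [voxels]
    return left_parts + right_parts
```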
33
Parameter Sharing in HPMs ~ Experiments ~
• We compare three models, based on average (per-trial) likelihood:
  – StHPM – standard, per-voxel HPM
  – ShHPM – one HPM shared by all voxels in an ROI (24 ROIs total)
  – HieHPM – hierarchical HPM
• Effect of training set size (6 to 40 examples) in CALC:
  – ShHPM is biased here:
    • better than StHPM at small sample sizes
    • worse at 40 examples
  – HieHPM is the best:
    • it can represent both of the other models
    • e^106 times better data likelihood than StHPM at 40 examples
    • StHPM needs 2.9 times more examples to catch up
34
Parameter Sharing in HPMs ~ Experiments ~
Performance over the whole brain (40 examples):
• HieHPM – the best:
  – e^1792 times better data likelihood than StHPM
  – better than StHPM in 23/24 ROIs
  – better than ShHPM in 12/24 ROIs, equal in 11/24
• ShHPM – second best:
  – e^464 times better data likelihood than StHPM
  – better than StHPM in 18/24 ROIs
  – it is biased, but it makes sense to share whole ROIs that are not involved in the cognitive task
35
Learned Voxel Clusters
• In the whole brain: ~300 clusters, ~15 voxels/cluster
• In CALC: ~60 clusters, ~5 voxels/cluster
36
Sentence Process in CALC
37
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
38
Parameter Domain Knowledge Types
• DISCRETE:
  – Known Parameter Values
  – Parameter Sharing and Proportionality Constants – One Distribution
  – Sum Sharing and Ratio Sharing – One Distribution
  – Parameter Sharing and Hierarchical Sharing – Multiple Distributions
  – Sum Sharing and Ratio Sharing – Multiple Distributions
• CONTINUOUS (Gaussian Distributions):
  – Parameter Sharing and Proportionality Constants – One Distribution
  – Parameter Sharing in Hidden Process Models
• INEQUALITY CONSTRAINTS:
  – Between Sums of Parameters – One Distribution
  – Upper Bounds on Sums of Parameters – One Distribution
39
Probability Ratio Sharing
• Want to model P(Word | Language)
• Two languages: English, Spanish
• Different sets of words
• Domain Knowledge:
  – Word groups, e.g.:
    • T1 – computer words: computer, keyboard, monitor, etc.
    • T2 – business words
  – The relative frequency of “computer” to “keyboard” is the same in both languages
  – The aggregate mass of each group can differ between languages

[Figure: the two distributions c_1 = P(Word | English) and c_2 = P(Word | Spanish), with the shared word groups highlighted]
40
Probability Ratio Sharing
Consider several discrete distributions over the same k values, arranged as a table of parameters p_ij (row i = distribution, column j = value):

p_11  p_12  ...  p_1k
p_21  p_22  ...  p_2k
 ...
p_51  p_52  ...  p_5k

DK: Parameters of a given color (group) preserve their relative ratios across all distributions; for example, for distributions 1 and 2:

$$\frac{p_{11}}{p_{21}} = \frac{p_{12}}{p_{22}} = \cdots = \frac{p_{1k}}{p_{2k}}$$
41
Proportionality Constants for Gaussians
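In the spirit of the HPM amplitude constants above, this constraint ties several Gaussian means together through known proportionality constants (a reconstructed sketch; $c_i$ are the expert-supplied constants):

$$X_i \sim \mathcal{N}(c_i\,\mu,\ \sigma^2), \qquad i = 1, \dots, n,$$

so all of the variables' data contributes to estimating the single free mean $\mu$.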
42
Inequalities between Sums of Parameters
In spoken language:
• Each Adverb comes along with a Verb
• Each Adjective comes with a Noun or a Pronoun

Therefore it is reasonable to expect that:
• The frequency of Adverbs is less than that of Verbs
• The frequency of Adjectives is less than that of Nouns and Pronouns

Equivalently: P(Adverb) ≤ P(Verb) and P(Adjective) ≤ P(Noun) + P(Pronoun).

In general, within the same distribution, for disjoint sets of values A and B:

$$\sum_{i \in A} \theta_i \ \le \ \sum_{j \in B} \theta_j$$

(a sketch of the resulting estimator follows)
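A sketch of the constrained ML estimate for this case, under the assumption that groups A and B together cover all outcomes: if the unconstrained estimates violate the inequality, the KKT conditions make it active, giving each group half the mass, divided within each group proportionally to the counts (hypothetical names):

```python
import numpy as np

def mle_with_sum_inequality(counts, A, B):
    """ML estimates of a discrete distribution s.t. sum(theta[A]) <= sum(theta[B]).

    Assumes A and B together cover all outcomes. If the unconstrained
    MLE theta_i = N_i / N already satisfies the constraint, keep it;
    otherwise the constraint is active: each group gets mass 1/2,
    divided within the group proportionally to the counts.
    """
    counts = np.asarray(counts, dtype=float)
    theta = counts / counts.sum()
    if theta[A].sum() <= theta[B].sum():
        return theta
    theta = np.empty_like(counts)
    theta[A] = 0.5 * counts[A] / counts[A].sum()
    theta[B] = 0.5 * counts[B] / counts[B].sum()
    return theta
```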
43
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
Related Work
• Summary / Future Work
44
Dirichlet Priors in a Bayes Net
[Figure: a Dirichlet prior over the parameters; its mean encodes the prior belief, its variance the spread]

• The domain expert specifies an assignment of the parameters
  – the prior leaves room for some error (variance)
• Several types:
  – Standard Dirichlet
  – Dirichlet Tree Priors
  – Dependent Dirichlet
45
Markov Models
[Figure: a Markov chain ... → W_{t-1} → W_t → W_{t+1} → ...; the transition distribution P(W_t | W_{t-1}) is shared across all time steps]
46
Module Networks

In a Module:
• Same parents
• Same CPTs
Image from “Learning Module Networks” by Eran Segal and Daphne Koller
47
Context Specific Independence
[Figure: an example network with nodes Alarm, Set, and Burglary, illustrating context-specific independence]
48
Limitations of Current Models
• Dirichlet priors:
  – When the number of parameters is huge, specifying a useful prior is difficult
  – Unable to enforce even simple constraints: additional hyperparameters are needed to enforce basic parameter sharing, and then no closed-form MAP estimates can be computed!
  – Dependent Dirichlet Priors are not conjugate priors
    • Our priors are both dependent and conjugate!
• Markov Models, Module Networks and CSI:
  – Particular cases of our Parameter Sharing domain knowledge
  – Do not allow sharing at the granularity of individual parameters
49
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
Summary / Future Work
50
Summary
• Parameter-related domain knowledge is needed when data is scarce:
  – it reduces the number of free parameters
  – it reduces the variance of parameter estimates (illustrated on Simple Parameter Sharing)
• Developed a unified Parameter Domain Knowledge framework
  – from both a frequentist and a Bayesian point of view
  – for both complete and incomplete data
• Developed efficient learning algorithms for several types of PDK:
  – closed-form solutions for most of these types
  – for both discrete and continuous variables
  – for both equality and inequality constraints
  – Markov Models, Module Networks and Context-Specific Independence are particular cases of our parameter sharing framework
• Developed a method for automatically learning the domain knowledge (illustrated on HPMs)
• Experiments show the superiority of models using PDK
51
Future Work
• Interactions among different types of Parameter Domain Knowledge
• Incorporate Parameter Domain Knowledge in Structure Learning
• Hard vs. Soft constraints
• Parameter Domain Knowledge for learning Undirected Graphical Models
52
Questions?
53
THE END