Scoring Functions Minimum Description Length (MDL) Bayesian Dirichlet Score Family Wrap-up
Scoring Functions for Learning Bayesian Networks
Brandon Malone
Much of this material is adapted from Suzuki 1993, Lam and Bacchus 1994, and Heckerman 1998
Many of the images were taken from the Internet
February 13, 2014
Brandon Malone Scoring Functions for Learning Bayesian Networks
Scoring Functions for Learning Bayesian Networks
Suppose we have two Bayesian network structures, N1 and N2.
[Figure: two candidate network structures, N1 and N2, over the variables Winter? (A), Sprinkler? (B), Rain? (C), Wet Grass? (D), and Slippery Road? (E).]
Which structure best explains a dataset D?
We will use scoring functions to rate each network. The one with the best score is “better.”
1 Scoring Functions
2 Minimum Description Length (MDL)
3 Bayesian Dirichlet Score Family
4 Wrap-up
Why do we want to learn structures?
Knowledge discovery (“interpretation”)
Density estimation (“prediction”)
Assumptions (generally)
Multinomial samples
Complete data
Parameter independence (global and local)
Parameter modularity
Under and overfitting
Underfitting: too simple
Overfitting: too complex
Tradeoff: “just right”
What does it mean in Bayesian networks?
Minimum description length (MDL)
MDL views learning as data compression.
Traditionally, MDL consists of two components.
Model encoding
Data encoding, using the model
A few properties
Formalizes Occam’s Razor
Works regardless of a “true” model
Avoiding overfitting with MDL
Short model encoding, long data encoding
Long model encoding, short data encoding
Medium model encoding, medium data encoding
We will favor models which do not use too many bits to encode either the model or the data.
Encoding a Bayesian network
We must encode:
Parents of each node
We need $\log_2 n$ bits for each parent.
Conditional probability parameters
We need $(r_i - 1) \cdot q_i$ parameters for $X_i$.
We need $\frac{\log_2 N}{2}$ bits per parameter.
The total complexity is as follows.
$$\sum_{i=1}^{n} \left( \log n \cdot |PA_i| + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right)$$
Other encodings are possible.
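The encoding cost above can be computed in a few lines. The following is a minimal sketch, not part of the original slides: the function name `model_bits`, the dictionary layout, and the two example structures are hypothetical. It assumes $n$ is the number of variables, $N$ the number of records, $r_i$ the number of states of $X_i$, and $q_i$ the number of joint configurations of its parents.

```python
import math

def model_bits(parents, arity, num_records):
    """Bits needed to encode a structure and its parameters.

    parents:     dict mapping each variable to a tuple of its parents
    arity:       dict mapping each variable to its number of states (r_i)
    num_records: number of records N in the dataset
    """
    n = len(arity)
    total = 0.0
    for x, pa in parents.items():
        q = math.prod(arity[p] for p in pa)   # q_i; equals 1 when there are no parents
        structure = math.log2(n) * len(pa)    # log2(n) bits per parent
        params = 0.5 * math.log2(num_records) * (arity[x] - 1) * q
        total += structure + params
    return total

# Hypothetical example: five binary variables and 1024 records.
arity = {v: 2 for v in "ABCDE"}
sparse = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C"), "E": ("C",)}
dense = {"A": (), "B": ("A",), "C": ("A", "B"),
         "D": ("A", "B", "C"), "E": ("A", "B", "C", "D")}

print(model_bits(sparse, arity, 1024))
print(model_bits(dense, arity, 1024))
```

The dense structure costs more bits because $q_i$ doubles with every additional binary parent, which is the part of the score that penalizes complex networks.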
Encoding data with a Bayesian network
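Each complete instantiation $D_l$ is assigned a binary string (codeword) with length $\approx -\log p_l$. We can approximate this value using the counts from the data.
$$\begin{aligned}
p_l &= P(D_l \mid D, \mathcal{N}) \\
&= \prod_{i=1}^{n} \theta_{ijk:l} && \text{chain rule of BNs} \\
&= \prod_{i=1}^{n} \frac{N_{ijk:l}}{N_{ij:l}} && \text{using MLE parameters}
\end{aligned}$$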
Encoding data with a Bayesian network
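Each complete instantiation $D_l$ is assigned a binary string (codeword) with length $\approx -\log p_l$.
$$\begin{aligned}
\mathrm{len}(D : \mathcal{N}) &= \sum_{l=1}^{N} \mathrm{len}(D_l : \mathcal{N}) \\
&= \sum_{l=1}^{N} -\log \prod_{i=1}^{n} \frac{N_{ijk:l}}{N_{ij:l}} \\
&= -\sum_{l=1}^{N} \log \prod_{i=1}^{n} \prod_{k=1}^{r_i} \prod_{j=1}^{q_i} \frac{N_{ijk:l}}{N_{ij:l}} \\
&= -\sum_{l=1}^{N} \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \log \frac{N_{ijk:l}}{N_{ij:l}} \\
&= -\sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \cdot \log \frac{N_{ijk}}{N_{ij}}
\end{aligned}$$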
This is the negative log-likelihood, $-\ell$, of the data using the MLE parameters.
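The final line of the derivation needs only the counts $N_{ijk}$ and $N_{ij}$. As a rough sketch (the function name `data_bits`, the record layout, and the toy data are assumptions made for illustration), the code length contributed by one variable given a candidate parent set could be computed as follows; summing it over all variables gives $\mathrm{len}(D : \mathcal{N})$.

```python
import math
from collections import Counter

def data_bits(records, child, parents):
    """Code length in bits of one variable given its parents,
    using the maximum likelihood parameters N_ijk / N_ij."""
    n_ijk = Counter()   # counts of (parent configuration j, child state k)
    n_ij = Counter()    # counts of parent configuration j
    for row in records:
        j = tuple(row[p] for p in parents)
        n_ijk[(j, row[child])] += 1
        n_ij[j] += 1
    # -sum_j sum_k N_ijk * log2(N_ijk / N_ij); zero counts never enter the Counter
    return -sum(c * math.log2(c / n_ij[j]) for (j, _k), c in n_ijk.items())

# Hypothetical toy data: each record maps a variable name to its state.
records = [
    {"Rain": 1, "WetGrass": 1},
    {"Rain": 1, "WetGrass": 1},
    {"Rain": 0, "WetGrass": 0},
    {"Rain": 0, "WetGrass": 1},
]

print(data_bits(records, "WetGrass", ()))          # no parents
print(data_bits(records, "WetGrass", ("Rain",)))   # Rain as the only parent
```

On this toy data, adding Rain as a parent shortens the data encoding. Adding parents never lengthens it, which is why the data term alone would always favor the densest network.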
MDL as a scoring function
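As derived here, the MDL score for a network $\mathcal{N}$ given a dataset $D$ is as follows.
$$\mathrm{MDL}(\mathcal{N} : D) = \sum_{i=1}^{n} \left( -\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} + \log n \cdot |PA_i| + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right)$$
As the dataset size $N$ grows, the $\log n \cdot |PA_i|$ term stays constant while the other terms grow, so it becomes negligible. The most commonly used version of MDL therefore omits it.
$$\mathrm{MDL}(\mathcal{N} : D) = \sum_{i=1}^{n} \left( -\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right)$$
$$\mathrm{MDL}(\mathcal{N} : D) = \sum_{i=1}^{n} \left( -\ell(X_i \mid PA_i) + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right)$$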
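To see the tradeoff at work, the commonly used form of the score can be evaluated on two small datasets. This is an illustrative sketch only: `mdl_family`, `mdl`, the helper `make`, and both datasets are hypothetical, and lower scores are treated as better because the score is a code length.

```python
import math
from collections import Counter

def mdl_family(records, child, parents, arity):
    """One term of the sum: -l(X_i | PA_i) + (log N / 2) * (r_i - 1) * q_i."""
    n_ijk, n_ij = Counter(), Counter()
    for row in records:
        j = tuple(row[p] for p in parents)
        n_ijk[(j, row[child])] += 1
        n_ij[j] += 1
    loglik = sum(c * math.log2(c / n_ij[j]) for (j, _k), c in n_ijk.items())
    q = math.prod(arity[p] for p in parents)
    penalty = 0.5 * math.log2(len(records)) * (arity[child] - 1) * q
    return -loglik + penalty

def mdl(records, structure, arity):
    """MDL score of a structure given as {variable: tuple of parents}."""
    return sum(mdl_family(records, x, pa, arity) for x, pa in structure.items())

def make(counts):
    """Expand {(x, y): count} into a list of records."""
    return [{"X": x, "Y": y} for (x, y), c in counts.items() for _ in range(c)]

arity = {"X": 2, "Y": 2}
empty = {"X": (), "Y": ()}
with_edge = {"X": (), "Y": ("X",)}

dependent = make({(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40})
independent = make({(0, 0): 25, (0, 1): 25, (1, 0): 25, (1, 1): 25})

for name, data in [("dependent", dependent), ("independent", independent)]:
    print(name, mdl(data, empty, arity), mdl(data, with_edge, arity))
```

When Y depends on X, the shorter data encoding pays for the extra parameter and the edge wins. When the two are independent, the edge buys nothing and the penalty makes the empty graph score better.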
Bayesian Dirichlet (BD) Score Family
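Suppose we would like to maximize the joint probability of the data $D$ and network $\mathcal{N}$.
$$\begin{aligned}
\mathcal{N}^* &= \arg\max_{\mathcal{N}} P(D, \mathcal{N}) \\
&= \arg\max_{\mathcal{N}} P(D \mid \mathcal{N}) P(\mathcal{N})
\end{aligned}$$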
We again have two parts.
Evaluation of the model, P(N )
Evaluation of data given the model, P(D|N )
Data given the model
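We need to evaluate $P(D \mid \mathcal{N})$. We derived $P(x_l \mid D, \mathcal{N})$ for parameter estimation.
$$P(x_l \mid D, \mathcal{N}) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \frac{\alpha_{ijk:l} + n_{ijk:l}}{\sum_k (\alpha_{ijk:l} + n_{ijk:l})}$$
Because the samples are iid, we can evaluate $P(D \mid \mathcal{N})$ by taking the product.
$$\begin{aligned}
P(D \mid \mathcal{N}) &= \prod_{l=1}^{N} \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \frac{\alpha_{ijk:l} + n_{ijk:l}}{\sum_k (\alpha_{ijk:l} + n_{ijk:l})} \\
&= \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}
\end{aligned}$$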
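The step from the product over records to the Gamma form can be checked numerically. The sketch below makes one interpretive assumption: the counts $n_{ijk:l}$ in the predictive term are read as the counts accumulated from the records before $x_l$. It handles a single variable with one parent configuration, and the function names, observations, and hyperparameters are hypothetical.

```python
import math
from collections import Counter

def log_closed_form(column, alphas):
    """log of Gamma(a_ij) / Gamma(a_ij + n_ij) * prod_k Gamma(a_ijk + n_ijk) / Gamma(a_ijk)."""
    counts = Counter(column)
    a_ij = sum(alphas.values())
    result = math.lgamma(a_ij) - math.lgamma(a_ij + len(column))
    for k, a in alphas.items():
        result += math.lgamma(a + counts[k]) - math.lgamma(a)
    return result

def log_sequential(column, alphas):
    """log of the product of predictive probabilities, updating counts after each record."""
    counts = Counter()
    a_ij = sum(alphas.values())
    total = 0.0
    for seen, k in enumerate(column):
        total += math.log((alphas[k] + counts[k]) / (a_ij + seen))
        counts[k] += 1
    return total

column = [0, 1, 1, 0, 1, 1, 1]   # hypothetical observations of a binary variable
alphas = {0: 1.0, 1: 1.0}         # hypothetical Dirichlet hyperparameters

print(log_closed_form(column, alphas))
print(log_sequential(column, alphas))
```

Both functions return the same value, and reordering the observations does not change it. Working with `math.lgamma` rather than `math.gamma` avoids overflow for realistic counts.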
Score probability, P(D,N )
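We are interested in the joint probability of $D$ and $\mathcal{N}$.
$$\begin{aligned}
P(D, \mathcal{N}) &= P(\mathcal{N}) P(D \mid \mathcal{N}) \\
&= P(\mathcal{N}) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}
\end{aligned}$$
This is called the BD scoring function. If we set $\alpha_{ijk} = 1$, then it is called the K2 metric.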
Some desirable equivalences
Say $N_1$ and $N_2$ are Markov equivalent.
Prior probabilities and equivalence
$P(N_1) = P(N_2)$
Likelihood probabilities and equivalence (not guaranteed by BD)
$P(D \mid N_1) = P(D \mid N_2)$
Score probabilities and equivalence
$P(D, N_1) = P(D, N_2)$
BDe and BDeu
We can restrict the hyperparameters to ensure likelihood equivalence. This is BDe.
$$\alpha_{ijk} = \alpha \cdot P(X_i = k, PA_i = j \mid \mathcal{N})$$
Typically, uninformative hyperparameters are used. This is BDeu.
$$\alpha_{ijk} = \frac{\alpha}{r_i \cdot q_i}$$
The BDeu scoring function
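We can incorporate our assumptions to derive the BDeu scoring function.
$$\begin{aligned}
P(D, \mathcal{N}) &= P(\mathcal{N}) P(D \mid \mathcal{N}) && \text{rewrite using chain rule} \\
&= P(\mathcal{N}) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} && \text{substitute probability of data} \\
&\propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} && \text{assume a uniform structure prior} \\
&\propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\frac{\alpha}{q_i})}{\Gamma(\frac{\alpha}{q_i} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\frac{\alpha}{r_i \cdot q_i} + n_{ijk})}{\Gamma(\frac{\alpha}{r_i \cdot q_i})} && \text{replace the } \alpha\text{s}
\end{aligned}$$
$$\begin{aligned}
\mathrm{BDeu}(\mathcal{N} : D, \alpha) &= \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log \frac{\Gamma(\frac{\alpha}{q_i})}{\Gamma(\frac{\alpha}{q_i} + n_{ij})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(\frac{\alpha}{r_i \cdot q_i} + n_{ijk})}{\Gamma(\frac{\alpha}{r_i \cdot q_i})} \right] && \text{work in log-space} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log \Gamma\!\left(\frac{\alpha}{q_i}\right) - \log \Gamma\!\left(\frac{\alpha}{q_i} + n_{ij}\right) + \sum_{k=1}^{r_i} \left( \log \Gamma\!\left(\frac{\alpha}{r_i \cdot q_i} + n_{ijk}\right) - \log \Gamma\!\left(\frac{\alpha}{r_i \cdot q_i}\right) \right) \right] && \text{remove divisions}
\end{aligned}$$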
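The log-space form maps directly onto the log-Gamma function. The sketch below is an illustration under stated assumptions: states are coded as integers $0, \ldots, r_i - 1$, the structure prior is uniform and dropped, and the names `log_bd_family`, `bdeu`, `k2` and the toy data are hypothetical. It also compares the two Markov equivalent structures X → Y and Y → X under BDeu and under K2.

```python
import math
from collections import Counter

def log_bd_family(records, child, parents, arity, a_ijk):
    """Log BD term for one variable, with the same hyperparameter a_ijk in every cell."""
    n_ijk, n_ij = Counter(), Counter()
    for row in records:
        j = tuple(row[p] for p in parents)
        n_ijk[(j, row[child])] += 1
        n_ij[j] += 1
    r = arity[child]
    a_ij = r * a_ijk
    total = 0.0
    for j in n_ij:   # unobserved parent configurations contribute log(1) = 0
        total += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij[j])
        for k in range(r):
            total += math.lgamma(a_ijk + n_ijk[(j, k)]) - math.lgamma(a_ijk)
    return total

def bdeu(records, structure, arity, alpha=1.0):
    """BDeu score: a_ijk = alpha / (r_i * q_i)."""
    total = 0.0
    for x, pa in structure.items():
        q = math.prod(arity[p] for p in pa)
        total += log_bd_family(records, x, pa, arity, alpha / (arity[x] * q))
    return total

def k2(records, structure, arity):
    """K2 metric in log-space: a_ijk = 1."""
    return sum(log_bd_family(records, x, pa, arity, 1.0) for x, pa in structure.items())

pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 1)]   # hypothetical data
records = [{"X": x, "Y": y} for x, y in pairs]
arity = {"X": 2, "Y": 2}
x_to_y = {"X": (), "Y": ("X",)}
y_to_x = {"Y": (), "X": ("Y",)}

print(bdeu(records, x_to_y, arity), bdeu(records, y_to_x, arity))
print(k2(records, x_to_y, arity), k2(records, y_to_x, arity))
```

On this data the two BDeu scores coincide, while the two K2 scores differ, which illustrates why the hyperparameters have to be restricted to obtain likelihood equivalence.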
Decomposability
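Both MDL and BD are decomposable: a sum over terms which involve only a variable and its parents.
$$\mathrm{MDL}(\mathcal{N} : D) = \sum_{i=1}^{n} \left\{ -\ell(X_i \mid PA_i) + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right\}$$
$$\mathrm{BDeu}(\mathcal{N} : D, \alpha) = \sum_{i=1}^{n} \left\{ \sum_{j=1}^{q_i} \left[ \log \Gamma\!\left(\frac{\alpha}{q_i}\right) - \log \Gamma\!\left(\frac{\alpha}{q_i} + n_{ij}\right) + \sum_{k=1}^{r_i} \left( \log \Gamma\!\left(\frac{\alpha}{r_i \cdot q_i} + n_{ijk}\right) - \log \Gamma\!\left(\frac{\alpha}{r_i \cdot q_i}\right) \right) \right] \right\}$$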
What does it mean when we evaluate different structures?
[Figure: the two candidate structures over Winter? (A), Sprinkler? (B), Rain? (C), Wet Grass? (D), and Slippery Road? (E).]
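One practical reading of decomposability is that two structures which differ in a single parent set share every other term of the sum. The sketch below illustrates this with a cache keyed by (variable, parents). Everything in it is hypothetical: the edges of `n1` and `n2` are not taken from the figure, and `toy_local_score` is a stand-in for a real MDL or BDeu family term.

```python
def total_score(structure, local_score, cache):
    """Sum the local scores, reusing any (variable, parents) pair already scored."""
    total = 0.0
    for x, parents in structure.items():
        key = (x, tuple(sorted(parents)))
        if key not in cache:
            cache[key] = local_score(*key)
        total += cache[key]
    return total

calls = []

def toy_local_score(x, parents):
    """Stand-in for a family term; it records each call so the reuse is visible."""
    calls.append((x, parents))
    return -1.0 * len(parents)

# Hypothetical structures that differ only in the parents of E.
n1 = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C"), "E": ("C",)}
n2 = dict(n1, E=("C", "D"))

cache = {}
s1 = total_score(n1, toy_local_score, cache)
s2 = total_score(n2, toy_local_score, cache)
print(len(calls))   # 5 family terms for n1, plus 1 new term for n2
```

Scoring the second structure costs one extra local evaluation instead of five, which is what makes searching over many neighboring structures affordable.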
Limitations of scoring functions
Parameter independence is violated if data is missing.
Experimental data is different from observational data.
(MDL) When do we use asymptotics?
(BD) How do we specify α and P(N )?
Recap
During this part of the course, we have discussed:
Overfitting
Minimum description length scoring function for BNs
BD family of scores for BNs
Next in probabilistic models
We will discuss two strategies for learning Bayesian network structures.
A greedy hill climbing algorithm which finds local optima
A dynamic programming algorithm which is guaranteed to find an optimal network