Probability Distributions Bayesian Network Naive Bayes
CS540 Introduction to Artificial Intelligence, Lecture 11
Young Wu
Based on lecture slides by Jerry Zhu, Yingyu Liang, and Charles Dyer
July 5, 2021
Joint Distribution (Motivation)
The joint distribution of $X_j$ and $X_{j'}$ gives the probability that $X_j = x_j$ and $X_{j'} = x_{j'}$ occur at the same time:

$P\{X_j = x_j, X_{j'} = x_{j'}\}$

The marginal distribution of $X_j$ can be found by summing over all possible values of $X_{j'}$:

$P\{X_j = x_j\} = \sum_{x \in X_{j'}} P\{X_j = x_j, X_{j'} = x\}$
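As a quick illustration (toy numbers of my own, not from the lecture), a joint table over two binary variables can be marginalized by summing out the other variable:

```python
# Joint distribution P{X = x, Y = y} stored as a dict keyed by (x, y).
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

def marginal(joint, value):
    """P{X = value} = sum over all y of P{X = value, Y = y}."""
    return sum(p for (x, y), p in joint.items() if x == value)

print(marginal(joint, 0))  # 0.3 + 0.2 = 0.5
```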
Conditional Distribution (Motivation)
Suppose the joint distribution $P\{X_j = x_j, X_{j'} = x_{j'}\}$ is given.
The conditional distribution of $X_j$ given $X_{j'} = x_{j'}$ is the ratio between the joint distribution and the marginal distribution:

$P\{X_j = x_j \mid X_{j'} = x_{j'}\} = \dfrac{P\{X_j = x_j, X_{j'} = x_{j'}\}}{P\{X_{j'} = x_{j'}\}}$
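A minimal sketch of this ratio on the same kind of toy joint table (numbers are my own, not the lecture's):

```python
# Joint distribution P{X = x, Y = y} as a dict keyed by (x, y).
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

def conditional(joint, x, y):
    """P{X = x | Y = y} = P{X = x, Y = y} / P{Y = y}."""
    p_y = sum(p for (xv, yv), p in joint.items() if yv == y)
    return joint[(x, y)] / p_y

print(conditional(joint, 0, 1))  # 0.2 / (0.2 + 0.4) = 1/3
```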
Notation (Motivation)
The notations for joint, marginal, and conditional distributions will be shortened as follows:

$P\{x_j, x_{j'}\}, \quad P\{x_j\}, \quad P\{x_j \mid x_{j'}\}$

When the context is not clear, for example when $x_j = a, x_{j'} = b$ for specific constants $a, b$, subscripts will be used under the probability sign:

$P_{X_j, X_{j'}}\{a, b\}, \quad P_{X_j}\{a\}, \quad P_{X_j \mid X_{j'}}\{a \mid b\}$
Bayesian Network (Definition)
A Bayesian network is a directed acyclic graph (DAG) together with a set of conditional probability distributions.
Each vertex represents a feature $X_j$.
Each edge from $X_j$ to $X_{j'}$ represents that $X_j$ directly influences $X_{j'}$.
No edge between $X_j$ and $X_{j'}$ implies independence or conditional independence between the two features.
Conditional Independence (Definition)
Recall that two events $A, B$ are independent if:

$P\{A, B\} = P\{A\} P\{B\}$, or $P\{A \mid B\} = P\{A\}$

In general, two events $A, B$ are conditionally independent, conditional on event $C$, if:

$P\{A, B \mid C\} = P\{A \mid C\} P\{B \mid C\}$, or $P\{A \mid B, C\} = P\{A \mid C\}$
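The factorization condition can be checked numerically from a joint table. A sketch (function name and toy numbers are my own; the joint is constructed so that A and B are independent given C by design):

```python
def is_cond_independent(joint, tol=1e-9):
    """Return True iff P{a, b | c} = P{a | c} * P{b | c} for all (a, b, c)."""
    vals = lambda i: sorted({k[i] for k in joint})
    for c in vals(2):
        p_c = sum(p for k, p in joint.items() if k[2] == c)
        for a in vals(0):
            p_ac = sum(p for k, p in joint.items() if k[0] == a and k[2] == c)
            for b in vals(1):
                p_bc = sum(p for k, p in joint.items() if k[1] == b and k[2] == c)
                p_abc = joint.get((a, b, c), 0.0)
                if abs(p_abc / p_c - (p_ac / p_c) * (p_bc / p_c)) > tol:
                    return False
    return True

# Build a joint over (a, b, c) where, given C = c, A and B are
# independent Bernoulli variables with parameters qa and qb.
joint = {}
for c, (qa, qb) in enumerate([(0.9, 0.3), (0.2, 0.7)]):
    for a in (0, 1):
        for b in (0, 1):
            pa = qa if a == 1 else 1 - qa
            pb = qb if b == 1 else 1 - qb
            joint[(a, b, c)] = 0.5 * pa * pb  # P{C = c} = 0.5

print(is_cond_independent(joint))  # True
```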
Causal Chain (Definition)
For three events $A, B, C$, the configuration $A \to B \to C$ is called a causal chain.
In this configuration, $A$ is not independent of $C$, but $A$ is conditionally independent of $C$ given information about $B$.
Once $B$ is observed, $A$ and $C$ are independent.
Common Cause (Definition)
For three events $A, B, C$, the configuration $A \leftarrow B \to C$ is called common cause.
In this configuration, $A$ is not independent of $C$, but $A$ is conditionally independent of $C$ given information about $B$.
Once $B$ is observed, $A$ and $C$ are independent.
Common Effect (Definition)
For three events $A, B, C$, the configuration $A \to B \leftarrow C$ is called common effect.
In this configuration, $A$ is independent of $C$, but $A$ is not conditionally independent of $C$ given information about $B$.
Once $B$ is observed, $A$ and $C$ are not independent.
Storing Distribution (Definition)
If there are $m$ binary variables, there are $2^m$ joint probabilities to store.
There are significantly fewer conditional probabilities to store. For example, if each node has at most 2 parents, its table has at most $2^2 = 4$ rows, so there are at most $4m$ conditional probabilities to store.
Given the conditional probabilities, the joint probabilities can be recovered.
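The gap between the two storage costs grows quickly with $m$; a two-line arithmetic check:

```python
m = 20                      # number of binary variables
joint_entries = 2 ** m      # full joint table: 1,048,576 entries
cpt_entries_bound = 4 * m   # CPTs, at most 2 parents each: at most 80 entries
print(joint_entries, cpt_entries_bound)  # 1048576 80
```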
Conditional Probability Table Diagram (Definition)
Conditional Probability Table Example (Definition)
Conditional Probability Table Larger Example (Definition)
Training Bayes Net (Definition)
Training a Bayesian network given the DAG means estimating the conditional probabilities. Let $P(X_j)$ denote the parents of the vertex $X_j$, and $p(X_j)$ be realizations (possible values) of $P(X_j)$:

$P\{x_j \mid p(X_j)\}, \quad p(X_j) \in P(X_j)$

It can be done by maximum likelihood estimation given a training set:

$P\{x_j \mid p(X_j)\} = \dfrac{c_{x_j, p(X_j)}}{c_{p(X_j)}}$

where $c$ denotes counts in the training set.
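A minimal sketch of the count-based estimate (my own toy data, a single node with one binary parent):

```python
from collections import Counter

# Each row is (parent_value, x_value) observed in the training set.
rows = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

pair_counts = Counter(rows)                  # c_{x_j, p(X_j)}
parent_counts = Counter(p for p, _ in rows)  # c_{p(X_j)}

def cpt(x, parent):
    """MLE estimate of P{x_j = x | parent}."""
    return pair_counts[(parent, x)] / parent_counts[parent]

print(cpt(0, 0))  # 3 of the 4 rows with parent = 0 have x = 0, so 0.75
```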
Bayes Net Training Example, Part I (Definition)
Bayes Net Training Example, Part II (Definition)
Laplace Smoothing (Definition)
Recall that the MLE estimate can incorporate Laplace smoothing:

$P\{x_j \mid p(X_j)\} = \dfrac{c_{x_j, p(X_j)} + 1}{c_{p(X_j)} + |X_j|}$

Here, $|X_j|$ is the number of possible values (number of categories) of $X_j$.
Laplace smoothing is considered regularization for Bayesian networks because it avoids overfitting the training data.
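The same count-based estimate with Laplace smoothing, so unseen (value, parent) pairs still get nonzero probability (toy data of my own):

```python
from collections import Counter

rows = [(0, 0), (0, 0), (0, 1), (1, 1)]  # (parent_value, x_value) pairs
n_values = 2                             # |X_j|: X_j is binary

pair_counts = Counter(rows)
parent_counts = Counter(p for p, _ in rows)

def cpt_smoothed(x, parent):
    return (pair_counts[(parent, x)] + 1) / (parent_counts[parent] + n_values)

# Parent value 1 was seen once, always with x = 1, yet P{x=0 | parent=1} > 0.
print(cpt_smoothed(0, 1))  # (0 + 1) / (1 + 2) = 1/3
```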
Bayes Net Inference 1 (Definition)
Given the conditional probability tables, the joint probabilities can be calculated using the chain rule and conditional independence (with the variables ordered so that parents appear before their children):

$P\{x_1, x_2, \ldots, x_m\} = \prod_{j=1}^{m} P\{x_j \mid x_1, x_2, \ldots, x_{j-1}\} = \prod_{j=1}^{m} P\{x_j \mid p(X_j)\}$
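A sketch of this factorization for a 3-node chain $A \to B \to C$ (CPT numbers are my own toy values):

```python
# P{a, b, c} = P{a} * P{b | a} * P{c | b}
p_a = {1: 0.6, 0: 0.4}
p_b_given_a = {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.2, (0, 0): 0.8}  # (b, a)
p_c_given_b = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.5, (0, 0): 0.5}  # (c, b)

def joint(a, b, c):
    return p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(1, 1, 1), total)  # 0.6 * 0.7 * 0.9 = 0.378; total is 1.0
```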
Bayes Net Inference 2 (Definition)
Given the joint probabilities, all other marginal and conditional probabilities can be calculated using their definitions:

$P\{x_j \mid x_{j'}, x_{j''}, \ldots\} = \dfrac{P\{x_j, x_{j'}, x_{j''}, \ldots\}}{P\{x_{j'}, x_{j''}, \ldots\}}$

$P\{x_j, x_{j'}, x_{j''}, \ldots\} = \sum_{X_k : k \neq j, j', j'', \ldots} P\{x_1, x_2, \ldots, x_m\}$

$P\{x_{j'}, x_{j''}, \ldots\} = \sum_{X_k : k \neq j', j'', \ldots} P\{x_1, x_2, \ldots, x_m\}$
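Such queries can be answered by brute-force enumeration: sum the joint over every unobserved variable. A sketch on a toy 3-node chain $A \to B \to C$ (CPT numbers are my own), computing $P\{c \mid a\}$ by summing out $B$:

```python
p_a = {1: 0.6, 0: 0.4}
p_b_given_a = {(1, 1): 0.7, (0, 1): 0.3, (1, 0): 0.2, (0, 0): 0.8}  # (b, a)
p_c_given_b = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.5, (0, 0): 0.5}  # (c, b)

def joint(a, b, c):
    return p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]

def p_c_given_a(c, a):
    num = sum(joint(a, b, c) for b in (0, 1))                        # P{a, c}
    den = sum(joint(a, b, cc) for b in (0, 1) for cc in (0, 1))      # P{a}
    return num / den

print(p_c_given_a(1, 1))  # (0.378 + 0.09) / 0.6 = 0.78
```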
Bayes Net Inference Example, Part I (Definition)
Bayes Net Inference Example, Part II (Definition)
Bayes Net Inference Example, Part III (Definition)
Bayesian Network (Algorithm)
Input: instances $\{x_i\}_{i=1}^{n}$ and a directed acyclic graph such that feature $X_j$ has parents $P(X_j)$.
Output: conditional probability tables (CPTs): $P\{x_j \mid p(X_j)\}$ for $j = 1, 2, \ldots, m$.
Compute the conditional probabilities using counts and Laplace smoothing:

$P\{x_j \mid p(X_j)\} = \dfrac{c_{x_j, p(X_j)} + 1}{c_{p(X_j)} + |X_j|}$
Network Structure (Discussion)
Selecting from all possible structures (DAGs) is too difficult.
Usually, a Bayesian network is learned with a tree structure.
Choose the tree that maximizes the likelihood of the training data.
Chow-Liu Algorithm (Discussion)
Add an edge between features $X_j$ and $X_{j'}$, with edge weight equal to the information gain of $X_j$ given $X_{j'}$, for all pairs $j, j'$.
Find the maximum spanning tree given these edges. The spanning tree is used as the structure of the Bayesian network.
Aside: Prim's Algorithm (Discussion)
To find the maximum spanning tree, start with an arbitrary vertex, a vertex set $V$ containing only this vertex, and an empty edge set $E$.
Choose an edge with the maximum weight from a vertex $v \in V$ to a vertex $v' \notin V$; add $v'$ to $V$ and add the edge from $v$ to $v'$ to $E$.
Repeat this process until all vertices are in $V$. The tree $(V, E)$ is the maximum spanning tree.
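The steps above can be sketched directly in code (my own small example graph; assumes the graph is connected):

```python
def max_spanning_tree(weights, vertices):
    """Prim's algorithm for a MAXIMUM spanning tree.
    weights maps frozenset({u, v}) -> edge weight."""
    start = next(iter(vertices))   # arbitrary starting vertex
    V, E = {start}, []
    while len(V) < len(vertices):
        # Among edges with exactly one endpoint inside the tree,
        # pick the one with maximum weight.
        candidates = [(w, e) for e, w in weights.items() if len(e & V) == 1]
        w, e = max(candidates, key=lambda t: t[0])
        E.append((e, w))
        V |= e
    return E

vertices = {"a", "b", "c", "d"}
weights = {frozenset({"a", "b"}): 3, frozenset({"a", "c"}): 1,
           frozenset({"b", "c"}): 2, frozenset({"c", "d"}): 4}
tree = max_spanning_tree(weights, vertices)
print(sum(w for _, w in tree))  # edges of weight 4, 3, 2 are kept: total 9
```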
Aside: Prim's Algorithm Diagram (Discussion)
Classification Problem (Discussion)
Bayesian networks do not have a clear separation between the label $Y$ and the features $X_1, X_2, \ldots, X_m$.
The Bayesian network with a tree structure in which $Y$ is the root and $X_1, X_2, \ldots, X_m$ are the leaves is called the Naive Bayes classifier.
Bayes rule is used to compute $P\{Y = y \mid X = x\}$, and the prediction $\hat{y}$ is the $y$ that maximizes this conditional probability:

$\hat{y}_i = \arg\max_y P\{Y = y \mid X = x_i\}$
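A minimal prediction sketch with two binary features and hand-picked CPTs (all numbers are my own toy values): since $P\{y \mid x\} \propto P\{y\} \prod_j P\{x_j \mid y\}$, the argmax can be taken over the unnormalized products.

```python
p_y = {0: 0.5, 1: 0.5}
# p_x_given_y[j][(x, y)] = P{X_j = x | Y = y}
p_x_given_y = [
    {(1, 0): 0.2, (0, 0): 0.8, (1, 1): 0.9, (0, 1): 0.1},
    {(1, 0): 0.4, (0, 0): 0.6, (1, 1): 0.7, (0, 1): 0.3},
]

def predict(x):
    def score(y):
        s = p_y[y]
        for j, xj in enumerate(x):
            s *= p_x_given_y[j][(xj, y)]
        return s
    return max(p_y, key=score)

print(predict((1, 1)))  # class 1: 0.5*0.9*0.7 beats 0.5*0.2*0.4
```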
Naive Bayes Diagram (Discussion)
Multinomial Naive Bayes (Discussion)
The implicit assumption for using the counts as the maximum likelihood estimate is that the distribution of $X_j \mid Y = y$, or in general $X_j \mid P(X_j) = p(X_j)$, is multinomial:

$P\{X_j = x \mid Y = y\} = p_x, \quad p_x = \dfrac{c_{x,y}}{c_y}$
Gaussian Naive Bayes (Discussion)
If the features are not categorical, continuous distributions can be estimated using MLE as the conditional distribution.
Gaussian Naive Bayes is used if $X_j \mid Y = y$ is assumed to have the normal distribution:

$\lim_{\varepsilon \to 0} \dfrac{1}{\varepsilon} P\{x < X_j \leq x + \varepsilon \mid Y = y\} = \dfrac{1}{\sqrt{2\pi}\, \sigma_y^{(j)}} \exp\left(-\dfrac{\left(x - \mu_y^{(j)}\right)^2}{2 \left(\sigma_y^{(j)}\right)^2}\right)$
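The right-hand side is the Gaussian density; as a quick sanity check (function name is my own):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(gaussian_pdf(0.0, 0.0, 1.0))  # 1 / sqrt(2*pi), about 0.3989
```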
Gaussian Naive Bayes Training (Discussion)
Training involves estimating $\mu_y^{(j)}$ and $\sigma_y^{(j)}$, since they completely determine the distribution of $X_j \mid Y = y$.
The maximum likelihood estimates of $\mu_y^{(j)}$ and $\left(\sigma_y^{(j)}\right)^2$ are the sample mean and variance of feature $j$:

$\mu_y^{(j)} = \dfrac{1}{n_y} \sum_{i=1}^{n} x_{ij} \mathbb{1}_{\{y_i = y\}}, \quad n_y = \sum_{i=1}^{n} \mathbb{1}_{\{y_i = y\}}$

$\left(\sigma_y^{(j)}\right)^2 = \dfrac{1}{n_y} \sum_{i=1}^{n} \left(x_{ij} - \mu_y^{(j)}\right)^2 \mathbb{1}_{\{y_i = y\}}$

Sometimes the unbiased estimate is used instead:

$\left(\sigma_y^{(j)}\right)^2 = \dfrac{1}{n_y - 1} \sum_{i=1}^{n} \left(x_{ij} - \mu_y^{(j)}\right)^2 \mathbb{1}_{\{y_i = y\}}$
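A sketch of these estimates for one feature on made-up data (all numbers are my own; the MLE $1/n_y$ variance is used):

```python
# Each pair is (x_ij, y_i): one continuous feature with its class label.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (10.0, 1), (12.0, 1)]

def fit(data, y):
    """Per-class sample mean and MLE variance of the feature."""
    xs = [x for x, yi in data if yi == y]
    n_y = len(xs)
    mu = sum(xs) / n_y
    var = sum((x - mu) ** 2 for x in xs) / n_y  # divide by n_y - 1 for the unbiased version
    return mu, var

print(fit(data, 0))  # mean 2.0, variance 2/3
```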
Gaussian Naive Bayes Diagram (Discussion)
Tree Augmented Network Algorithm (Discussion)
It is also possible to create a Bayesian network with all features $X_1, X_2, \ldots, X_m$ connected to $Y$ (Naive Bayes edges), while the features themselves form a network, usually a tree (MST edges).
Information gain is replaced by conditional information gain (conditional on $Y$) when finding the maximum spanning tree.
This algorithm is called TAN: Tree Augmented Network.
Tree Augmented Network Algorithm Diagram (Discussion)