Lecture Slides for Introduction to Machine Learning 2e
ETHEM ALPAYDIN © The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Probability and Inference
Result of tossing a coin is in {Heads, Tails}
Random variable X ∈ {1, 0}
Bernoulli distribution: P(X = 1) = p_o, P(X = 0) = 1 − p_o
Sample: X = {x^t}, t = 1, ..., N
Estimation: p̂_o = #{Heads} / #{Tosses} = Σ_t x^t / N
Prediction of next toss: Heads if p̂_o > ½, Tails otherwise
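A minimal sketch of this estimate and prediction rule in Python (the toss data are made up for illustration):

```python
import numpy as np

def estimate_p(tosses):
    """Maximum likelihood estimate of P(Heads): the fraction of 1s in the sample."""
    return np.asarray(tosses).mean()

def predict_next(p_hat):
    """Predict Heads if the estimated probability exceeds 1/2, Tails otherwise."""
    return "Heads" if p_hat > 0.5 else "Tails"

# Hypothetical sample of N = 10 tosses (1 = Heads, 0 = Tails)
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
p_hat = estimate_p(sample)
print(p_hat, predict_next(p_hat))   # 0.7 Heads
```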
Classification
Credit scoring: inputs are income (x1) and savings (x2); output is low-risk vs. high-risk.
Input: x = [x1, x2]^T, Output: C ∈ {0, 1}
Prediction:
choose C = 1 if P(C = 1 | x1, x2) > 0.5, and C = 0 otherwise
or equivalently
choose C = 1 if P(C = 1 | x1, x2) > P(C = 0 | x1, x2), and C = 0 otherwise
Bayes’ Rule
P(C | x) = p(x | C) P(C) / p(x)
posterior = likelihood × prior / evidence
Likelihood p(x | C): conditional probability
Prior P(C)
Evidence p(x): marginal probability
P(C = 0 | x) + P(C = 1 | x) = 1
p(x) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
P(C = 0) + P(C = 1) = 1
Bayes’ Rule: K>2 Classes
P(C_i | x) = p(x | C_i) P(C_i) / p(x) = p(x | C_i) P(C_i) / Σ_{k=1}^{K} p(x | C_k) P(C_k)
P(C_i) ≥ 0 and Σ_{i=1}^{K} P(C_i) = 1
choose C_i if P(C_i | x) = max_k P(C_k | x)
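A small sketch of this rule, assuming the class-conditional densities p(x|C_i) and priors P(C_i) are known; the Gaussian densities, means, and prior values below are illustrative assumptions, not from the slides:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D problem with K = 3 classes: Gaussian class-conditional
# densities p(x|C_i) and priors P(C_i) that sum to 1 (all values assumed).
means, stds = [0.0, 2.0, 5.0], [1.0, 1.0, 2.0]
priors = np.array([0.5, 0.3, 0.2])

def posteriors(x):
    """P(C_i|x) = p(x|C_i) P(C_i) / sum_k p(x|C_k) P(C_k)."""
    likelihoods = np.array([norm.pdf(x, m, s) for m, s in zip(means, stds)])
    joint = likelihoods * priors        # p(x|C_i) P(C_i)
    return joint / joint.sum()          # normalize by the evidence p(x)

def bayes_classify(x):
    """Choose the class with the highest posterior probability."""
    return int(np.argmax(posteriors(x)))

print(posteriors(1.0), bayes_classify(1.0))
```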
Losses and Risks
Action α_i: the decision to assign the input to class C_i
Loss of α_i when the input actually belongs to C_k: λ_ik
Expected risk (Duda and Hart, 1973) for taking action α_i:
R(α_i | x) = Σ_{k=1}^{K} λ_ik P(C_k | x)
choose α_i if R(α_i | x) = min_k R(α_k | x)
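A minimal sketch of the minimum-risk decision, assuming the posteriors P(C_k|x) are already available; the loss matrix and posterior values below are made up for illustration:

```python
import numpy as np

# Hypothetical loss matrix: loss[i, k] = lambda_ik, the loss of taking
# action alpha_i when the true class is C_k.
loss = np.array([[0.0, 5.0],
                 [1.0, 0.0]])

def expected_risks(posterior):
    """R(alpha_i|x) = sum_k lambda_ik P(C_k|x), for every action alpha_i."""
    return loss @ posterior

def min_risk_action(posterior):
    """Choose the action with minimum expected risk."""
    return int(np.argmin(expected_risks(posterior)))

posterior = np.array([0.3, 0.7])   # example posteriors P(C_1|x), P(C_2|x)
print(expected_risks(posterior), min_risk_action(posterior))
```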
Losses and Risks: 0/1 Loss
In the special case of 0/1 loss:
λ_ik = 0 if i = k, and λ_ik = 1 if i ≠ k
then
R(α_i | x) = Σ_{k=1}^{K} λ_ik P(C_k | x) = Σ_{k ≠ i} P(C_k | x) = 1 − P(C_i | x)
For minimum risk, choose the most probable class.
Losses and Risks: Reject
Define an additional action of reject, α_{K+1}:
λ_ik = 0 if i = k
λ_ik = λ if i = K+1
λ_ik = 1 otherwise,   with 0 < λ < 1
R(α_{K+1} | x) = Σ_{k=1}^{K} λ P(C_k | x) = λ
R(α_i | x) = Σ_{k ≠ i} P(C_k | x) = 1 − P(C_i | x)
choose C_i if P(C_i | x) > P(C_k | x) for all k ≠ i and P(C_i | x) > 1 − λ
reject otherwise
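A sketch of the decision rule with the reject option, assuming 0 < λ < 1 and posteriors that sum to 1; the values below are illustrative:

```python
import numpy as np

def decide_with_reject(posterior, lam):
    """Return the index of the most probable class if its posterior exceeds
    1 - lambda; otherwise reject (0 < lambda < 1)."""
    i = int(np.argmax(posterior))
    return i if posterior[i] > 1.0 - lam else "reject"

posterior = np.array([0.45, 0.40, 0.15])        # example P(C_i|x)
print(decide_with_reject(posterior, lam=0.2))   # 0.45 < 0.8  -> 'reject'
print(decide_with_reject(posterior, lam=0.6))   # 0.45 > 0.4  -> class 0
```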
Discriminant Functions
Set g_i(x), i = 1, ..., K, and choose C_i if g_i(x) = max_k g_k(x)
K decision regions R_1, ..., R_K, where R_i = {x | g_i(x) = max_k g_k(x)}
Bayes' classifier: g_i(x) = −R(α_i | x)
Use 0/1 loss function: g_i(x) = P(C_i | x)
Ignoring the term p(x): g_i(x) = p(x | C_i) P(C_i)
K=2 Classes
For K = 2:
g(x) = g_1(x) − g_2(x)
choose C_1 if g(x) > 0, C_2 otherwise
Log odds: log [P(C_1 | x) / P(C_2 | x)]
If P(C_1 | x) > P(C_2 | x), then log [P(C_1 | x) / P(C_2 | x)] > 0
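A tiny sketch of the two-class rule via the log odds; the posterior value below is an illustrative assumption:

```python
import math

def log_odds(p_c1_given_x):
    """g(x) = log [P(C_1|x) / P(C_2|x)] with P(C_2|x) = 1 - P(C_1|x)."""
    return math.log(p_c1_given_x / (1.0 - p_c1_given_x))

g = log_odds(0.8)                  # log(0.8 / 0.2) > 0
print("C1" if g > 0 else "C2")     # -> C1
```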
Utility Theory
Certain features may be costly to observe. Utility theory assesses the value of the information that additional features may provide.
Probability of state S_k given evidence x: P(S_k | x)
Utility of action α_i when the state is S_k: U_ik, a measure of how good it is to take action α_i when the state is S_k
Expected utility: EU(α_i | x) = Σ_k U_ik P(S_k | x)
choose α_i if EU(α_i | x) = max_j EU(α_j | x)
For example, in classification, decisions correspond to choosing one of the classes, and maximizing expected utility is equivalent to minimizing expected risk.
Association Rules
Association rule: X → Y
People who buy/click/visit/enjoy X are also likely to buy/click/visit/enjoy Y.
A rule implies association, not necessarily causation.
Association Measures
Association rule: X → Y
Support(X → Y): the statistical significance of the rule
  Support(X → Y) = P(X, Y) = #{customers who bought X and Y} / #{customers}
Confidence(X → Y): the conditional probability
  Confidence(X → Y) = P(Y | X) = P(X, Y) / P(X) = #{customers who bought X and Y} / #{customers who bought X}
Lift(X → Y): the interest of the association rule
  Lift(X → Y) = P(X, Y) / (P(X) P(Y)) = P(Y | X) / P(Y)
If X and Y are independent, we expect the lift to be close to 1. A lift greater than 1 means that X makes Y more likely; a lift less than 1 means that having X makes Y less likely.
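A small sketch computing these three measures from a list of transactions; the basket data are made up for illustration:

```python
# Hypothetical market-basket data: each transaction is a set of items bought.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
    {"milk", "bread"},
]

def support(X, Y=frozenset()):
    """P(X, Y): fraction of transactions containing every item in X and Y."""
    both = set(X) | set(Y)
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """P(Y | X) = P(X, Y) / P(X)."""
    return support(X, Y) / support(X)

def lift(X, Y):
    """P(X, Y) / (P(X) P(Y)); close to 1 when X and Y are independent."""
    return support(X, Y) / (support(X) * support(Y))

X, Y = {"milk"}, {"bread"}
print(support(X, Y), confidence(X, Y), lift(X, Y))   # 0.6 0.75 0.9375
```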
Apriori Algorithm (Agrawal et al., 1996)
For example, {X, Y, Z} is a 3-item set, and we may look for a rule such as X, Y → Z.
We want all such rules that have high enough support and confidence.
Since a sales database is generally very large, an efficient algorithm (the Apriori algorithm) is needed to find these rules with a small number of passes over the database.
Apriori Algorithm
1. Find frequent itemsets, i.e., those that have enough support.
If {X, Y, Z} is frequent, then {X, Y}, {X, Z}, and {Y, Z} must also be frequent; if {X, Y} is not frequent, none of its supersets can be frequent.
2. Convert the frequent itemsets to rules with enough confidence.
Once we find the frequent k-item sets, we convert them to rules such as X, Y → Z, ... and X → Y, Z, ... For all possible single consequents, we check whether the rule has enough confidence and remove it if it does not.
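A compact sketch of the frequent-itemset phase under these ideas (not a full Apriori implementation; the transaction list and the minimum-support threshold are illustrative assumptions):

```python
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 0.4   # assumed threshold

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(items):
    """Grow frequent itemsets level by level, pruning any candidate whose
    (k-1)-subsets are not all frequent (downward closure)."""
    frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
    all_frequent, k = list(frequent), 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = [c for c in candidates if support(c) >= min_support]
        all_frequent += frequent
        k += 1
    return all_frequent

print([set(s) for s in apriori(set().union(*transactions))])
```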
Exercise
1. In a two-class, two-action problem, if the loss function is λ_11 = λ_22 = 0, λ_12 = 10, and λ_21 = 1, write the optimal decision rule.
2. Show that as we move an item from the antecedent to the consequent, confidence can never increase:
confidence(A, B, C → D) ≥ confidence(A, B → C, D)
Bayesian Networks
Bayesian networks are also known as graphical models or probabilistic networks.
Nodes are hypotheses (random variables), and the probability corresponds to our belief in the truth of the hypothesis.
Arcs are direct influences between hypotheses.
The structure is represented as a directed acyclic graph (DAG).
The parameters are the conditional probabilities on the arcs.
(Pearl, 1988, 2000; Jensen, 1996; Lauritzen, 1996)
Causes and Bayes’ Rule
P(R) = 0.4, P(~R) = 0.6 (causal direction: rain R → wet grass W)
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
P(R | W) = P(W | R) P(R) / P(W)
         = P(W | R) P(R) / [P(W | R) P(R) + P(W | ~R) P(~R)]
         = 0.9 × 0.4 / (0.9 × 0.4 + 0.2 × 0.6) = 0.75
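A direct translation of this two-node diagnostic computation into Python, using the same numbers:

```python
# Two-node network Rain -> WetGrass, with the numbers used above.
P_R = 0.4             # P(rain)
P_W_given_R = 0.9     # P(wet | rain)
P_W_given_notR = 0.2  # P(wet | no rain)

# Evidence: P(W) = P(W|R) P(R) + P(W|~R) P(~R)
P_W = P_W_given_R * P_R + P_W_given_notR * (1 - P_R)

# Diagnostic inference by Bayes' rule: P(R|W)
P_R_given_W = P_W_given_R * P_R / P_W
print(round(P_R_given_W, 2))   # 0.75
```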
Conditional Independence
X and Y are independent if
P(X, Y) = P(X) P(Y)
X and Y are conditionally independent given Z if
P(X, Y | Z) = P(X | Z) P(Y | Z)
or, equivalently, P(X | Y, Z) = P(X | Z)
Three canonical cases for conditional independence:
Head-to-tail connection
Tail-to-tail connection
Head-to-head connection
Case 1: Head-to-Tail Connection
P(X, Y, Z) = P(X) P(Y | X) P(Z | Y)
X and Z are independent given Y
Example (chain C → R → W):
P(R) = P(R | C) P(C) + P(R | ~C) P(~C) = 0.8 × 0.4 + 0.1 × 0.6 = 0.38
P(W) = P(W | R) P(R) + P(W | ~R) P(~R) = 0.9 × 0.38 + 0.2 × 0.62 = 0.47
P(W | C) = P(W | R) P(R | C) + P(W | ~R) P(~R | C) = 0.9 × 0.8 + 0.2 × 0.2 = 0.76
P(C | W) = P(W | C) P(C) / P(W) = 0.76 × 0.4 / 0.47 = 0.65
Case 2: Tail-to-Tail Connection
P(X, Y, Z) = P(X) P(Y | X) P(Z | X)
Y and Z are independent given X
Example (C as common cause of S and R):
P(C | R) = P(R | C) P(C) / P(R) = P(R | C) P(C) / [P(R | C) P(C) + P(R | ~C) P(~C)] = 0.8 × 0.5 / (0.8 × 0.5 + 0.1 × 0.5) = 0.89
P(R | S) = P(R | C) P(C | S) + P(R | ~C) P(~C | S) = 0.22 (pages 391-392)
So P(R | S) = 0.22 < P(R) = 0.45.
Case 3: Head-to-Head Connection
P(X, Y, Z) = P(X) P(Y) P(Z | X, Y)
X and Y are independent
P(W) = P(W | R, S) P(R, S) + P(W | ~R, S) P(~R, S) + P(W | R, ~S) P(R, ~S) + P(W | ~R, ~S) P(~R, ~S)
     = P(W | R, S) P(R) P(S) + P(W | ~R, S) P(~R) P(S) + P(W | R, ~S) P(R) P(~S) + P(W | ~R, ~S) P(~R) P(~S)
     = 0.52
Causal Inference
Causal inference: if the sprinkler is on, what is the probability that the grass is wet?
P(W | S) = P(W | R, S) P(R | S) + P(W | ~R, S) P(~R | S)
         = P(W | R, S) P(R) + P(W | ~R, S) P(~R)
         = 0.95 × 0.4 + 0.9 × 0.6 = 0.92
(R and S are independent, so P(R | S) = P(R).)
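A sketch of these two marginalizations over the head-to-head sprinkler/rain network; the conditional probabilities below follow the numbers used in these slides (P(W | ~R, ~S) = 0.1 is the value implied by P(W) = 0.52):

```python
# Head-to-head node W (wet grass) with independent parents R (rain), S (sprinkler).
P_R, P_S = 0.4, 0.2
P_W = {  # P(W=1 | R=r, S=s)
    (1, 1): 0.95, (1, 0): 0.90,
    (0, 1): 0.90, (0, 0): 0.10,
}

# P(W) = sum over r, s of P(W|r,s) P(r) P(s), since R and S are independent
p_w = sum(P_W[(r, s)]
          * (P_R if r else 1 - P_R)
          * (P_S if s else 1 - P_S)
          for r in (0, 1) for s in (0, 1))

# Causal inference: P(W|S=1) = sum over r of P(W|r, S=1) P(r)
p_w_given_s = sum(P_W[(r, 1)] * (P_R if r else 1 - P_R) for r in (0, 1))

print(round(p_w, 2), round(p_w_given_s, 2))   # 0.52 0.92
```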
Diagnostic Inference
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?
P(S | W) = 0.35 > 0.2 = P(S), where P(W) = 0.52
P(S | R, W) = 0.21 < 0.35
Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
Diagnostic Inference (cont.)
P(W) = P(W | R, S) P(R, S) + P(W | ~R, S) P(~R, S) + P(W | R, ~S) P(R, ~S) + P(W | ~R, ~S) P(~R, ~S) = 0.52
P(S | W) = P(W | S) P(S) / P(W) = 0.92 × 0.2 / 0.52 = 0.35
P(S | R, W) = P(W | R, S) P(R, S) / P(R, W)
            = P(W | R, S) P(S | R) P(R) / [P(W | R) P(R)]
            = P(W | R, S) P(S | R) / P(W | R)
            = 0.95 × 0.2 / 0.9 = 0.21
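A short numeric check of the explaining-away effect by direct enumeration over the same head-to-head network (same assumed conditional probability table as above):

```python
# Same head-to-head network: W depends on R (rain) and S (sprinkler).
P_R, P_S = 0.4, 0.2
P_W = {(1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.10}

def joint(r, s, w):
    """P(R=r, S=s, W=w) = P(r) P(s) P(w | r, s)."""
    pr = P_R if r else 1 - P_R
    ps = P_S if s else 1 - P_S
    pw = P_W[(r, s)] if w else 1 - P_W[(r, s)]
    return pr * ps * pw

p_w = sum(joint(r, s, 1) for r in (0, 1) for s in (0, 1))            # 0.52
p_s_given_w = sum(joint(r, 1, 1) for r in (0, 1)) / p_w              # 0.35
p_s_given_rw = joint(1, 1, 1) / sum(joint(1, s, 1) for s in (0, 1))  # 0.21

print(round(p_s_given_w, 2), round(p_s_given_rw, 2))
```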
Exercise
P. 417 (3): Calculate P(R|W), P(R|W,S), and P(R|W,~S).
Bayesian Networks: Causes
Causal inference:
P(W | C) = P(W | R, S, C) P(R, S | C) + P(W | ~R, S, C) P(~R, S | C) + P(W | R, ~S, C) P(R, ~S | C) + P(W | ~R, ~S, C) P(~R, ~S | C) = 0.76
using the fact that P(R, S | C) = P(R | C) P(S | C)
Diagnostic: P(C | W) = ? (Exercise)
Causal Inference (cont.)
P(W | C) = Σ_{R,S} P(W, R, S | C)
         = P(W, R, S | C) + P(W, ~R, S | C) + P(W, R, ~S | C) + P(W, ~R, ~S | C)
         = P(W | R, S, C) P(R, S | C) + P(W | ~R, S, C) P(~R, S | C) + P(W | R, ~S, C) P(R, ~S | C) + P(W | ~R, ~S, C) P(~R, ~S | C)
         = P(W | R, S) P(R | C) P(S | C) + P(W | ~R, S) P(~R | C) P(S | C) + P(W | R, ~S) P(R | C) P(~S | C) + P(W | ~R, ~S) P(~R | C) P(~S | C)
         = 0.95 × 0.8 × 0.1 + 0.90 × 0.2 × 0.1 + 0.90 × 0.8 × 0.9 + 0.10 × 0.2 × 0.9
         = 0.076 + 0.018 + 0.648 + 0.018 = 0.76
P(C | W) = P(W | C) P(C) / P(W) = ?
Bayesian Networks
P(C, S, R, W) = P(C) P(S | C) P(R | C) P(W | S, R)
In general, P(X_1, ..., X_d) = Π_{i=1}^{d} P(X_i | parents(X_i))
Belief propagation (Pearl, 1988): used for inference when the network is a tree
Junction trees (Lauritzen and Spiegelhalter, 1988): convert a given directed acyclic graph to a tree
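A compact sketch of this factorization for the Cloudy/Sprinkler/Rain/WetGrass network, with inference by brute-force enumeration. The conditional probability tables below (P(C) = 0.5, P(S|C) = 0.1, P(S|~C) = 0.5, P(R|C) = 0.8, P(R|~C) = 0.1, and the table for W) are my reading of the numbers used in these slides and should be checked against the figure in the book:

```python
from itertools import product

# Assumed conditional probability tables for C (cloudy), S (sprinkler),
# R (rain), W (wet grass), read off the numbers used in these slides.
P_C = 0.5
P_S_given_C = {1: 0.1, 0: 0.5}    # P(S=1 | C=c)
P_R_given_C = {1: 0.8, 0: 0.1}    # P(R=1 | C=c)
P_W_given_SR = {(1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.10}

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    pc = P_C if c else 1 - P_C
    ps = P_S_given_C[c] if s else 1 - P_S_given_C[c]
    pr = P_R_given_C[c] if r else 1 - P_R_given_C[c]
    pw = P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
    return pc * ps * pr * pw

def prob(query, given=None):
    """P(query | given) by summing the joint over all assignments."""
    given = given or {}
    num = den = 0.0
    for c, s, r, w in product((0, 1), repeat=4):
        x = {"C": c, "S": s, "R": r, "W": w}
        p = joint(c, s, r, w)
        if all(x[k] == v for k, v in given.items()):
            den += p
            if all(x[k] == v for k, v in query.items()):
                num += p
    return num / den

print(round(prob({"W": 1}, {"C": 1}), 2))   # 0.76, matching the slide
```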
Bayesian Networks: Classification
Diagnostic inference: P(C | x)
Bayes' rule inverts the arc:
P(C | x) = p(x | C) P(C) / p(x)