LINEAR ALGEBRA
Topics: matrix, vector, norms, products
• Vector:
 ‣ product: $\langle x^{(1)}, x^{(2)} \rangle = x^{(1)\top} x^{(2)}$
 ‣ norm (Euclidean): $\|x\|_2 = \sqrt{x^\top x}$
• Matrix:
 ‣ product: $(X^{(1)} X^{(2)})_{i,j} = \sum_k X^{(1)}_{i,k} X^{(2)}_{k,j}$
 ‣ norm (Frobenius): $\|X\|_F = \sqrt{\operatorname{trace}(X^\top X)}$
Review of fundamentals
Hugo Larochelle
Département d'informatique
Université de Sherbrooke
hugo.larochelle@usherbrooke.ca
August 30, 2012

Abstract
Math for my slides “Review of fundamentals”.
Linear algebra
• x = [x1, . . . , xd]> =
2
64x1.
.
.
xd
3
75
• < x
(1),x
(2)>= x
(1)>x
(2) =Pd
i=1 x(1)i x
(2)i
• ||x|| =p< x
>x > =
pPi x
2i
• X =
2
64X1,1 . . . X1,w
.
.
.
.
.
.
.
.
.
Xh,1 . . . Xh,w
3
75
• (X(1)X
(2))i,j = X
(1)i,· X
(2)·,j =
Pk X
(1)i,kX
(2)k,j
• Ii,j =
⇢1 si i = j
0 si i 6= j
• Xi,j = 0 si i 6= j
• Xi,j = 0 si i < j
• (X>)i,j = Xj,i
• trace(X) =P
i Xi,i
• X
�1X = I
• det (X) =Q
i Xi,i
1
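The identities above can be checked numerically. A minimal sketch using NumPy (the code is an illustration, not part of the original notes):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])

# Inner product: <x1, x2> = x1^T x2 = sum_i x1_i * x2_i
assert np.isclose(x1 @ x2, np.sum(x1 * x2))

# Euclidean norm: ||x||_2 = sqrt(x^T x)
assert np.isclose(np.linalg.norm(x1), np.sqrt(x1 @ x1))

# Matrix product: (AB)_{i,j} = sum_k A_{i,k} B_{k,j}
A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
assert np.isclose((A @ B)[1, 2], np.sum(A[1, :] * B[:, 2]))

# Frobenius norm: ||X||_F = sqrt(trace(X^T X)) = sqrt(sum_{i,j} X_{i,j}^2)
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A.T @ A)))

# Trace is invariant to cyclic permutations of a product
rng = np.random.default_rng(0)
C = rng.standard_normal((4, 2))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
```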
LINEAR ALGEBRA
Topics: special matrices
• Identity matrix: $I_{i,j} = 1$ if $i = j$, $0$ if $i \neq j$
• Diagonal matrix: $X_{i,j} = 0$ if $i \neq j$
• Lower triangular matrix: $X_{i,j} = 0$ if $i < j$
• Symmetric matrix: $X_{i,j} = X_{j,i}$ (i.e. $X^\top = X$)
• Square matrix: matrix with the same number of rows and columns
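These special matrices are easy to construct and test with NumPy; a small sketch (an illustration, not from the slides):

```python
import numpy as np

I = np.eye(3)                                     # identity: I_{i,j} = 1 iff i = j
D = np.diag([1.0, 2.0, 3.0])                      # diagonal: X_{i,j} = 0 for i != j
L = np.tril(np.arange(1.0, 10.0).reshape(3, 3))   # lower triangular: X_{i,j} = 0 for i < j
S = D + np.ones((3, 3))                           # symmetric: X^T = X

assert np.allclose(I @ D, D)               # multiplying by the identity changes nothing
assert np.allclose(np.triu(L, k=1), 0.0)   # all entries strictly above the diagonal are zero
assert np.allclose(S, S.T)                 # symmetry check
assert L.shape[0] == L.shape[1]            # square: same number of rows and columns
```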
LINEAR ALGEBRA
Topics: operations on matrices
• Trace of matrix: $\operatorname{trace}(X) = \sum_i X_{i,i}$
 ‣ trace of products: $\operatorname{trace}(X^{(1)} X^{(2)} X^{(3)}) = \operatorname{trace}(X^{(3)} X^{(1)} X^{(2)}) = \operatorname{trace}(X^{(2)} X^{(3)} X^{(1)})$
• Inverse of matrix: $X^{-1} X = X X^{-1} = I$
 ‣ doesn't exist if the determinant is 0
 ‣ inverse of product: $(X^{(1)} X^{(2)})^{-1} = {X^{(2)}}^{-1} {X^{(1)}}^{-1}$
• Transpose of matrix: $(X^\top)_{i,j} = X_{j,i}$
 ‣ transpose of product: $(X^{(1)} X^{(2)})^\top = X^{(2)\top} X^{(1)\top}$
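Each of these operation identities can be verified numerically; a sketch with NumPy (illustration only, using random matrices that are invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# trace of a product is invariant to cyclic permutation
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))

# inverse: X^{-1} X = X X^{-1} = I
assert np.allclose(np.linalg.inv(A) @ A, np.eye(3), atol=1e-8)

# inverse of a product: (AB)^{-1} = B^{-1} A^{-1}
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

# transpose of a product: (AB)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)
```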
LINEAR ALGEBRA
Topics: operations on matrices
• Determinant
 ‣ of a triangular matrix: $\det(X) = \prod_i X_{i,i}$
 ‣ of the transpose of a matrix: $\det(X^\top) = \det(X)$
 ‣ of the inverse of a matrix: $\det(X^{-1}) = \det(X)^{-1}$
 ‣ of a product of matrices: $\det(X^{(1)} X^{(2)}) = \det(X^{(1)}) \det(X^{(2)})$
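A quick numerical check of the determinant rules above (a sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
T = np.triu(A)  # an upper-triangular matrix

# det of a triangular matrix is the product of its diagonal entries
assert np.isclose(np.linalg.det(T), np.prod(np.diag(T)))

# det(X^T) = det(X)
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))

# det(X^{-1}) = det(X)^{-1}
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A))

# det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
```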
LINEAR ALGEBRA
Topics: properties of matrices
• Orthogonal matrix: $X^\top = X^{-1}$
• Positive definite matrix: $v^\top X v > 0 \;\; \forall v \in \mathbb{R}^n$
 ‣ if «$\geq$» instead of «$>$», then positive semi-definite
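A sketch of both properties in NumPy (illustration only): the Q factor of a QR decomposition is orthogonal, and $A^\top A + I$ is positive definite by construction.

```python
import numpy as np

rng = np.random.default_rng(2)

# Orthogonal matrix: X^T = X^{-1}
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
assert np.allclose(Q.T, np.linalg.inv(Q))
assert np.allclose(Q @ Q.T, np.eye(3), atol=1e-8)

# Positive definite: v^T X v > 0 for every nonzero v
A = rng.standard_normal((3, 3))
P = A.T @ A + np.eye(3)   # A^T A alone is only positive SEMI-definite (>= 0)
for _ in range(100):
    v = rng.standard_normal(3)
    assert v @ P @ v > 0.0
```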
LINEAR ALGEBRA
Topics: linear dependence, rank, range and nullspace
• Set of linearly dependent vectors $\{x^{(t)}\}$: $\exists\, w, t^*$ such that $x^{(t^*)} = \sum_{t \neq t^*} w_t x^{(t)}$
• Rank of a matrix: number of linearly independent columns
• Range of a matrix: $R(X) = \{x \in \mathbb{R}^n \mid \exists\, w \text{ such that } x = \sum_j w_j X_{\cdot,j}\}$
• Nullspace of a matrix: $\{x \in \mathbb{R}^n \mid x \notin R(X)\}$
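A sketch of linear dependence, rank, and range with NumPy (illustration only; `matrix_rank` computes the rank via the SVD):

```python
import numpy as np

x1 = np.array([1.0, 0.0, 1.0])
x2 = np.array([0.0, 1.0, 1.0])
x3 = 2.0 * x1 - x2            # linearly dependent on x1, x2 by construction

X = np.column_stack([x1, x2, x3])
assert np.linalg.matrix_rank(X) == 2   # only 2 linearly independent columns

# Any weighted combination of the columns lies in the range R(X):
w = np.array([0.5, -1.0, 2.0])
y = X @ w                     # y = sum_j w_j X_{.,j}
assert np.linalg.matrix_rank(np.column_stack([X, y])) == 2  # y adds no new direction
```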
LINEAR ALGEBRA
Topics: eigenvalues and eigenvectors of a matrix
• Eigenvalues and eigenvectors: $\{\lambda_i, u_i \mid X u_i = \lambda_i u_i \text{ and } u_i^\top u_j = 1_{i=j}\}$
• Properties
 ‣ can write $X = \sum_i \lambda_i u_i u_i^\top$
 ‣ determinant of any matrix: $\det(X) = \prod_i \lambda_i$
 ‣ positive definite if $\lambda_i > 0 \;\; \forall i$
 ‣ rank of the matrix is the number of non-zero eigenvalues
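These eigendecomposition properties can be checked on a symmetric positive definite matrix; a sketch with NumPy's `eigh` (illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
X = A.T @ A + np.eye(4)          # symmetric positive definite by construction

lam, U = np.linalg.eigh(X)       # X u_i = lam_i u_i, with orthonormal u_i

# Can write X = sum_i lam_i u_i u_i^T
recon = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(4))
assert np.allclose(recon, X)

# det(X) = prod_i lam_i
assert np.isclose(np.linalg.det(X), np.prod(lam))

# positive definite iff all lam_i > 0; rank = number of non-zero eigenvalues
assert np.all(lam > 0)
assert np.linalg.matrix_rank(X) == np.sum(~np.isclose(lam, 0.0))
```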
DIFFERENTIAL CALCULUS
Topics: derivative, partial derivative
• Derivative: $\frac{d}{dx} f(x) = \lim_{\Delta \to 0} \frac{f(x + \Delta) - f(x)}{\Delta}$
 ‣ direction and rate of increase of the function
• Partial derivative: $\frac{\partial}{\partial x} f(x, y) = \lim_{\Delta \to 0} \frac{f(x + \Delta, y) - f(x, y)}{\Delta}$
 ‣ direction and rate of increase for one variable, assuming the others are fixed
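The limit definitions above can be approximated by taking a small finite $\Delta$; a sketch (illustration only, with hypothetical helper names):

```python
import numpy as np

def derivative(f, x, delta=1e-6):
    """(f(x + delta) - f(x)) / delta: the forward-difference approximation."""
    return (f(x + delta) - f(x)) / delta

def partial_x(f, x, y, delta=1e-6):
    """Partial derivative in x: the other variable y is held fixed."""
    return (f(x + delta, y) - f(x, y)) / delta

# d/dx x^2 = 2x, so the derivative at x = 3 is 6
assert np.isclose(derivative(lambda x: x**2, 3.0), 6.0, atol=1e-4)

# partial of x*y in x at (2, 5) is y = 5
assert np.isclose(partial_x(lambda x, y: x * y, 2.0, 5.0), 5.0, atol=1e-4)
```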
DIFFERENTIAL CALCULUS
Topics: derivative, partial derivative
• Example of a function of two variables: $f(x, y) = \frac{x^2}{y}$
• Partial derivatives:
 ‣ $\frac{\partial f(x, y)}{\partial x} = \frac{2x}{y}$ (treat $y$ as a constant)
 ‣ $\frac{\partial f(x, y)}{\partial y} = -\frac{x^2}{y^2}$ (treat $x$ as a constant)
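The worked example can be confirmed by comparing the analytic partials against finite differences (a sketch, not from the slides):

```python
import numpy as np

f = lambda x, y: x**2 / y
df_dx = lambda x, y: 2 * x / y        # treat y as a constant
df_dy = lambda x, y: -x**2 / y**2     # treat x as a constant

x0, y0, d = 3.0, 2.0, 1e-6
# forward differences should match the analytic formulas
assert np.isclose(df_dx(x0, y0), (f(x0 + d, y0) - f(x0, y0)) / d, atol=1e-4)
assert np.isclose(df_dy(x0, y0), (f(x0, y0 + d) - f(x0, y0)) / d, atol=1e-4)
```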
DIFFERENTIAL CALCULUS
Topics: gradient
• Gradient: $\nabla_x f(x) = \left[ \frac{\partial}{\partial x_1} f(x), \ldots, \frac{\partial}{\partial x_d} f(x) \right]^\top = \begin{bmatrix} \frac{\partial}{\partial x_1} f(x) \\ \vdots \\ \frac{\partial}{\partial x_d} f(x) \end{bmatrix}$
• The gradient gives the direction (vector) with the highest rate of increase of the function
(figure: surface plot of $f(x, y)$ over the $x$-$y$ plane)
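A numerical gradient matching the definition above, built component by component (a sketch with a hypothetical helper name):

```python
import numpy as np

def numerical_gradient(f, x, delta=1e-6):
    """grad f(x) = [df/dx_1, ..., df/dx_d]^T via forward differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = delta                      # perturb only coordinate i
        grad[i] = (f(x + e) - f(x)) / delta
    return grad

# f(x) = x^T x has gradient 2x
x = np.array([1.0, -2.0, 0.5])
assert np.allclose(numerical_gradient(lambda v: v @ v, x), 2 * x, atol=1e-4)
```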
DIFFERENTIAL CALCULUS
Topics: Jacobian, Hessian
• Hessian: $\nabla^2_x f(x) = \begin{bmatrix} \frac{\partial^2}{\partial x_1^2} f(x) & \ldots & \frac{\partial^2}{\partial x_1 \partial x_d} f(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_d \partial x_1} f(x) & \ldots & \frac{\partial^2}{\partial x_d^2} f(x) \end{bmatrix}$
• If $f(x) = [f(x)_1, \ldots, f(x)_k]^\top$ is a vector, the Jacobian is: $\nabla_x f(x) = \begin{bmatrix} \frac{\partial}{\partial x_1} f(x)_1 & \ldots & \frac{\partial}{\partial x_d} f(x)_1 \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1} f(x)_k & \ldots & \frac{\partial}{\partial x_d} f(x)_k \end{bmatrix}$
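Both matrices can be approximated entry by entry; a sketch with hypothetical `hessian` and `jacobian` helpers (illustration only), checked on functions whose derivatives are known in closed form:

```python
import numpy as np

def hessian(f, x, d=1e-4):
    """H_{i,j} = d^2 f / (dx_i dx_j), via central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * d, np.eye(n)[j] * d
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * d * d)
    return H

def jacobian(f, x, d=1e-6):
    """J_{k,i} = d f(x)_k / dx_i: row k is the gradient of output k."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = d
        J[:, i] = (f(x + e) - fx) / d
    return J

# f(x) = x^T A x has Hessian A + A^T
A = np.array([[1.0, 2.0], [0.0, 3.0]])
x = np.array([0.5, -1.0])
assert np.allclose(hessian(lambda v: v @ A @ v, x), A + A.T, atol=1e-3)

# The linear map f(x) = Bx has Jacobian B
B = np.array([[1.0, -1.0], [2.0, 0.5], [0.0, 4.0]])
assert np.allclose(jacobian(lambda v: B @ v, x), B, atol=1e-4)
```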
DIFFERENTIAL CALCULUS
Topics: gradient for matrices

• If a scalar function takes a matrix as input: $\nabla_{\mathbf{X}} f(\mathbf{X}) = \begin{bmatrix} \frac{\partial}{\partial X_{1,1}} f(\mathbf{X}) & \dots & \frac{\partial}{\partial X_{1,m}} f(\mathbf{X}) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial X_{n,1}} f(\mathbf{X}) & \dots & \frac{\partial}{\partial X_{n,m}} f(\mathbf{X}) \end{bmatrix}$
• For functions that output vectors and take matrices as input, we organize the partial derivatives into 3D tensors
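The entrywise definition translates directly into code: perturb one entry of the matrix at a time. A sketch, using the squared Frobenius norm as an illustrative test function (its gradient is $2\mathbf{X}$):

```python
# Gradient of a scalar function of a matrix: perturb one entry
# X[i][j] at a time, matching the entrywise definition above.

def matrix_gradient(f, X, d=1e-6):
    n, m = len(X), len(X[0])
    G = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            Xp = [row[:] for row in X]   # copy, then perturb entry (i, j)
            Xp[i][j] += d
            G[i][j] = (f(Xp) - f(X)) / d
    return G

frob2 = lambda X: sum(v * v for row in X for v in row)  # squared Frobenius norm
X = [[1.0, 2.0], [3.0, 4.0]]
G = matrix_gradient(frob2, X)   # entries close to 2 * X[i][j]
```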
PROBABILITY
Topics: probability space

• Probability space: triplet $(\Omega, \mathcal{F}, P)$
  ‣ $\Omega$ is the space of possible outcomes
  ‣ $\mathcal{F}$ is the space of possible events
  ‣ $P$ is a probability measure, mapping an outcome to its probability in $[0, 1]$
  ‣ example: throwing a die
    - $\Omega = \{1, 2, 3, 4, 5, 6\}$
    - $e = \{1, 5\} \in \mathcal{F}$ (i.e. die is either 1 or 5)
    - $P(\{1, 5\}) = \frac{2}{6}$
• Properties:
  ‣ $P(\{\omega\}) \geq 0 \;\; \forall \omega \in \Omega$
  ‣ $\sum_{\omega \in \Omega} P(\{\omega\}) = 1$
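The die example can be written out directly; a small sketch with exact rational arithmetic (the uniform measure is the slides' own example):

```python
from fractions import Fraction

# Probability space for a fair die: outcomes Omega, events as subsets
# of Omega, and a measure P that sums outcome probabilities.
Omega = {1, 2, 3, 4, 5, 6}
p_outcome = {w: Fraction(1, 6) for w in Omega}

def P(event):
    # probability of an event (a subset of Omega)
    return sum(p_outcome[w] for w in event)

print(P({1, 5}))                 # → 1/3, i.e. 2/6
print(sum(p_outcome.values()))   # → 1, the normalization property
```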
PROBABILITY
Topics: random variable

• Random variable: a function on outcomes
• Examples:
  ‣ $X$ is the value of the outcome
  ‣ $O$ is 1 if the outcome is 1, 3 or 5, otherwise it's 0
  ‣ $S$ is 1 if the outcome is smaller than 4, otherwise it's 0
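Since random variables are just functions on outcomes, the three die variables can be defined literally as functions, and a joint distribution built by summing outcome probabilities. A sketch (assigning $X$, $O$, $S$ to the three example descriptions, which matches the probabilities quoted on later slides):

```python
from fractions import Fraction

# Random variables as functions on die outcomes: X is the value,
# O flags outcomes in {1, 3, 5}, S flags outcomes smaller than 4.
X = lambda w: w
O = lambda w: 1 if w in (1, 3, 5) else 0
S = lambda w: 1 if w < 4 else 0

# Joint distribution p(x, o, s), obtained by summing over outcomes.
joint = {}
for w in range(1, 7):
    key = (X(w), O(w), S(w))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 6)

print(joint.get((1, 1, 0), Fraction(0)))  # → 0: outcome 1 is both odd and small
```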
PROBABILITY
Topics: distributions (joint, marginal, conditional)

• Joint distribution: $p(X=x, O=o, S=s)$ ($p(x, o, s)$ for short)
  ‣ the probability of a complete assignment of all random variables
  ‣ example: $p(X=1, O=1, S=0) = 0$
• Marginal distribution: $p(o, s) = \sum_x p(x, o, s)$
  ‣ the probability of a partial assignment
  ‣ example: $p(O=1, S=0) = \frac{1}{6}$
• Conditional distribution: $p(S=s \mid O=o)$
  ‣ the probability of some variables, assuming an assignment of other variables
  ‣ example: $p(S=1 \mid O=1) = \frac{2}{3}$
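Both example values can be recomputed from the die by marginalizing and conditioning explicitly — a sketch with exact fractions:

```python
from fractions import Fraction

# Marginalization and conditioning for the die example:
# O flags an odd outcome, S flags an outcome smaller than 4.
joint = {}  # the marginal p(o, s), summing the outcome value out directly
for w in range(1, 7):
    o, s = (1 if w % 2 == 1 else 0), (1 if w < 4 else 0)
    joint[(o, s)] = joint.get((o, s), Fraction(0)) + Fraction(1, 6)

p_O1_S0 = joint.get((1, 0), Fraction(0))               # marginal p(O=1, S=0)
p_O1 = sum(v for (o, s), v in joint.items() if o == 1)
p_S1_given_O1 = joint.get((1, 1), Fraction(0)) / p_O1  # conditional

print(p_O1_S0, p_S1_given_O1)   # → 1/6 2/3, as on the slide
```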
PROBABILITY
Topics: probability chain rule, Bayes rule

• Probability chain rule: $p(s, o) = p(s|o)\,p(o) = p(o|s)\,p(s)$
  ‣ in general: $p(\mathbf{x}) = \prod_i p(x_i \mid x_1, \dots, x_{i-1})$
• Bayes rule: $p(O=o \mid S=s) = \frac{p(S=s \mid O=o)\, p(O=o)}{\sum_{o'} p(S=s \mid O=o')\, p(O=o')}$
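Bayes rule is just this formula evaluated term by term. A sketch for the die variables ($O$ = odd outcome, $S$ = outcome smaller than 4), with the conditional table filled in from the die:

```python
from fractions import Fraction

# Bayes rule: p(O=o | S=s) from p(S|O) and p(O).
p_O = {1: Fraction(1, 2), 0: Fraction(1, 2)}
p_S1_given_O = {1: Fraction(2, 3),   # {1,3} of the odd outcomes {1,3,5}
                0: Fraction(1, 3)}   # {2} of the even outcomes {2,4,6}

# denominator: sum over o' of p(S=1|O=o') p(O=o')
evidence = sum(p_S1_given_O[o] * p_O[o] for o in (0, 1))
p_O1_given_S1 = p_S1_given_O[1] * p_O[1] / evidence

print(p_O1_given_S1)   # → 2/3: of the small outcomes {1,2,3}, two are odd
```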
PROBABILITY
Topics: independence between variables

• Independence: variables $X_1$ and $X_2$ are independent if
  $p(x_1, x_2) = p(x_1)\,p(x_2)$ or $p(x_1 | x_2) = p(x_1)$ or $p(x_2 | x_1) = p(x_2)$
• Conditional independence: variables $X_1$ and $X_2$ are independent given $X_3$ if
  $p(x_1, x_2 | x_3) = p(x_1 | x_3)\,p(x_2 | x_3)$ or $p(x_1 | x_2, x_3) = p(x_1 | x_3)$ or $p(x_2 | x_1, x_3) = p(x_2 | x_3)$
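The factorization test can be checked exhaustively for discrete variables. A sketch for the die variables $O$ (odd) and $S$ (small), which turn out not to be independent:

```python
from fractions import Fraction
from itertools import product

# Check p(o, s) == p(o) p(s) for all (o, s): true iff O and S are independent.
joint = {}
for w in range(1, 7):
    o, s = w % 2, (1 if w < 4 else 0)
    joint[(o, s)] = joint.get((o, s), Fraction(0)) + Fraction(1, 6)

p_o = {o: sum(v for (oo, ss), v in joint.items() if oo == o) for o in (0, 1)}
p_s = {s: sum(v for (oo, ss), v in joint.items() if ss == s) for s in (0, 1)}

independent = all(joint[(o, s)] == p_o[o] * p_s[s]
                  for o, s in product((0, 1), repeat=2))
print(independent)   # → False: e.g. p(O=1, S=1) = 1/3 but p(O=1) p(S=1) = 1/4
```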
• $p(x_1, x_2) = p(x_1)\,p(x_2)$
• $p(x_1 | x_2) = p(x_1)$
• $p(x_2 | x_1) = p(x_2)$
• $p(x_1, x_2 | x_3) = p(x_1 | x_3)\,p(x_2 | x_3)$
• $p(x_1 | x_2, x_3) = p(x_1 | x_3)$
• $p(x_2 | x_1, x_3) = p(x_2 | x_3)$
• $\mathrm{E}[X] = \sum_x x\, p(X=x)$
• $\mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y]$
• $\mathrm{E}[XY] = \mathrm{E}[X]\,\mathrm{E}[Y]$ (if independent)
• $\mathrm{E}[f(X)] = \sum_x f(x)\, p(X=x)$
• $\mathrm{Var}[X] = \sum_x (x - \mathrm{E}[X])^2\, p(X=x)$
• $\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2$
• $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$ (if independent)
• $\mathrm{Cov}(X_1, X_2) = \mathrm{E}[(X_1 - \mathrm{E}[X_1])(X_2 - \mathrm{E}[X_2])] = \sum_{x_1} \sum_{x_2} (x_1 - \mathrm{E}[X_1])(x_2 - \mathrm{E}[X_2])\, p(x_1, x_2)$
• $\mathrm{Cov}(X_1, X_2) = 0$ (if independent)
• $\mathrm{Var}(X) = \mathrm{Cov}(X, X)$
• $\mathrm{Cov}(\mathbf{X}) = \begin{bmatrix} \mathrm{Cov}(X_1, X_1) & \dots & \mathrm{Cov}(X_1, X_d) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_d, X_1) & \dots & \mathrm{Cov}(X_d, X_d) \end{bmatrix}$
• $P(X \in A) = \int_{x \in A} p(x)\, dx$
• $P(X = x)$
• Bernoulli: $X \in \{0, 1\}$, $p(X=1) = \mu$, $p(X=0) = 1 - \mu$, $\mathrm{E}[X] = \mu$, $\mathrm{Var}[X] = \mu(1-\mu)$
• Gaussian: $X \in \mathbb{R}$, $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, $\mathrm{E}[X] = \mu$, $\mathrm{Var}[X] = \sigma^2$
• Multivariate Gaussian: $X \in \mathbb{R}^d$, $p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$, $\mathrm{E}[X] = \boldsymbol{\mu}$, $\mathrm{Cov}[X] = \Sigma$
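The Bernoulli pmf and the univariate Gaussian density above translate directly into code — a sketch, where the particular parameter values are arbitrary illustrative choices:

```python
import math

# Bernoulli pmf and Gaussian pdf from the formulas above.
def bernoulli_pmf(x, mu):
    # p(X=1) = mu, p(X=0) = 1 - mu
    return mu if x == 1 else 1 - mu

def gaussian_pdf(x, mu, sigma2):
    # (1 / sqrt(2 pi sigma^2)) exp(-(x - mu)^2 / (2 sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))   # → 0.3989, the standard normal peak
```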
PROBABILITY
Topics: expectation, variance

• Expectation: $\mathrm{E}[X] = \sum_x x\, p(X=x)$
  ‣ properties:
    - $\mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y]$
    - $\mathrm{E}[f(X)] = \sum_x f(x)\, p(X=x)$
    - if independent, $\mathrm{E}[XY] = \mathrm{E}[X]\,\mathrm{E}[Y]$
• Variance: $\mathrm{Var}[X] = \sum_x (x - \mathrm{E}[X])^2\, p(X=x) = \mathrm{E}[X^2] - \mathrm{E}[X]^2$
  ‣ properties:
    - if independent, $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$
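Both definitions can be evaluated exactly for the die value $X$ — a sketch using the $\mathrm{E}[X^2] - \mathrm{E}[X]^2$ form of the variance:

```python
from fractions import Fraction

# Expectation and variance of the die value X, from the definitions
# E[X] = sum_x x p(x) and Var[X] = E[X^2] - E[X]^2.
p = {x: Fraction(1, 6) for x in range(1, 7)}

E = sum(x * p[x] for x in p)
E_sq = sum(x * x * p[x] for x in p)
Var = E_sq - E ** 2

print(E, Var)   # → 7/2 35/12
```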
• p(x1, x2) = p(x1)p(x2)
• p(x1|x2) = p(x1)
• p(x2|x1) = p(x2)
• p(x1, x2|x3) = p(x1|x3)p(x2|x3)
• p(x1|x2, x3) = p(x1|x3)
• p(x2|x1, x3) = p(x2|x3)
• E[X] =
Px x p(X = x)
• E[X + Y ] = E[X] + E[Y ]
• E[XY ] = E[X]E[Y ]
• E[f(X)] =
Px f(x) p(X = x)
• Var[X] =
Px(x� E(X))
2 p(X = x)
• Var[X] = E[X2]� E[X]
2
• Var[X + Y ] = Var[X] + Var[Y ]
•
Cov(X1, X2) = E[(X1 � E[X1])(X2 � E[X2])]
=
X
x1
X
x2
(x1 � E[X1])(x2 � E[X2]) p(x1, x2)
• Cov(X1, X2) = 0
• Var(X) = Cov(X,X)
• Cov(X) =
2
64Cov(X1, X1) . . . Cov(X1, Xd)
.
.
.
.
.
.
.
.
.
Cov(Xd, X1) . . . Cov(Xd, Xd)
3
75
• P (X 2 A) =
Rx2A
p(x)dx
• P (X = x)
• X 2 {0, 1} p(X = 1) = µ p(X = 0) = 1�µ E[X] = µ Var[X] = µ(1�µ)
• X 2 R p(x) = 1p2⇡�2
exp
⇣� (x�µ)2
2�2
⌘E[X] = µ Var[X] = �2
• X 2 Rd p(x) =
1p(2⇡)d det(⌃)
exp
�� 1
2 (x� µ)>⌃�1(x� µ)
�E[X] =
µ Cov[X] = ⌃
4
• p(x1, x2) = p(x1)p(x2)
• p(x1|x2) = p(x1)
• p(x2|x1) = p(x2)
• p(x1, x2|x3) = p(x1|x3)p(x2|x3)
• p(x1|x2, x3) = p(x1|x3)
• p(x2|x1, x3) = p(x2|x3)
• E[X] =
Px x p(X = x)
• E[X + Y ] = E[X] + E[Y ]
• E[XY ] = E[X]E[Y ]
• E[f(X)] =
Px f(x) p(X = x)
• Var[X] =
Px(x� E(X))
2 p(X = x)
• Var[X] = E[X2]� E[X]
2
• Var[X + Y ] = Var[X] + Var[Y ]
•
Cov(X1, X2) = E[(X1 � E[X1])(X2 � E[X2])]
=
X
x1
X
x2
(x1 � E[X1])(x2 � E[X2]) p(x1, x2)
• Cov(X1, X2) = 0
• Var(X) = Cov(X,X)
• Cov(X) =
2
64Cov(X1, X1) . . . Cov(X1, Xd)
.
.
.
.
.
.
.
.
.
Cov(Xd, X1) . . . Cov(Xd, Xd)
3
75
• P (X 2 A) =
Rx2A
p(x)dx
• P (X = x)
• X 2 {0, 1} p(X = 1) = µ p(X = 0) = 1�µ E[X] = µ Var[X] = µ(1�µ)
• X 2 R p(x) = 1p2⇡�2
exp
⇣� (x�µ)2
2�2
⌘E[X] = µ Var[X] = �2
• X 2 Rd p(x) =
1p(2⇡)d det(⌃)
exp
�� 1
2 (x� µ)>⌃�1(x� µ)
�E[X] =
µ Cov[X] = ⌃
4
• p(x1, x2) = p(x1)p(x2)
• p(x1|x2) = p(x1)
• p(x2|x1) = p(x2)
• p(x1, x2|x3) = p(x1|x3)p(x2|x3)
• p(x1|x2, x3) = p(x1|x3)
• p(x2|x1, x3) = p(x2|x3)
• E[X] =
Px x p(X = x)
• E[X + Y ] = E[X] + E[Y ]
• E[XY ] = E[X]E[Y ]
• E[f(X)] =
Px f(x) p(X = x)
• Var[X] =
Px(x� E(X))
2 p(X = x)
• Var[X] = E[X2]� E[X]
2
• Var[X + Y ] = Var[X] + Var[Y ]
•
Cov(X1, X2) = E[(X1 � E[X1])(X2 � E[X2])]
=
X
x1
X
x2
(x1 � E[X1])(x2 � E[X2]) p(x1, x2)
• Cov(X1, X2) = 0
• Var(X) = Cov(X,X)
• Cov(X) =
2
64Cov(X1, X1) . . . Cov(X1, Xd)
.
.
.
.
.
.
.
.
.
Cov(Xd, X1) . . . Cov(Xd, Xd)
3
75
• P (X 2 A) =
Rx2A
p(x)dx
• P (X = x)
• X 2 {0, 1} p(X = 1) = µ p(X = 0) = 1�µ E[X] = µ Var[X] = µ(1�µ)
• X 2 R p(x) = 1p2⇡�2
exp
⇣� (x�µ)2
2�2
⌘E[X] = µ Var[X] = �2
• X 2 Rd p(x) =
1p(2⇡)d det(⌃)
exp
�� 1
2 (x� µ)>⌃�1(x� µ)
�E[X] =
µ Cov[X] = ⌃
4
• p(x1, x2) = p(x1)p(x2)
• p(x1|x2) = p(x1)
• p(x2|x1) = p(x2)
• p(x1, x2|x3) = p(x1|x3)p(x2|x3)
• p(x1|x2, x3) = p(x1|x3)
• p(x2|x1, x3) = p(x2|x3)
• E[X] =
Px x p(X = x)
• E[X + Y ] = E[X] + E[Y ]
• E[XY ] = E[X]E[Y ]
• E[f(X)] =
Px f(x) p(X = x)
• Var[X] =
Px(x� E(X))
2 p(X = x)
• Var[X] = E[X2]� E[X]
2
• Var[X + Y ] = Var[X] + Var[Y ]
•
Cov(X1, X2) = E[(X1 � E[X1])(X2 � E[X2])]
=
X
x1
X
x2
(x1 � E[X1])(x2 � E[X2]) p(x1, x2)
• Cov(X1, X2) = 0
• Var(X) = Cov(X,X)
• Cov(X) =
2
64Cov(X1, X1) . . . Cov(X1, Xd)
.
.
.
.
.
.
.
.
.
Cov(Xd, X1) . . . Cov(Xd, Xd)
3
75
• P (X 2 A) =
Rx2A
p(x)dx
• P (X = x)
• Bernoulli: X \in \{0, 1\},  p(X = 1) = \mu,  p(X = 0) = 1 - \mu,  E[X] = \mu,  Var[X] = \mu(1 - \mu)
• Gaussian: X \in \mathbb{R},  p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),  E[X] = \mu,  Var[X] = \sigma^2
• Multivariate Gaussian: X \in \mathbb{R}^d,  p(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right),  E[X] = \mu,  Cov[X] = \Sigma
• Independence: p(x_1, x_2) = p(x_1) p(x_2), i.e. p(x_1 | x_2) = p(x_1) and p(x_2 | x_1) = p(x_2)
• Conditional independence: p(x_1, x_2 | x_3) = p(x_1 | x_3) p(x_2 | x_3), i.e. p(x_1 | x_2, x_3) = p(x_1 | x_3) and p(x_2 | x_1, x_3) = p(x_2 | x_3)
• Expectation: E[X] = \sum_x x\, p(X = x),  E[f(X)] = \sum_x f(x)\, p(X = x)
• E[X + Y] = E[X] + E[Y]
• E[XY] = E[X]\, E[Y]  (if X and Y are independent)
• Variance: Var[X] = \sum_x (x - E[X])^2\, p(X = x) = E[X^2] - E[X]^2
• Var[X + Y] = Var[X] + Var[Y]  (if X and Y are independent)
• Covariance: Cov(X_1, X_2) = E[(X_1 - E[X_1])(X_2 - E[X_2])] = \sum_{x_1} \sum_{x_2} (x_1 - E[X_1])(x_2 - E[X_2])\, p(x_1, x_2)
• Cov(X_1, X_2) = 0 if X_1 and X_2 are independent;  Var(X) = Cov(X, X)
• Covariance matrix: Cov(X), with entries Cov(X_i, X_j) for i, j = 1, \dots, d
• P(X \in A) = \int_{x \in A} p(x)\, dx;  P(X = x) = 0 for continuous X
19
PROBABILITY
Topics: covariance matrix
• Covariance:
  Cov(X_1, X_2) = E[(X_1 - E[X_1])(X_2 - E[X_2])]
                = \sum_{x_1} \sum_{x_2} (x_1 - E[X_1])(x_2 - E[X_2])\, p(x_1, x_2)
‣ Cov(X_1, X_2) = 0 if X_1 and X_2 are independent
‣ Var(X) = Cov(X, X)
• Covariance matrix:
  Cov(X) = \begin{bmatrix} Cov(X_1, X_1) & \dots & Cov(X_1, X_d) \\ \vdots & \ddots & \vdots \\ Cov(X_d, X_1) & \dots & Cov(X_d, X_d) \end{bmatrix}
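The double-sum definition of the covariance above can be computed directly from a joint pmf. A minimal sketch, using a small made-up joint pmf over {0, 1} × {0, 1} (the distribution is hypothetical, not from the slides):

```python
# Covariance via the double sum on the slide:
# Cov(X1, X2) = sum_{x1} sum_{x2} (x1 - E[X1]) (x2 - E[X2]) p(x1, x2)

# Hypothetical joint pmf over {0,1} x {0,1}.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def expectation(p, axis):
    # marginal expectation E[X_axis] read off the joint pmf
    return sum(x[axis] * pr for x, pr in p.items())

def covariance(p):
    m1, m2 = expectation(p, 0), expectation(p, 1)
    return sum((x1 - m1) * (x2 - m2) * pr for (x1, x2), pr in p.items())

cov = covariance(p)
```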
20
PROBABILITY
Topics: continuous variables
• For a continuous variable X, p(x) is a density function:
  P(X \in A) = \int_{x \in A} p(x)\, dx
‣ the probability P(X = x) is zero for continuous variables
‣ in the previous equations, summations would be replaced by integrals
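The "probabilities are integrals of the density" statement above can be checked numerically. A sketch that integrates the standard Gaussian density over [-1.96, 1.96] with the trapezoidal rule (the interval and grid size are illustrative choices, not from the slides):

```python
import math

# P(X in A) = integral of the density over A; here A = [-1.96, 1.96]
# for a standard Gaussian, which should give roughly 0.95.

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def integrate(f, a, b, n=10_000):
    # simple trapezoidal rule on n subintervals
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

prob = integrate(gaussian_pdf, -1.96, 1.96)
```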
21
PROBABILITY
Topics: Bernoulli, Gaussian distributions
• Bernoulli variable: X \in \{0, 1\}
‣ p(X = 1) = \mu
‣ p(X = 0) = 1 - \mu
‣ E[X] = \mu
‣ Var[X] = \mu(1 - \mu)
• Gaussian variable: X \in \mathbb{R}
‣ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
‣ E[X] = \mu
‣ Var[X] = \sigma^2
22
PROBABILITY
Topics: multivariate Gaussian distribution
• Gaussian variable: X \in \mathbb{R}^d
‣ p(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right)
‣ E[X] = \mu
‣ Cov[X] = \Sigma
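For d = 2 the multivariate Gaussian density above can be written out with the 2×2 determinant and inverse made explicit. A minimal sketch (the function name and test point are illustrative):

```python
import math

# Multivariate Gaussian density for d = 2:
# p(x) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / sqrt((2 pi)^d det(Sigma))

def mvn_pdf_2d(x, mu, Sigma):
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 inverse
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det)

# at the mean with identity covariance, the density is 1 / (2 pi)
val = mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```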
23
STATISTICS
Topics: estimates of the expectation and covariance matrix
• Sample mean: \hat{\mu} = \frac{1}{T} \sum_t x^{(t)}
• Sample variance: \hat{\sigma}^2 = \frac{1}{T-1} \sum_t (x^{(t)} - \hat{\mu})^2
• Sample covariance matrix: \hat{\Sigma} = \frac{1}{T-1} \sum_t (x^{(t)} - \hat{\mu})(x^{(t)} - \hat{\mu})^\top
• These estimators are unbiased, i.e.:
  E[\hat{\mu}] = \mu,  E[\hat{\sigma}^2] = \sigma^2,  E[\hat{\Sigma}] = \Sigma
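The sample estimators above translate directly into code; note the 1/T factor for the mean versus the 1/(T-1) factor for the variance. A sketch on a small made-up 1D sample:

```python
# Sample mean and sample variance as defined on the slide:
# mu_hat = (1/T) sum_t x^(t),  sigma2_hat = 1/(T-1) sum_t (x^(t) - mu_hat)^2

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
T = len(xs)

mu_hat = sum(xs) / T
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / (T - 1)
```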
24
STATISTICS
Topics: confidence interval
• Confidence interval of the sample mean (1D):
‣ if T is large, the following estimator is approximately Gaussian with mean 0 and variance 1:
  \frac{\hat{\mu} - \mu}{\sqrt{\hat{\sigma}^2 / T}}
‣ then, with 95% probability:
  \mu \in \hat{\mu} \pm 1.96 \sqrt{\hat{\sigma}^2 / T}
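Putting the sample estimators and the 1.96 multiplier together gives a 95% interval. A sketch, reusing a made-up sample (the function name is illustrative):

```python
import math

# 95% confidence interval for the mean, as on the slide:
# mu in mu_hat +/- 1.96 * sqrt(sigma2_hat / T)

def confidence_interval(xs):
    T = len(xs)
    mu_hat = sum(xs) / T
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / (T - 1)
    half = 1.96 * math.sqrt(sigma2_hat / T)
    return mu_hat - half, mu_hat + half

lo, hi = confidence_interval([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```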
25
STATISTICS
Topics: maximum likelihood, i.i.d. hypothesis
• Maximum likelihood estimator (MLE):
  \hat{\theta} = \mathrm{argmax}_{\theta}\, p(x^{(1)}, \dots, x^{(T)})
‣ the sample mean is the MLE of \mu for a Gaussian distribution
‣ the sample covariance matrix isn't the MLE of \Sigma, but this is:
  \frac{T-1}{T} \hat{\Sigma} = \frac{1}{T} \sum_t (x^{(t)} - \hat{\mu})(x^{(t)} - \hat{\mu})^\top
• Independent and identically distributed (i.i.d.) variables:
  p(x^{(1)}, \dots, x^{(T)}) = \prod_t p(x^{(t)})
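The (T-1)/T relation between the unbiased estimator and the MLE above is easy to verify numerically. A 1D sketch on a made-up sample:

```python
# The MLE of the variance uses 1/T, the unbiased estimator uses 1/(T-1);
# they differ exactly by the factor (T-1)/T, as on the slide.

xs = [1.0, 2.0, 3.0, 4.0]  # made-up sample
T = len(xs)
mu_hat = sum(xs) / T

unbiased = sum((x - mu_hat) ** 2 for x in xs) / (T - 1)
mle = sum((x - mu_hat) ** 2 for x in xs) / T
```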
26
SAMPLING
Topics: Monte Carlo estimate
• Monte Carlo estimate:
‣ a method to approximate an expensive expectation:
  E[f(X)] = \sum_x f(x)\, p(x) \approx \frac{1}{K} \sum_k f(x^{(k)})
‣ the x^{(k)} must be sampled from p(x):  x^{(k)} \sim p(x)
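The Monte Carlo estimate above can be sketched in a few lines. Here the target distribution is a fair six-sided die and f(x) = x² (both illustrative choices, not from the slides):

```python
import random

# Monte Carlo estimate: E[f(X)] ~ (1/K) sum_k f(x^(k)),  x^(k) ~ p(x).
random.seed(0)

def monte_carlo(f, sample, K=100_000):
    return sum(f(sample()) for _ in range(K)) / K

# estimate E[X^2] for a fair six-sided die, and the exact value for comparison
estimate = monte_carlo(lambda x: x * x, lambda: random.randint(1, 6))
exact = sum(x * x for x in range(1, 7)) / 6
```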
27
SAMPLING
Topics: importance sampling
• Importance sampling:
‣ a sampling method for when p(x) is expensive to sample from:
  E[f(X)] = \sum_x f(x) \frac{p(x)}{q(x)} q(x) \approx \frac{1}{K} \sum_k f(x^{(k)}) \frac{p(x^{(k)})}{q(x^{(k)})},  x^{(k)} \sim q(x)
‣ q(x) is easier to sample from and should be as similar as possible to p(x)
  - designing a good q(x) is often hard to do
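The reweighting trick above can be sketched with a tiny discrete example: the target p is a biased coin, the proposal q is a fair coin, and samples from q are reweighted by p/q (the distributions are made up for illustration):

```python
import random

# Importance sampling: E_p[f(X)] ~ (1/K) sum_k f(x^(k)) p(x^(k))/q(x^(k)),
# with x^(k) ~ q.
random.seed(0)

p = {0: 0.1, 1: 0.9}   # hypothetical "expensive" target
q = {0: 0.5, 1: 0.5}   # easy proposal

def importance_estimate(f, K=100_000):
    total = 0.0
    for _ in range(K):
        x = random.randint(0, 1)       # sample from q
        total += f(x) * p[x] / q[x]    # reweight by p/q
    return total / K

estimate = importance_estimate(lambda x: x)  # targets E_p[X] = 0.9
```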
28
SAMPLING
Topics: Markov chain Monte Carlo (MCMC)
• MCMC:
‣ iterative method to generate the sequence of samples:
  x^{(1)} \xrightarrow{T(x' \leftarrow x)} x^{(2)} \xrightarrow{T(x' \leftarrow x)} x^{(3)} \xrightarrow{T(x' \leftarrow x)} \cdots \xrightarrow{T(x' \leftarrow x)} x^{(K)}
‣ the set of samples x^{(k)} will be dependent on each other
‣ T(x' \leftarrow x) is a transition operator that must satisfy certain properties
‣ K must be large for the set of samples to be representative of the distribution p(x)
‣ usually, we drop the first samples, which are not reliable
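One concrete transition operator T(x' ← x) with the required properties is a Metropolis step; the slide does not name it, so this is just one standard choice for illustration. A sketch targeting a standard 1D Gaussian, including the drop-the-first-samples step:

```python
import math, random

# Metropolis transition operator (one standard MCMC choice, not named
# on the slide): propose a local move, accept with probability
# min(1, p(x')/p(x)), otherwise stay at x.
random.seed(0)

def log_p(x):
    return -0.5 * x * x  # standard Gaussian log-density, up to a constant

def metropolis_step(x, step=1.0):
    proposal = x + random.uniform(-step, step)
    if math.log(random.random()) < log_p(proposal) - log_p(x):
        return proposal    # accept
    return x               # reject: chain stays at x

# run the chain, dropping the first (unreliable) samples
chain = [0.0]
for _ in range(20_000):
    chain.append(metropolis_step(chain[-1]))
samples = chain[5_000:]
mean = sum(samples) / len(samples)
```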
29
SAMPLINGTopics: Gibbs sampling• Gibbs sampling:‣ MCMC method which uses the following transition operator- pick a variable
- obtain by only resampling this variable according to
- return
‣ often, we simply cycle through the variables, in random order
MACHINE LEARNING
Topics: supervised learning
• Learning example: (x, y)
• Task to solve: predict target y from input x
‣ classification: target is a class ID (from 0 to nb. of classes − 1)
‣ regression: target is a real number
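A minimal sketch of a training set D^train = {(x^(t), y^(t))} for the two task types; the numbers below are made up for illustration.

```python
# Classification: y is a class ID (an integer from 0 to nb. of classes - 1)
D_train_classification = [
    ([5.1, 3.5], 0),
    ([6.2, 2.9], 1),
]

# Regression: y is a real number
D_train_regression = [
    ([1200.0, 3.0], 249000.0),
    ([800.0, 2.0], 179000.0),
]

n_classes = len({y for _, y in D_train_classification})
```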
MACHINE LEARNING
Topics: unsupervised learning
• Learning example: x
• No explicit target to predict
‣ clustering: partition data into groups
‣ feature extraction: learn meaningful features automatically
‣ dimensionality reduction: learn a lower-dimensional representation of the input
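As a concrete instance of clustering, here is a minimal 1-D k-means sketch; the data points and initial centers are made-up illustrations, not from the slides.

```python
# Partition 1-D data into groups by alternating assignment and update steps
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]

def kmeans_1d(xs, centers, iters=10):
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        groups = [[] for _ in centers]
        for x in xs:
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            groups[i].append(x)
        # update step: each center moves to the mean of its group
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

centers = kmeans_1d(data, [0.0, 10.0])  # converges to the two group means
```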
MACHINE LEARNING
Topics: learning algorithm, model, training set
• Learning algorithm:
‣ takes as input a training set D^train
‣ outputs a model f(x; θ)
• We then say the model was trained on D^train
‣ the model has learned the information present in D^train
• We can now use the model on new inputs
MACHINE LEARNING
Topics: training, validation and test sets, generalization
• Training set serves to train a model
• Validation set serves to select hyper-parameters
• Test set serves to estimate the generalization performance (error)
• Generalization is the behavior of the model on unseen examples
‣ this is what we care about in machine learning!
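The three sets above can be obtained by splitting the available examples; the 70/15/15 proportions below are a common choice, assumed here for illustration.

```python
import random
random.seed(0)

# Made-up labeled examples
examples = [([float(t)], t % 2) for t in range(100)]
random.shuffle(examples)

n = len(examples)
D_train = examples[: int(0.70 * n)]                 # train the model
D_valid = examples[int(0.70 * n): int(0.85 * n)]    # select hyper-parameters
D_test = examples[int(0.85 * n):]                   # estimate generalization error
```

Keeping the test set untouched until the very end is what makes its error a fair estimate of generalization.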
MACHINE LEARNING
Topics: capacity of a model, underfitting, overfitting, hyper-parameter, model selection
• Capacity: flexibility of a model
• Hyper-parameter: a parameter of a model that is not trained (specified before training)
• Underfitting: state of a model which could improve generalization with more training or capacity
• Overfitting: state of a model which could improve generalization with less training or capacity
• Model selection: process of choosing the best hyper-parameters on the validation set
MACHINE LEARNING
Topics: capacity of a model, underfitting, overfitting, hyper-parameter, model selection
[Figure: training and validation error curves as a function of training time or capacity; training error keeps decreasing, while validation error is U-shaped, with the underfitting regime on the left and the overfitting regime on the right]
MACHINE LEARNING
Topics: interaction between training set size/capacity/training time and training error/generalization error
• If capacity increases:
‣ training error will decrease
‣ generalization error will increase or decrease
• If training time increases:
‣ training error will decrease
‣ generalization error will increase or decrease
• If training set size increases:
‣ generalization error will decrease (or maybe stay the same)
‣ difference between the training and generalization error will decrease
MACHINE LEARNING
Topics: empirical risk minimization, regularization
• Empirical risk minimization:
‣ framework to design learning algorithms
‣ l(f(x^(t); θ), y^(t)) is a loss function
‣ Ω(θ) is a regularizer (penalizes certain values of θ)
• Learning is cast as optimization:
‣ ideally, we'd optimize classification error, but it's not smooth
‣ the loss function is a surrogate for what we truly should optimize (e.g. an upper bound)
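The regularized objective (1/T) ∑_t l(f(x^(t); θ), y^(t)) + λ Ω(θ) can be sketched for a concrete choice of pieces; the linear model, squared loss, and L2 regularizer below are assumed for illustration, not prescribed by the slides.

```python
def f(x, theta):
    # Linear model f(x; theta) = theta . x
    return sum(w * xi for w, xi in zip(theta, x))

def squared_loss(pred, y):
    return (pred - y) ** 2

def omega(theta):
    # L2 regularizer: penalizes large weights
    return sum(w * w for w in theta)

def regularized_risk(D_train, theta, lam):
    T = len(D_train)
    empirical = sum(squared_loss(f(x, theta), y) for x, y in D_train) / T
    return empirical + lam * omega(theta)

# Made-up training set fit perfectly by theta = [2, -1]: only the penalty remains
D_train = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]
value = regularized_risk(D_train, [2.0, -1.0], lam=0.1)
```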
MACHINE LEARNING
Topics: gradient descent
• Gradient descent: procedure to minimize a function
‣ compute the gradient
‣ take a step in the opposite direction
MACHINE LEARNING
Topics: gradient descent
[Figure: successive gradient descent steps on a curve; at each point, −α ∂f(x)/∂x gives the descent direction]
MACHINE LEARNING
Topics: gradient descent
• Gradient descent for empirical risk minimization:
‣ initialize θ
‣ for N iterations:
- Δ = −(1/T) ∑_t ∇_θ l(f(x^(t); θ), y^(t)) − λ ∇_θ Ω(θ)
- θ ← θ + αΔ
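The loop above can be sketched for a 1-D linear model; the squared loss l = (θx − y)², the absence of a regularizer (λ = 0), and the made-up data are assumed concrete choices for illustration.

```python
def gradient_descent(D_train, theta=0.0, alpha=0.1, lam=0.0, N=200):
    T = len(D_train)
    for _ in range(N):
        # Delta = -(1/T) sum_t grad l(f(x;theta), y) - lam * grad Omega(theta)
        delta = -sum(2 * (theta * x - y) * x for x, y in D_train) / T \
                - lam * 2 * theta
        theta = theta + alpha * delta  # theta <- theta + alpha * Delta
    return theta

# Made-up data generated by y = 2x; gradient descent recovers the slope
D_train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
theta = gradient_descent(D_train)
```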
• Update rule with learning rate α: θ ← θ + αΔ
• Critical points: {x ∈ ℝ^d | ∇_x f(x) = 0}
• Positive definite curvature: v^⊤ ∇²_x f(x) v > 0 ∀v
• Negative definite curvature: v^⊤ ∇²_x f(x) v < 0 ∀v
• Stochastic gradient, for a single example (x^(t), y^(t)): Δ = −∇_θ l(f(x^(t); θ), y^(t)) − λ ∇_θ Ω(θ)
• f^*, f
MACHINE LEARNING
Topics: local and global optima
[Figure: a function with several local optima and one global optimum]
MACHINE LEARNING
Topics: critical points, local optima, saddle point, curvature
• Critical points: {x ∈ ℝ^d | ∇_x f(x) = 0}
• Curvature in direction v: v^⊤ ∇²_x f(x) v
• Types of critical points:
‣ local minima: v^⊤ ∇²_x f(x) v > 0 ∀v (i.e. the Hessian is positive definite)
‣ local maxima: v^⊤ ∇²_x f(x) v < 0 ∀v (i.e. negative definite)
‣ saddle point: curvature is positive in certain directions and negative in others
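The directional curvature v^⊤ ∇²f(x) v can be checked numerically with a finite-difference second derivative; the saddle f(x₁, x₂) = x₁² − x₂² is a standard illustrative choice, assumed here.

```python
def curvature(f, x, v, eps=1e-4):
    # Second directional derivative of f at x along v, by central finite differences
    def at(t):
        return f([xi + t * vi for xi, vi in zip(x, v)])
    return (at(eps) - 2 * at(0.0) + at(-eps)) / eps ** 2

f = lambda p: p[0] ** 2 - p[1] ** 2   # classic saddle; gradient vanishes at origin
origin = [0.0, 0.0]
c1 = curvature(f, origin, [1.0, 0.0])  # positive curvature along x1
c2 = curvature(f, origin, [0.0, 1.0])  # negative curvature along x2
```

Positive curvature in one direction and negative in another at a critical point is exactly the saddle-point case from the slide.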
MACHINE LEARNING
Topics: saddle point
[Figure: a saddle-shaped surface around a saddle point]
MACHINE LEARNING
Topics: stochastic gradient descent
• Algorithm that performs updates after each example:
‣ initialize θ
‣ for N iterations:
- for each training example (x^(t), y^(t)):
✓ Δ = −∇_θ l(f(x^(t); θ), y^(t)) − λ ∇_θ Ω(θ)
✓ θ ← θ + αΔ
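A sketch of the per-example update, in the same 1-D linear setting assumed for the batch version (squared loss, no regularizer, made-up data):

```python
def sgd(D_train, theta=0.0, alpha=0.05, N=100):
    for _ in range(N):
        for x, y in D_train:                   # update after each example
            delta = -2 * (theta * x - y) * x   # Delta = -grad of l(f(x;theta), y)
            theta = theta + alpha * delta      # theta <- theta + alpha * Delta
    return theta

# Made-up data generated by y = 2x
D_train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
theta = sgd(D_train)
```

Compared with batch gradient descent, each update here uses a single (noisier) gradient, but updates happen T times more often per pass over the data.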
MACHINE LEARNING
Topics: bias-variance trade-off
• Variance of the trained model: does it vary a lot if the training set changes?
• Bias of the trained model: is the average model close to the true solution?
• Generalization error can be seen as the sum of the bias and the variance
[Figure: three fits illustrating low variance/high bias, a good trade-off, and high variance/low bias]
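Bias and variance can be estimated by retraining the same model on many freshly drawn training sets; the true function y = 2x + noise and the through-the-origin least-squares fit below are made-up choices for illustration.

```python
import random
random.seed(0)

def fit_slope(D):
    # Least-squares slope through the origin
    return sum(x * y for x, y in D) / sum(x * x for x, _ in D)

def sample_training_set(T=20):
    # Fresh training set from the (assumed) true process y = 2x + Gaussian noise
    return [(x, 2.0 * x + random.gauss(0.0, 1.0))
            for x in (random.uniform(0.0, 1.0) for _ in range(T))]

slopes = [fit_slope(sample_training_set()) for _ in range(500)]
mean_slope = sum(slopes) / len(slopes)   # close to the true slope 2 -> low bias
variance = sum((s - mean_slope) ** 2 for s in slopes) / len(slopes)
```

The spread of `slopes` across training sets is the variance of the trained model; the gap between `mean_slope` and the true slope is its bias.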
MACHINE LEARNING
Topics: parametric vs. non-parametric
• Parametric model: its capacity is fixed and does not increase with the amount of training data
‣ examples: linear classifier, neural network with a fixed number of hidden units, etc.
• Non-parametric model: its capacity increases with the amount of training data
‣ examples: k-nearest neighbors classifier, neural network with adaptable hidden layer size, etc.
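A 1-nearest-neighbor classifier makes the non-parametric case concrete: it stores every training example, so its capacity grows with the training set. The data below is a made-up illustration.

```python
def one_nn_predict(D_train, x):
    # Return the label of the training example closest to x (squared distance)
    nearest = min(D_train,
                  key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
    return nearest[1]

# Two made-up clusters with labels 0 and 1
D_train = [([0.0, 0.0], 0), ([0.1, 0.2], 0), ([5.0, 5.0], 1), ([5.2, 4.9], 1)]
pred = one_nn_predict(D_train, [4.8, 5.1])
```

Adding more examples to `D_train` directly adds capacity, whereas a linear classifier would keep the same fixed number of parameters.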
PYTHON
http://docs.python.org/tutorial/ (EN)
http://www.dmi.usherb.ca/~larocheh/cours/tutoriel_python.html (FR)
MLPYTHON
http://www.dmi.usherb.ca/~larocheh/mlpython/tutorial.html#tutorial (EN)