TRANSCRIPT
Lecture 1: Introduction
Andre Martins
Deep Structured Learning Course, Fall 2019
Andre Martins (IST) Lecture 1 IST, Fall 2019 1 / 82
Course Website
https://andre-martins.github.io/pages/deep-structured-learning-ist-fall-2019.html
There I’ll post:
• Syllabus
• Lecture slides
• Literature pointers
• Homework assignments
• ...
Outline
1 Introduction
2 Class Administrivia
3 Recap
Linear Algebra
Probability Theory
Optimization
What is “Deep Learning”?
• Neural networks?
• Neural networks with many hidden layers?
• Anything beyond shallow (linear) models for statistical learning?
• Anything that learns representations?
• A form of learning that is really intense and profound?
Where is the “Structure”?
• In the input objects (text, graphs, images, ...)
• In the outputs we want to predict (parsing, graph labeling, image segmentation, ...)
• In our model (convolutional networks, attention mechanisms)
• Related: latent structure (typically a way of encoding prior knowledge into the model)
This Course: “Deep Learning + Structure”
Why Did Deep Learning Become Mainstream?
Lots of recent breakthroughs:
• Object recognition
• Speech and language processing
• Chatbots and dialog systems
• Self-driving cars
• Machine translation
• Solving games (Atari, Go)
No signs of slowing down...
Why Now?
Why does deep learning work now, but not 20 years ago?
Many of the core ideas were there, after all.
But now we have:
• more data
• more computing power
• better software engineering
• a few algorithmic innovations (many layers, ReLUs, better initialization and learning rates, dropout, LSTMs, convolutional nets)
“But It’s Non-Convex”
Why does gradient-based optimization work at all in neural nets despite the non-convexity?
One possible, partial answer:
• there are generally many hidden units
• there are many ways a neural net can approximately implement the desired input-output relationship
• we only need to find one
Recommended Books
Main book:
• Deep Learning. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. MIT Press, 2016. Chapters available at http://deeplearningbook.org
Recommended Books
Secondary books:
• Machine Learning: a Probabilistic Perspective. Kevin P. Murphy. MIT Press, 2013.
• Linguistic Structured Prediction. Noah A. Smith. Morgan & Claypool Synthesis Lectures on Human Language Technologies. 2011.
Tentative Syllabus
• Sep 16–20: Introduction and Course Description
• Sep 23–27: Linear Classifiers
• Sep 30–Oct 4: Feedforward Neural Networks
• Oct 7–11: Training Neural Networks
• Oct 14–18: Linear Sequence Models
• Oct 21–25: Representation Learning and Convolutional Networks
• Oct 28–Nov 1: Structured Prediction and Graphical Models
• Nov 4–8: Recurrent Neural Networks
• Nov 11–15: Sequence-to-Sequence Learning
• Nov 18–22: Attention Mechanisms and Neural Memories
• Nov 25–29: Deep Reinforcement Learning
• Dec 2–6: Deep Generative Models (VAEs, GANs)
• Dec 9–13: Final Projects I
• Dec 16–20: Final Projects II
Outline
1 Introduction
2 Class Administrivia
3 Recap
Linear Algebra
Probability Theory
Optimization
What This Class Is About
• Introduction to deep learning
• Introduction to structured prediction
• Goal: after finishing this class, you should be able to:
  • Understand how deep learning works without magic
  • Understand the intuition behind deep structured learning models
  • Apply the learned techniques to a practical problem (NLP, vision, ...)
• Target audience:
  • MSc/PhD students with a basic background in ML and good programming skills
What This Class Is Not About
It’s not about:
• Just playing with a deep learning toolkit without learning the fundamental concepts
• Introduction to ML (see Mario Figueiredo’s Statistical Learning course and Jorge Marques’ Estimation and Classification course)
• Optimization (check Joao Xavier’s Non-Linear Optimization course)
• Natural Language Processing
• ...
Prerequisites
• Calculus and basic linear algebra
• Basic probability theory
• Basic knowledge of machine learning
• Programming (Python & PyTorch preferred)
• Helpful: basic optimization
Course Information
• Instructors: Andre Martins & Vlad Niculae
• TAs: Goncalo Correia & Ben Peters
• Location: LT2 (North Tower, 4th floor)
• Schedule: Mondays/Fridays 10:00–11:30 (tentative)
• Communication: piazza.com/tecnico.ulisboa.pt/fall2019/pdeecdsl
Grading
• 4 homework assignments: 60%
  • Theoretical questions & implementation
  • Late days: 10% penalty per late day
• Final project (in groups of 2–3): 40%
  • Final class presentations & poster session (tentative)
Final Project
• Possible idea: apply a deep learning technique to a structured problem relevant to your research (NLP, vision, robotics, ...)
• Otherwise, pick a project from a list of suggestions
• Must be finished this semester
• Four evaluation stages: project proposal (10%), midterm report (10%), final report (10%, conference paper format), class presentation (10%)
• List of project suggestions will be made available soon
Collaboration Policy
• Assignments are individual
• Students may discuss the questions, as long as they write their own answers and their own code
• If this happens, acknowledge with whom you collaborate!
• Zero tolerance on plagiarism!!
• Always credit your sources!!!
Caveat
• This is the second year I’m teaching this class
• ... which means you’re the second batch of students taking it :)
• Constructive feedback will be highly appreciated (and encouraged!)
Questions?
Outline
1 Introduction
2 Class Administrivia
3 Recap
Linear Algebra
Probability Theory
Optimization
Quick Background Recap
Slide credits: Prof. Mario Figueiredo (taken from his LxMLS class)
Outline
1 Introduction
2 Class Administrivia
3 Recap
Linear Algebra
Probability Theory
Optimization
Linear Algebra
• Linear algebra provides (among many other things) a compact way of representing, studying, and solving linear systems of equations
• Example: the system

$$\begin{aligned} 4x_1 - 5x_2 &= -13 \\ -2x_1 + 3x_2 &= 9 \end{aligned}$$

can be written compactly as $Ax = b$, where

$$A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}, \qquad b = \begin{bmatrix} -13 \\ 9 \end{bmatrix},$$

and can be solved as

$$x = A^{-1}b = \begin{bmatrix} 1.5 & 2.5 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} -13 \\ 9 \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}.$$
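The worked example can be reproduced numerically; a minimal sketch with NumPy (not part of the slides):

```python
import numpy as np

# The system from the slide: 4*x1 - 5*x2 = -13, -2*x1 + 3*x2 = 9
A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

# np.linalg.solve is preferred over forming A^{-1} explicitly:
# it is cheaper and numerically more stable.
x = np.linalg.solve(A, b)
assert np.allclose(x, [3.0, 5.0])

# The explicit inverse from the slide, for comparison:
assert np.allclose(np.linalg.inv(A), [[1.5, 2.5], [1.0, 2.0]])
```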
Notation: Matrices and Vectors
• $A \in \mathbb{R}^{m \times n}$ is a matrix with $m$ rows and $n$ columns:

$$A = \begin{bmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & \ddots & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{bmatrix}.$$

• $x \in \mathbb{R}^n$ is a vector with $n$ components:

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}.$$

• A (column) vector is a matrix with $n$ rows and 1 column.
• A matrix with 1 row and $n$ columns is called a row vector.
Matrix Transpose and Products
• Given a matrix $A \in \mathbb{R}^{m \times n}$, its transpose $A^T$ is such that $(A^T)_{i,j} = A_{j,i}$.
• A matrix $A$ is symmetric if $A^T = A$.
• Given matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, their product is $C = AB \in \mathbb{R}^{m \times p}$, where $C_{i,j} = \sum_{k=1}^{n} A_{i,k} B_{k,j}$.
• Inner product between vectors $x, y \in \mathbb{R}^n$: $\langle x, y \rangle = x^T y = y^T x = \sum_{i=1}^{n} x_i y_i \in \mathbb{R}$.
• Outer product between vectors $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$: $x y^T \in \mathbb{R}^{n \times m}$, where $(x y^T)_{i,j} = x_i y_j$.
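These definitions map directly onto NumPy operations; a small sketch (not from the slides) checking each one against its entry-wise formula:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # A in R^{2x3}
B = np.arange(12).reshape(3, 4)  # B in R^{3x4}

C = A @ B                        # C = AB in R^{2x4}
assert C.shape == (2, 4)
# Entry-wise definition: C[i, j] = sum_k A[i, k] * B[k, j]
assert C[0, 1] == sum(A[0, k] * B[k, 1] for k in range(3))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
# Inner product <x, y> = sum_i x_i * y_i, and x^T y == y^T x
assert x @ y == y @ x == 32.0

outer = np.outer(x, y)           # outer product x y^T in R^{3x3}
assert outer.shape == (3, 3)
assert outer[1, 2] == x[1] * y[2]
```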
Properties of Matrix Products and Transposes
• Given matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, their product is $C = AB \in \mathbb{R}^{m \times p}$, where $C_{i,j} = \sum_{k=1}^{n} A_{i,k} B_{k,j}$.
• The matrix product is associative: $(AB)C = A(BC)$.
• In general, the matrix product is not commutative: $AB \neq BA$.
• Transpose of a product: $(AB)^T = B^T A^T$.
• Transpose of a sum: $(A + B)^T = A^T + B^T$.
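These identities are easy to verify numerically; a quick sketch with NumPy (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

# Associativity: (AB)C == A(BC)
assert np.allclose((A @ B) @ C, A @ (B @ C))

# Transpose of a product reverses the order: (AB)^T == B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)

# Non-commutativity: even for square matrices, AB != BA in general
P = np.array([[0, 1], [0, 0]])
Q = np.array([[0, 0], [1, 0]])
assert not np.allclose(P @ Q, Q @ P)
```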
Norms
• The norm of a vector is (informally) its “magnitude.” The Euclidean norm is

$$\|x\|_2 = \sqrt{\langle x, x \rangle} = \sqrt{x^T x} = \sqrt{\sum_{i=1}^{n} x_i^2}.$$

• More generally, the $\ell_p$ norm of a vector $x \in \mathbb{R}^n$, for $p \geq 1$, is

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

• Notable case: the $\ell_1$ norm, $\|x\|_1 = \sum_i |x_i|$.
• Notable case: the $\ell_\infty$ norm, $\|x\|_\infty = \max\{|x_1|, \ldots, |x_n|\}$.
• Notable case: the $\ell_0$ “norm” (not actually a norm), $\|x\|_0 = |\{i : x_i \neq 0\}|$.
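Each notable case corresponds to an `ord` value of `np.linalg.norm`; a short sketch (not from the slides):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

# Euclidean (l2) norm: sqrt(9 + 16 + 0) = 5
assert np.linalg.norm(x) == 5.0

# l1 norm: sum of absolute values = 3 + 4 + 0 = 7
assert np.linalg.norm(x, ord=1) == 7.0

# l_infinity norm: largest absolute value = 4
assert np.linalg.norm(x, ord=np.inf) == 4.0

# l0 "norm": number of nonzero entries = 2 (not a true norm)
assert np.count_nonzero(x) == 2
```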
Special Matrices

• The identity matrix I ∈ R^(n×n) is the square matrix such that

  I_ij = 1 if i = j, and I_ij = 0 if i ≠ j,

  i.e., I has ones on the diagonal and zeros elsewhere.
• Neutral element of the matrix product: A I = I A = A.
• Diagonal matrix: A ∈ R^(n×n) is diagonal if (i ≠ j) ⇒ A_ij = 0.
• Upper triangular matrix: (j < i) ⇒ A_ij = 0.
• Lower triangular matrix: (j > i) ⇒ A_ij = 0.
Andre Martins (IST) Lecture 1 IST, Fall 2019 38 / 82
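Each of these special matrices has a direct NumPy constructor; a small sketch (not in the original slides) illustrating them and the neutral-element property:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
I = np.eye(2)            # identity matrix: ones on the diagonal, zeros elsewhere

# Neutral element of the matrix product: A I = I A = A.
assert np.allclose(A @ I, A) and np.allclose(I @ A, A)

D = np.diag([5.0, 6.0])  # diagonal: A_ij = 0 whenever i != j
U = np.triu(A)           # upper triangular part: entries with j < i zeroed out
L = np.tril(A)           # lower triangular part: entries with j > i zeroed out
```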
Eigenvalues, eigenvectors, determinant, trace

• A nonzero vector x ∈ Rⁿ is an eigenvector of matrix A ∈ R^(n×n) if

  Ax = λx,

  where λ ∈ R is the corresponding eigenvalue.
• The eigenvalues of a diagonal matrix are the elements on the diagonal.
• Matrix trace: trace(A) = ∑_i A_ii = ∑_i λ_i.
• Matrix determinant: |A| = det(A) = ∏_i λ_i.
• Properties: |AB| = |A||B|, |Aᵀ| = |A|, |αA| = αⁿ|A|.
Andre Martins (IST) Lecture 1 IST, Fall 2019 39 / 82
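The trace and determinant identities can be verified on a small diagonal matrix, where the eigenvalues are just the diagonal entries. A quick check (not in the original slides; assumes NumPy):

```python
import numpy as np

A = np.diag([2.0, 3.0, 4.0])
lams = np.linalg.eigvals(A)   # eigenvalues of a diagonal matrix = diagonal entries

# trace(A) = sum of eigenvalues; det(A) = product of eigenvalues
assert np.isclose(np.trace(A), lams.sum())        # 2 + 3 + 4 = 9
assert np.isclose(np.linalg.det(A), lams.prod())  # 2 * 3 * 4 = 24
```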
Matrix Inverse

• Matrix A ∈ R^(n×n) is invertible if there is B ∈ R^(n×n) such that AB = BA = I.
• Such a matrix B, when it exists, is unique and is denoted B = A⁻¹.
• Matrix A ∈ R^(n×n) is invertible ⇔ det(A) ≠ 0.
• Determinant of the inverse: det(A⁻¹) = 1 / det(A).
• Solving the system Ax = b, if A is invertible: x = A⁻¹b.
• Properties: (A⁻¹)⁻¹ = A, (A⁻¹)ᵀ = (Aᵀ)⁻¹, (AB)⁻¹ = B⁻¹A⁻¹.
• There are many algorithms to compute A⁻¹; in the general case, the computational cost is O(n³).
Andre Martins (IST) Lecture 1 IST, Fall 2019 40 / 82
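A short numerical sketch of these facts (not in the original slides; assumes NumPy). Note that for solving Ax = b, `np.linalg.solve` is preferred in practice over forming A⁻¹ explicitly:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])     # det(A) = 5 != 0, so A is invertible
b = np.array([1.0, 2.0])

Ainv = np.linalg.inv(A)
assert np.allclose(A @ Ainv, np.eye(2))   # A A^-1 = I

# det(A^-1) = 1 / det(A)
assert np.isclose(np.linalg.det(Ainv), 1.0 / np.linalg.det(A))

# Solve Ax = b directly (cheaper and more stable than computing A^-1):
x = np.linalg.solve(A, b)
assert np.allclose(A @ x, b)
```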
Quadratic Forms and Positive (Semi-)Definite Matrices

• Given a matrix A ∈ R^(n×n) and a vector x ∈ Rⁿ,

  xᵀAx = ∑_{i=1}^n ∑_{j=1}^n A_ij x_i x_j ∈ R

  is called a quadratic form.
• A symmetric matrix A ∈ R^(n×n) is positive semi-definite (PSD) if, for any x ∈ Rⁿ, xᵀAx ≥ 0.
• A symmetric matrix A ∈ R^(n×n) is positive definite (PD) if, for any x ∈ Rⁿ, (x ≠ 0) ⇒ xᵀAx > 0.
• A symmetric matrix A is PSD ⇔ all λ_i(A) ≥ 0.
• A symmetric matrix A is PD ⇔ all λ_i(A) > 0.
Andre Martins (IST) Lecture 1 IST, Fall 2019 41 / 82
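A convenient way to build a PSD matrix is as BBᵀ for any B, since xᵀ(BBᵀ)x = ‖Bᵀx‖2² ≥ 0. A sketch checking the eigenvalue criterion (not in the original slides; assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3))
A = B @ B.T                      # any B B^T is symmetric positive semi-definite

lams = np.linalg.eigvalsh(A)     # eigenvalues of a symmetric matrix, ascending order
assert np.all(lams >= -1e-10)    # PSD <=> all eigenvalues >= 0 (up to round-off)

x = rng.standard_normal(4)
quad = x @ A @ x                 # quadratic form x^T A x
assert quad >= -1e-10            # nonnegative for a PSD matrix
```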
Convex Sets
Andre Martins (IST) Lecture 1 IST, Fall 2019 42 / 82
Convex Functions
Andre Martins (IST) Lecture 1 IST, Fall 2019 43 / 82
Outline
1 Introduction
2 Class Administrativia
3 Recap
Linear Algebra
Probability Theory
Optimization
Andre Martins (IST) Lecture 1 IST, Fall 2019 44 / 82
Probability theory

• “Essentially, all models are wrong, but some are useful.” (G. Box, 1987)
• The study of probability has its roots in games of chance (dice, cards, ...).
• Great names in science: Cardano, Fermat, Pascal, Laplace, Gauss, Huygens, Legendre, Poisson, Kolmogorov, Bernoulli, Cauchy, Gibbs, Boltzmann, de Finetti, ...
• A natural tool to model uncertainty, information, knowledge, belief, ...
• ...and thus also learning, decision making, inference, ...
Andre Martins (IST) Lecture 1 IST, Fall 2019 45 / 82
What is probability?

• Classical definition: P(A) = N_A / N,
  with N mutually exclusive, equally likely outcomes, N_A of which result in the occurrence of event A. (Laplace, 1814)
  Example: P(randomly drawn card is ♣) = 13/52.
  Example: P(getting 1 in throwing a fair die) = 1/6.
• Frequentist definition: P(A) = lim_{N→∞} N_A / N,
  the relative frequency of occurrence of A in an infinite number of trials.
• Subjective probability: P(A) is a degree of belief. (de Finetti, 1930s)
  This gives meaning to P(“Tomorrow it will rain”).
Andre Martins (IST) Lecture 1 IST, Fall 2019 46 / 82
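The frequentist definition suggests a simulation: estimate P(getting 1 with a fair die) by its relative frequency over many trials. A small sketch (not in the original slides; the seed and trial count are arbitrary choices):

```python
import random

random.seed(0)
N = 100_000
# Count how many of N simulated throws of a fair die come up 1.
hits = sum(1 for _ in range(N) if random.randint(1, 6) == 1)
estimate = hits / N   # relative frequency; approaches 1/6 ≈ 0.1667 as N grows
```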
Key concepts: Sample space and events

• Sample space X = set of possible outcomes of a random experiment. Examples:
  • Tossing two coins: X = {HH, TH, HT, TT}
  • Roulette: X = {1, 2, ..., 36}
  • Drawing a card from a shuffled deck: X = {A♣, 2♣, ..., Q♦, K♦}
• An event A is a subset of X: A ⊆ X. Examples:
  • “exactly one H in a 2-coin toss”: A = {TH, HT} ⊂ {HH, TH, HT, TT}
  • “odd number in the roulette”: B = {1, 3, ..., 35} ⊂ {1, 2, ..., 36}
  • “a ♥ card is drawn”: C = {A♥, 2♥, ..., K♥} ⊂ {A♣, ..., K♦}
Andre Martins (IST) Lecture 1 IST, Fall 2019 47 / 82
Kolmogorov’s Axioms for Probability

• Probability is a function that maps events A ⊆ X into the interval [0, 1].

Kolmogorov’s axioms (1933) for probability P:
• For any A, P(A) ≥ 0
• P(X) = 1
• If A_1, A_2, ... ⊆ X are pairwise disjoint events, then P(⋃_i A_i) = ∑_i P(A_i)

• From these axioms, many results can be derived. Examples:
  • P(∅) = 0
  • C ⊂ D ⇒ P(C) ≤ P(D)
  • P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
  • P(A ∪ B) ≤ P(A) + P(B) (union bound)
Andre Martins (IST) Lecture 1 IST, Fall 2019 48 / 82
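For a finite sample space with equally likely outcomes, these derived identities can be checked exhaustively. A sketch for a fair die (not in the original slides; the events chosen are arbitrary):

```python
from fractions import Fraction

# Fair die: sample space X = {1, ..., 6} with uniform probability.
X = set(range(1, 7))

def P(event):
    # Classical definition: favorable outcomes over total outcomes.
    return Fraction(len(event & X), len(X))

A = {1, 3, 5}   # "odd number"
B = {1, 2, 3}   # "at most 3"

assert P(A | B) == P(A) + P(B) - P(A & B)   # inclusion-exclusion
assert P(A | B) <= P(A) + P(B)              # union bound
assert P(set()) == 0 and P(X) == 1
assert P({2}) <= P({2, 4})                  # monotonicity: C ⊂ D ⇒ P(C) ≤ P(D)
```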
![Page 100: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/100.jpg)
Conditional Probability and Independence
• If P(B) > 0, P(A|B) = P(A ∩ B) / P(B) (conditional prob. of A given B)
• ...satisfies all of Kolmogorov’s axioms:
• For any A ⊆ X, P(A|B) ≥ 0
• P(X|B) = 1
• If A1, A2, ... ⊆ X are disjoint, then P(⋃i Ai | B) = ∑i P(Ai|B)
• Events A, B are independent (A ⊥⊥ B) ⇔ P(A ∩ B) = P(A)P(B).
Andre Martins (IST) Lecture 1 IST, Fall 2019 49 / 82
![Page 103: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/103.jpg)
Conditional Probability and Independence
• If P(B) > 0, P(A|B) = P(A ∩ B) / P(B)
• Events A, B are independent (A ⊥⊥ B) ⇔ P(A ∩ B) = P(A)P(B).
• Relationship with conditional probabilities:
A ⊥⊥ B ⇔ P(A|B) = P(A ∩ B) / P(B) = P(A)P(B) / P(B) = P(A)
• Example: X = “52 cards”, A = {3♥, 3♣, 3♦, 3♠}, and B = {A♥, 2♥, ..., K♥}; then P(A) = 1/13, P(B) = 1/4
P(A ∩ B) = P({3♥}) = 1/52
P(A)P(B) = (1/13)(1/4) = 1/52
P(A|B) = P(“3”|“♥”) = 1/13 = P(A)
Andre Martins (IST) Lecture 1 IST, Fall 2019 50 / 82
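The card example above can be checked by enumerating the deck directly; a small sketch:

```python
from fractions import Fraction

# Build the 52-card deck as (rank, suit) pairs.
ranks = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
suits = ['hearts', 'clubs', 'diamonds', 'spades']
deck = {(r, s) for r in ranks for s in suits}

A = {(r, s) for (r, s) in deck if r == '3'}        # "drew a 3"
B = {(r, s) for (r, s) in deck if s == 'hearts'}   # "drew a heart"

def P(event):
    return Fraction(len(event), len(deck))

assert P(A) == Fraction(1, 13) and P(B) == Fraction(1, 4)
assert P(A & B) == Fraction(1, 52) == P(A) * P(B)  # A and B are independent
assert P(A & B) / P(B) == P(A)                     # hence P(A|B) = P(A)
```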
![Page 109: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/109.jpg)
Bayes Theorem
• Law of total probability: if A1, ...,An are a partition of X
P(B) = ∑i P(B|Ai) P(Ai) = ∑i P(B ∩ Ai)
• Bayes’ theorem: if {A1, ..., An} is a partition of X
P(Ai|B) = P(B ∩ Ai) / P(B) = P(B|Ai) P(Ai) / ∑j P(B|Aj) P(Aj)
Andre Martins (IST) Lecture 1 IST, Fall 2019 51 / 82
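Both formulas are direct to compute; a sketch on a hypothetical two-urn setup (the numbers here are invented for illustration, not from the slides):

```python
# Urn 1 is chosen with prob 0.3 and yields a red ball with prob 0.8;
# urn 2 is chosen with prob 0.7 and yields red with prob 0.4.
# B = "drew a red ball"; {urn1, urn2} partitions the sample space.
priors = {'urn1': 0.3, 'urn2': 0.7}        # P(A_i)
likelihood = {'urn1': 0.8, 'urn2': 0.4}    # P(B | A_i)

# Law of total probability: P(B) = sum_i P(B|A_i) P(A_i)
p_B = sum(likelihood[a] * priors[a] for a in priors)
assert abs(p_B - 0.52) < 1e-12

# Bayes' theorem: P(A_i | B) = P(B|A_i) P(A_i) / P(B)
posterior = {a: likelihood[a] * priors[a] / p_B for a in priors}
assert abs(posterior['urn1'] - 0.24 / 0.52) < 1e-12
assert abs(sum(posterior.values()) - 1.0) < 1e-12   # posteriors sum to 1
```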
![Page 111: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/111.jpg)
Random Variables
• A (real) random variable (RV) is a function: X : X→ R
• Discrete RV: range of X is countable (e.g., N or {0, 1})
• Continuous RV: range of X is uncountable (e.g., R or [0, 1])
• Example: number of heads in tossing two coins, X = {HH, HT, TH, TT}; X(HH) = 2, X(HT) = X(TH) = 1, X(TT) = 0. Range of X = {0, 1, 2}.
• Example: distance traveled by a tossed coin; range of X = R+.
Andre Martins (IST) Lecture 1 IST, Fall 2019 52 / 82
![Page 116: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/116.jpg)
Random Variables: Distribution Function
• Distribution function: FX (x) = P({ω ∈ X : X (ω) ≤ x})
• Example: number of heads in tossing 2 coins; range(X ) = {0, 1, 2}.
• Probability mass function (discrete RV): fX(x) = P(X = x), with FX(x) = ∑xi≤x fX(xi).
Andre Martins (IST) Lecture 1 IST, Fall 2019 53 / 82
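The two-coin example can be worked out by enumerating outcomes; a minimal sketch of the pmf and distribution function:

```python
from fractions import Fraction
from itertools import product

# X(omega) = number of heads in two fair coin tosses.
outcomes = list(product('HT', repeat=2))        # HH, HT, TH, TT
X = {omega: omega.count('H') for omega in outcomes}

# pmf: f_X(x) = P(X = x), with each outcome equally likely
f = {x: Fraction(sum(1 for o in outcomes if X[o] == x), len(outcomes))
     for x in {0, 1, 2}}

# distribution function: F_X(x) = sum of f_X(x_i) over x_i <= x
def F(x):
    return sum(f[xi] for xi in f if xi <= x)

assert f == {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
assert F(1) == Fraction(3, 4) and F(2) == 1
```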
![Page 119: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/119.jpg)
Important Discrete Random Variables
• Uniform: X ∈ {x1, ..., xK}, pmf fX(xi) = 1/K.
• Bernoulli RV: X ∈ {0, 1}, pmf fX(x) = p if x = 1, and 1 − p if x = 0.
Can be written compactly as fX(x) = p^x (1 − p)^(1−x).
• Binomial RV: X ∈ {0, 1, ..., n} (sum of n Bernoulli RVs)
fX(x) = Binomial(x; n, p) = (n choose x) p^x (1 − p)^(n−x)
Binomial coefficients (“n choose x”): (n choose x) = n! / ((n − x)! x!)
Andre Martins (IST) Lecture 1 IST, Fall 2019 54 / 82
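The binomial pmf translates directly to code via `math.comb`; a short sketch, with the Bernoulli recovered as the n = 1 case:

```python
from math import comb

def binomial_pmf(x, n, p):
    """f_X(x) = (n choose x) p^x (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Bernoulli is the n = 1 case: f_X(1) = p, f_X(0) = 1 - p.
assert binomial_pmf(1, 1, 0.3) == 0.3
assert abs(binomial_pmf(0, 1, 0.3) - 0.7) < 1e-12

# The pmf sums to 1 over {0, ..., n}.
assert abs(sum(binomial_pmf(x, 10, 0.3) for x in range(11)) - 1) < 1e-12
```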
![Page 123: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/123.jpg)
More Important Discrete Random Variables
• Geometric(p): X ∈ N, pmf fX(x) = p(1 − p)^(x−1) (e.g., number of trials until the first success).
• Poisson(λ): X ∈ N ∪ {0}, pmf fX(x) = e^(−λ) λ^x / x!
Notice that ∑x=0..∞ λ^x / x! = e^λ, thus ∑x=0..∞ fX(x) = 1.
“...probability of the number of independent occurrences in a fixed(time/space) interval if these occurrences have known average rate”
Andre Martins (IST) Lecture 1 IST, Fall 2019 55 / 82
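Both pmfs can be checked to (numerically) sum to 1 by truncating the infinite sums at a point where the tail is negligible; a sketch:

```python
from math import exp, factorial, isclose

def poisson_pmf(x, lam):
    """f_X(x) = e^(-lambda) lambda^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

# Truncated Poisson sum is ~1, since sum_x lambda^x / x! = e^lambda.
lam = 3.0
assert isclose(sum(poisson_pmf(x, lam) for x in range(100)), 1.0)

# Geometric(p): f_X(x) = p (1-p)^(x-1) over x = 1, 2, ... also sums to ~1.
p = 0.25
assert isclose(sum(p * (1 - p)**(x - 1) for x in range(1, 1000)), 1.0)
```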
![Page 125: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/125.jpg)
Random Variables: Distribution Function
• Distribution function: FX (x) = P({ω ∈ X : X (ω) ≤ x})
• Example: continuous RV with uniform distribution on [a, b].
• Probability density function (pdf, continuous RV): fX(x), with
FX(x) = ∫_{−∞}^x fX(u) du, P(X ∈ [c, d]) = ∫_c^d fX(x) dx, P(X = x) = 0
Andre Martins (IST) Lecture 1 IST, Fall 2019 56 / 82
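For the uniform example, the integrals above have closed forms that a crude midpoint Riemann sum over the pdf should reproduce; a sketch (interval endpoints chosen arbitrarily for illustration):

```python
# Uniform on [a, b]: pdf is 1/(b-a) on [a, b], 0 elsewhere.
a, b = 2.0, 6.0

def pdf(x):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def riemann(lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of pdf over [lo, hi]."""
    h = (hi - lo) / n
    return sum(pdf(lo + (i + 0.5) * h) for i in range(n)) * h

# F_X(4) = (4 - a)/(b - a) = 0.5, and P(X in [3, 5]) = 2/4 = 0.5.
assert abs(riemann(-10, 4) - 0.5) < 1e-3
assert abs(riemann(3, 5) - 0.5) < 1e-3
```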
![Page 131: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/131.jpg)
Important Continuous Random Variables
• Uniform: fX(x) = Uniform(x; a, b) = 1/(b − a) if x ∈ [a, b], 0 if x ∉ [a, b] (previous slide).
• Gaussian: fX(x) = N(x; µ, σ²) = (1/√(2π σ²)) e^{−(x−µ)²/(2σ²)}
• Exponential: fX(x) = Exp(x; λ) = λ e^{−λx} if x ≥ 0, 0 if x < 0
Andre Martins (IST) Lecture 1 IST, Fall 2019 57 / 82
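Writing the Gaussian and exponential densities out in code makes the formulas concrete, and numeric integration confirms each integrates to 1; a sketch (parameter values are arbitrary):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma2):
    """N(x; mu, sigma^2) = exp(-(x-mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return exp(-(x - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def exponential_pdf(x, lam):
    """Exp(x; lambda) = lambda exp(-lambda x) for x >= 0, else 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def integrate(f, lo, hi, n=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

# Both densities integrate to (numerically) 1 once the tails are negligible.
assert abs(integrate(lambda x: gaussian_pdf(x, 1.0, 4.0), -30, 30) - 1) < 1e-6
assert abs(integrate(lambda x: exponential_pdf(x, 0.5), 0, 60) - 1) < 1e-6
```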
![Page 134: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/134.jpg)
Expectation of Random Variables
• Expectation: E(X ) =
∑i
xi fX (xi ) X ∈ {x1, ..., xK} ⊂ R∫ ∞−∞
x fX (x) dx X continuous
• Example: Bernoulli, fX (x) = px (1− p)1−x , for x ∈ {0, 1}.
E(X ) = 0 (1− p) + 1 p = p.
• Example: Binomial, fX (x) =(nx
)px (1− p)n−x , for x ∈ {0, ..., n}.
E(X ) = n p.
• Example: Gaussian, fX (x) = N(x ;µ, σ2). E(X ) = µ.
• Linearity of expectation:E(X + Y ) = E(X ) + E(Y ); E(αX ) = αE(X ), α ∈ R
Andre Martins (IST) Lecture 1 IST, Fall 2019 58 / 82
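The discrete expectation and the Bernoulli/Binomial examples are a few lines of code; a sketch that also checks linearity on a simple joint pmf:

```python
from math import comb, isclose

def expectation(pmf):
    """E(X) = sum_x x f_X(x) for a discrete pmf given as {x: prob}."""
    return sum(x * prob for x, prob in pmf.items())

p, n = 0.3, 10

bernoulli = {0: 1 - p, 1: p}
assert isclose(expectation(bernoulli), p)            # E(X) = p

binomial = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}
assert isclose(expectation(binomial), n * p)         # E(X) = n p

# Linearity: E(X + Y) = E(X) + E(Y). (Linearity needs no independence;
# independence just makes the joint pmf easy to build here.)
joint = {(x, y): px * py for x, px in bernoulli.items()
                         for y, py in bernoulli.items()}
e_sum = sum((x + y) * pxy for (x, y), pxy in joint.items())
assert isclose(e_sum, 2 * p)
```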
![Page 139: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/139.jpg)
Expectation of Functions of Random Variables
• E(g(X)) = ∑i g(xi) fX(xi) if X discrete, g(xi) ∈ R; E(g(X)) = ∫_{−∞}^∞ g(x) fX(x) dx if X continuous
• Example: variance, var(X) = E((X − E(X))²) = E(X²) − E(X)²
• Example: Bernoulli variance, E(X²) = E(X) = p, thus var(X) = p(1 − p).
• Example: Gaussian variance, E((X − µ)²) = σ².
• Probability as expectation of an indicator, 1A(x) = 1 if x ∈ A, 0 if x ∉ A:
P(X ∈ A) = ∫_A fX(x) dx = ∫ 1A(x) fX(x) dx = E(1A(X))
Andre Martins (IST) Lecture 1 IST, Fall 2019 59 / 82
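The Bernoulli variance and the indicator identity can both be verified with a small E(g(X)) helper; a sketch:

```python
from math import isclose

def expectation(pmf, g=lambda x: x):
    """E(g(X)) = sum_x g(x) f_X(x) for a discrete pmf given as {x: prob}."""
    return sum(g(x) * prob for x, prob in pmf.items())

p = 0.3
bernoulli = {0: 1 - p, 1: p}

# var(X) = E(X^2) - E(X)^2; for a Bernoulli, E(X^2) = E(X) = p.
e = expectation(bernoulli)
e2 = expectation(bernoulli, g=lambda x: x**2)
assert isclose(e2, e)
assert isclose(e2 - e**2, p * (1 - p))

# P(X in A) = E(1_A(X)), with 1_A the indicator function of A.
A = {1}
assert isclose(expectation(bernoulli, g=lambda x: 1.0 if x in A else 0.0), p)
```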
![Page 145: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/145.jpg)
Two (or More) Random Variables

• Joint pmf of two discrete RVs: $f_{X,Y}(x,y) = P(X = x \wedge Y = y)$. Extends trivially to more than two RVs.
• Joint pdf of two continuous RVs: $f_{X,Y}(x,y)$, such that $P\big((X,Y) \in A\big) = \int\!\!\int_A f_{X,Y}(x,y)\, dx\, dy$. Extends trivially to more than two RVs.
• Marginalization: $f_Y(y) = \sum_x f_{X,Y}(x,y)$ if $X$ is discrete, and $f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx$ if $X$ is continuous.
• Independence: $X \perp\!\!\!\perp Y \Leftrightarrow f_{X,Y}(x,y) = f_X(x)\, f_Y(y)$, which implies $E(XY) = E(X)\,E(Y)$, but the converse does not hold.
Andre Martins (IST) Lecture 1 IST, Fall 2019 60 / 82
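A small sketch of the last point (the joint pmf below is a standard textbook counterexample, not one from the slides): take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $E(XY) = E(X)E(Y)$ even though $X$ and $Y$ are clearly dependent.

```python
from fractions import Fraction as F

# X uniform on {-1, 0, 1}, Y = X^2: uncorrelated but dependent
joint = {(-1, 1): F(1, 3), (0, 0): F(1, 3), (1, 1): F(1, 3)}

def marginal(joint, axis):
    """Marginalize a joint pmf over the other coordinate."""
    m = {}
    for xy, p in joint.items():
        m[xy[axis]] = m.get(xy[axis], F(0)) + p
    return m

fX, fY = marginal(joint, 0), marginal(joint, 1)
EX = sum(x * p for x, p in fX.items())
EY = sum(y * p for y, p in fY.items())
EXY = sum(x * y * p for (x, y), p in joint.items())
assert EXY == EX * EY                  # E(XY) = E(X)E(Y) holds...
assert joint[(0, 0)] != fX[0] * fY[0]  # ...yet f_XY != f_X f_Y
```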
Conditionals and Bayes' Theorem

• Conditional pmf (discrete RVs): $f_{X|Y}(x|y) = P(X = x \mid Y = y) = \dfrac{P(X = x \wedge Y = y)}{P(Y = y)} = \dfrac{f_{X,Y}(x,y)}{f_Y(y)}$.
• Conditional pdf (continuous RVs): $f_{X|Y}(x|y) = \dfrac{f_{X,Y}(x,y)}{f_Y(y)}$ ...the meaning is technically delicate.
• Bayes' theorem: $f_{X|Y}(x|y) = \dfrac{f_{Y|X}(y|x)\, f_X(x)}{f_Y(y)}$ (pdf or pmf).
• Also valid in the mixed case (e.g., $X$ continuous, $Y$ discrete).
Andre Martins (IST) Lecture 1 IST, Fall 2019 61 / 82
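In the discrete case, Bayes' theorem is a three-line computation; a sketch with a hypothetical two-class example (the prior and likelihood values below are made up for illustration):

```python
def bayes_posterior(prior, likelihood, y):
    """f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / f_Y(y), discrete case.

    prior: {x: f_X(x)}; likelihood: {x: {y: f_{Y|X}(y|x)}}.
    """
    fy = sum(likelihood[x][y] * prior[x] for x in prior)  # marginal f_Y(y)
    return {x: likelihood[x][y] * prior[x] / fy for x in prior}

# hypothetical two-class example
prior = {"spam": 0.4, "ham": 0.6}
lik = {"spam": {"word": 0.8, "noword": 0.2},
       "ham":  {"word": 0.1, "noword": 0.9}}
post = bayes_posterior(prior, lik, "word")
assert abs(sum(post.values()) - 1.0) < 1e-12  # posterior is a valid pmf
assert post["spam"] > prior["spam"]           # evidence raised the posterior
```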
Joint, Marginal, and Conditional Probabilities: An Example

• A pair of binary variables $X, Y \in \{0, 1\}$, with joint pmf (the table itself was lost in extraction; its entries are recovered from the marginal sums below):

  $f_{X,Y}(0,0) = \frac{1}{5}$, $f_{X,Y}(0,1) = \frac{2}{5}$, $f_{X,Y}(1,0) = \frac{1}{10}$, $f_{X,Y}(1,1) = \frac{3}{10}$.

• Marginals: $f_X(0) = \frac{1}{5} + \frac{2}{5} = \frac{3}{5}$, $f_X(1) = \frac{1}{10} + \frac{3}{10} = \frac{4}{10}$,
  $f_Y(0) = \frac{1}{5} + \frac{1}{10} = \frac{3}{10}$, $f_Y(1) = \frac{2}{5} + \frac{3}{10} = \frac{7}{10}$.
• Conditional probabilities: (table shown on the slide; not recoverable from the extraction.)
Andre Martins (IST) Lecture 1 IST, Fall 2019 62 / 82
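The example can be reproduced exactly with rational arithmetic, assuming the joint entries implied by the marginal sums on the slide:

```python
from fractions import Fraction as F

# joint pmf implied by the marginal sums on the slide
joint = {(0, 0): F(1, 5), (0, 1): F(2, 5), (1, 0): F(1, 10), (1, 1): F(3, 10)}

fX = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
fY = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}
assert fX == {0: F(3, 5), 1: F(2, 5)}
assert fY == {0: F(3, 10), 1: F(7, 10)}

# a conditional pmf, e.g. f_{X|Y}(x|1) = f_{X,Y}(x,1) / f_Y(1)
cond = {x: joint[(x, 1)] / fY[1] for x in (0, 1)}
assert cond == {0: F(4, 7), 1: F(3, 7)}
```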
An Important Multivariate RV: Multinomial

• Multinomial: $X = (X_1, \ldots, X_K)$, $X_i \in \{0, \ldots, n\}$, such that $\sum_i X_i = n$:
$$f_X(x_1, \ldots, x_K) = \begin{cases} \binom{n}{x_1\, x_2\, \cdots\, x_K}\, p_1^{x_1} p_2^{x_2} \cdots p_K^{x_K} & \text{if } \sum_i x_i = n \\ 0 & \text{if } \sum_i x_i \neq n \end{cases}
\qquad \binom{n}{x_1\, x_2\, \cdots\, x_K} = \frac{n!}{x_1!\, x_2! \cdots x_K!}$$
Parameters: $p_1, \ldots, p_K \geq 0$, such that $\sum_i p_i = 1$.
• Generalizes the binomial from binary to $K$ classes.
• Example: tossing $n$ independent fair dice, $p_1 = \cdots = p_6 = 1/6$; $x_i$ = number of outcomes with $i$ dots. Of course, $\sum_i x_i = n$.
Andre Martins (IST) Lecture 1 IST, Fall 2019 63 / 82
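The pmf translates directly into code; a sketch using only the standard library, checked on the dice example (and on the $K = 2$ case, where it reduces to the binomial):

```python
from math import factorial, prod

def multinomial_pmf(xs, ps):
    """f_X(x_1..x_K) = n!/(x_1!...x_K!) * prod_i p_i^{x_i}, with n = sum_i x_i."""
    n = sum(xs)
    coef = factorial(n) // prod(factorial(x) for x in xs)
    return coef * prod(p ** x for x, p in zip(xs, ps))

# n = 2 fair dice: P(one 3 and one 5) = 2 * (1/6)^2 = 1/18
p = [1 / 6] * 6
assert abs(multinomial_pmf([0, 0, 1, 0, 1, 0], p) - 1 / 18) < 1e-12

# sanity check: for K = 2 it reduces to the binomial pmf
assert abs(multinomial_pmf([2, 1], [0.5, 0.5]) - 3 * 0.5 ** 3) < 1e-12
```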
An Important Multivariate RV: Gaussian

• Multivariate Gaussian: $X \in \mathbb{R}^n$,
$$f_X(x) = N(x; \mu, C) = \frac{1}{\sqrt{\det(2\pi C)}} \exp\left(-\frac{1}{2}(x - \mu)^T C^{-1} (x - \mu)\right)$$
• Parameters: vector $\mu \in \mathbb{R}^n$ and matrix $C \in \mathbb{R}^{n \times n}$. Expected value: $E(X) = \mu$. Meaning of $C$: next slide.
Andre Martins (IST) Lecture 1 IST, Fall 2019 64 / 82
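The density formula above can be sketched directly with NumPy (a minimal implementation for illustration; production code would reuse a Cholesky factorization rather than calling `det` and `solve` per point):

```python
import numpy as np

def gaussian_pdf(x, mu, C):
    """N(x; mu, C) = det(2*pi*C)^{-1/2} exp(-(x-mu)^T C^{-1} (x-mu) / 2)."""
    d = x - mu
    norm = np.sqrt(np.linalg.det(2 * np.pi * C))
    return np.exp(-0.5 * d @ np.linalg.solve(C, d)) / norm

mu = np.zeros(2)
C = np.eye(2)
# at the mean, the standard 2-D Gaussian density equals 1/(2*pi)
assert abs(gaussian_pdf(mu, mu, C) - 1 / (2 * np.pi)) < 1e-12
```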
Covariance, Correlation, and all that...

• Covariance between two RVs:
$$\mathrm{cov}(X,Y) = E\big[(X - E(X))(Y - E(Y))\big] = E(XY) - E(X)E(Y)$$
• Relationship with variance: $\mathrm{var}(X) = \mathrm{cov}(X,X)$.
• Correlation: $\mathrm{corr}(X,Y) = \rho(X,Y) = \dfrac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)}\sqrt{\mathrm{var}(Y)}} \in [-1, 1]$
• $X \perp\!\!\!\perp Y \Leftrightarrow f_{X,Y}(x,y) = f_X(x)\, f_Y(y)$ implies $\mathrm{cov}(X,Y) = 0$, but the converse does not hold (example).
• Covariance matrix of a multivariate RV, $X \in \mathbb{R}^n$:
$$\mathrm{cov}(X) = E\big[(X - E(X))(X - E(X))^T\big] = E(X X^T) - E(X)E(X)^T$$
• Covariance of a Gaussian RV: $f_X(x) = N(x; \mu, C) \Rightarrow \mathrm{cov}(X) = C$
Andre Martins (IST) Lecture 1 IST, Fall 2019 65 / 82
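The two covariance formulas agree, which is easy to verify exactly on a small discrete joint pmf (the joint below is invented for illustration):

```python
from fractions import Fraction as F

joint = {(0, 0): F(2, 5), (0, 1): F(1, 10), (1, 0): F(1, 10), (1, 1): F(2, 5)}

def E(g):
    """Expectation of g(X, Y) under the joint pmf above."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
cov1 = E(lambda x, y: (x - EX) * (y - EY))  # E[(X - EX)(Y - EY)]
cov2 = E(lambda x, y: x * y) - EX * EY      # E(XY) - E(X)E(Y)
assert cov1 == cov2

varX = E(lambda x, y: (x - EX) ** 2)
varY = E(lambda x, y: (y - EY) ** 2)
rho = float(cov1) / (float(varX) ** 0.5 * float(varY) ** 0.5)
assert -1 <= rho <= 1
```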
More on Expectations and Covariances

Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $a \in \mathbb{R}^n$ a vector.

• If $E(X) = \mu$ and $Y = AX$, then $E(Y) = A\mu$;
• If $\mathrm{cov}(X) = C$ and $Y = AX$, then $\mathrm{cov}(Y) = ACA^T$;
• If $\mathrm{cov}(X) = C$ and $Y = a^T X \in \mathbb{R}$, then $\mathrm{var}(Y) = a^T C a \geq 0$;
• If $\mathrm{cov}(X) = C$ and $Y = C^{-1/2} X$, then $\mathrm{cov}(Y) = I$;
• If $f_X(x) = N(x; 0, I)$ and $Y = \mu + C^{1/2} X$, then $f_Y(y) = N(y; \mu, C)$;
• If $f_X(x) = N(x; \mu, C)$ and $Y = C^{-1/2}(X - \mu)$, then $f_Y(y) = N(y; 0, I)$.
Andre Martins (IST) Lecture 1 IST, Fall 2019 66 / 82
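A Monte Carlo sketch of two of these identities (the numbers below are arbitrary; the Cholesky factor $L$, with $C = LL^T$, is used as a matrix square root in place of the symmetric $C^{1/2}$ on the slide — any factorization of $C$ whitens equally well):

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[2.0, 0.6], [0.6, 1.0]])  # an arbitrary valid covariance matrix
A = np.array([[1.0, 2.0], [0.0, 1.0]])

X = rng.multivariate_normal(np.zeros(2), C, size=200_000)
Y = X @ A.T  # Y = A X, applied sample-wise

# cov(AX) = A C A^T, up to Monte Carlo error
emp = np.cov(Y, rowvar=False)
assert np.allclose(emp, A @ C @ A.T, atol=0.05)

# whitening: cov(L^{-1} X) = L^{-1} (L L^T) L^{-T} = I
L = np.linalg.cholesky(C)
W = X @ np.linalg.inv(L).T
assert np.allclose(np.cov(W, rowvar=False), np.eye(2), atol=0.05)
```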
Central Limit Theorem

Take $n$ independent RVs $X_1, \ldots, X_n$ such that $E[X_i] = \mu_i$ and $\mathrm{var}(X_i) = \sigma_i^2$.

• Their sum, $Y_n = \sum_{i=1}^n X_i$, satisfies
$$E[Y_n] = \sum_{i=1}^n \mu_i \equiv \mu \qquad \mathrm{var}(Y_n) = \sum_i \sigma_i^2 \equiv \sigma^2$$
• ...thus, if $Z_n = \dfrac{Y_n - \mu}{\sigma}$, then $E[Z_n] = 0$ and $\mathrm{var}(Z_n) = 1$.
• Central limit theorem (CLT): under some mild conditions on $X_1, \ldots, X_n$, as $n \to \infty$, $Z_n$ converges in distribution to $N(0, 1)$.
Andre Martins (IST) Lecture 1 IST, Fall 2019 67 / 82
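A simulation sketch of the CLT for i.i.d. uniform variables (the sample sizes and tolerances are arbitrary; for Uniform(0,1), $\mu_i = 1/2$ and $\sigma_i^2 = 1/12$):

```python
import random
import statistics

random.seed(0)
n, trials = 200, 5_000
mu, sigma = n * 0.5, (n / 12) ** 0.5  # mean and std of the sum Y_n

# standardized sums Z_n = (Y_n - mu) / sigma
Z = [(sum(random.random() for _ in range(n)) - mu) / sigma
     for _ in range(trials)]

assert abs(statistics.mean(Z)) < 0.05          # E[Z_n] ~ 0
assert abs(statistics.stdev(Z) - 1.0) < 0.05   # var(Z_n) ~ 1
# Gaussian tail check: P(Z <= 1) should be near Phi(1) ~ 0.8413
frac = sum(z <= 1 for z in Z) / trials
assert abs(frac - 0.8413) < 0.03
```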
Central Limit Theorem

Illustration (figure on the slide; not recoverable from the extraction).
Andre Martins (IST) Lecture 1 IST, Fall 2019 68 / 82
Important Inequalities

• Markov's inequality: if $X \geq 0$ is an RV with expectation $E(X)$, then
$$P(X > t) \leq \frac{E(X)}{t}$$
Simple proof:
$$t\, P(X > t) = \int_t^\infty t\, f_X(x)\, dx \leq \int_t^\infty x\, f_X(x)\, dx = E(X) - \underbrace{\int_0^t x\, f_X(x)\, dx}_{\geq 0} \leq E(X)$$
• Chebyshev's inequality: with $\mu = E(Y)$ and $\sigma^2 = \mathrm{var}(Y)$,
$$P(|Y - \mu| \geq s) \leq \frac{\sigma^2}{s^2}$$
...a simple corollary of Markov's inequality, with $X = |Y - \mu|^2$ and $t = s^2$.
![Page 183: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/183.jpg)
Important Inequalities
• Markov’s ineqality: if X ≥ 0 is an RV with expectation E(X ), then
P(X > t) ≤ E(X )
t
Simple proof:
t P(X > t) =
∫ ∞t
t fX (x) dx ≤∫ ∞t
x fX (x) dx = E(X )−∫ t
0
x fX (x) dx︸ ︷︷ ︸≥0
≤ E(X )
• Chebyshev’s inequality: µ = E(Y ) and σ2 = var(Y ), then
P(|Y − µ| ≥ s) ≤ σ2
s2
...simple corollary of Markov’s inequality, with X = |Y − µ|2, t = s2
Andre Martins (IST) Lecture 1 IST, Fall 2019 69 / 82
![Page 184: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/184.jpg)
Important Inequalities
• Markov’s ineqality: if X ≥ 0 is an RV with expectation E(X ), then
P(X > t) ≤ E(X )
t
Simple proof:
t P(X > t) =
∫ ∞t
t fX (x) dx ≤∫ ∞t
x fX (x) dx = E(X )−∫ t
0
x fX (x) dx︸ ︷︷ ︸≥0
≤ E(X )
• Chebyshev’s inequality: µ = E(Y ) and σ2 = var(Y ), then
P(|Y − µ| ≥ s) ≤ σ2
s2
...simple corollary of Markov’s inequality, with X = |Y − µ|2, t = s2
Andre Martins (IST) Lecture 1 IST, Fall 2019 69 / 82
![Page 185: Lecture 1: Introduction - GitHub Pages · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2019 Andr e Martins (IST) Lecture 1 IST, Fall 2019 1/82](https://reader035.vdocuments.us/reader035/viewer/2022062922/5f061e577e708231d41661e3/html5/thumbnails/185.jpg)
Other Important Inequalities: Cauchy-Schwarz

• Cauchy-Schwarz inequality for RVs:

|E(X Y)| ≤ √(E(X²) E(Y²))

...why? Because ⟨X, Y⟩ ≡ E[XY] is a valid inner product

• Important corollary: let E[X] = µ and E[Y] = ν; then

|cov(X, Y)| = |E[(X − µ)(Y − ν)]| ≤ √(E[(X − µ)²] E[(Y − ν)²]) = √(var(X) var(Y))

• Implication for correlation:

corr(X, Y) = ρ(X, Y) = cov(X, Y) / (√var(X) √var(Y)) ∈ [−1, 1]
Andre Martins (IST) Lecture 1 IST, Fall 2019 70 / 82
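The corollary can be verified on sample moments, for which Cauchy-Schwarz holds exactly. A minimal sketch (not from the slides; the noisy linear relation between X and Y is an arbitrary illustrative choice):

```python
import random

random.seed(0)

# X standard normal; Y a noisy linear function of X, so corr(X, Y) is
# strictly between 0 and 1 (illustrative choice).
n = 10_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

def mean(v):
    return sum(v) / len(v)

mx, my = mean(xs), mean(ys)
cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
var_x = mean([(x - mx) ** 2 for x in xs])
var_y = mean([(y - my) ** 2 for y in ys])

# Correlation coefficient; Cauchy-Schwarz guarantees rho in [-1, 1].
rho = cov / (var_x ** 0.5 * var_y ** 0.5)
print(rho)
```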
Other Important Inequalities: Jensen

• Recall that a real function g is convex if, for any x, y, and α ∈ [0, 1],

g(αx + (1 − α)y) ≤ α g(x) + (1 − α) g(y)

• Jensen's inequality: if g is a real convex function, then

g(E(X)) ≤ E(g(X))

• Examples: E(X)² ≤ E(X²) ⇒ var(X) = E(X²) − E(X)² ≥ 0;
E(log X) ≤ log E(X), for X a positive RV (log is concave, so the inequality is reversed)
Andre Martins (IST) Lecture 1 IST, Fall 2019 71 / 82
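Both example inequalities also hold exactly for sample averages (Jensen applied to the empirical distribution), so they can be checked directly. A small sketch, using an exponential RV as an arbitrary positive example:

```python
import math
import random

random.seed(0)

# Positive RV (exponential with mean 1, an illustrative choice).
n = 50_000
xs = [random.expovariate(1.0) for _ in range(n)]

e_x = sum(xs) / n                        # E(X)
e_x2 = sum(x * x for x in xs) / n        # E(X^2)
e_logx = sum(math.log(x) for x in xs) / n  # E(log X)

# Jensen with g(x) = x^2 (convex):  E(X)^2 <= E(X^2)
# Jensen with log (concave):        E(log X) <= log E(X)
print(e_x ** 2, e_x2, e_logx, math.log(e_x))
```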
Entropy and all that...

Entropy of a discrete RV X ∈ {1, ..., K}:

H(X) = −∑_{x=1}^{K} f_X(x) log f_X(x)

• Positivity: H(X) ≥ 0; H(X) = 0 ⇔ f_X(i) = 1 for exactly one i ∈ {1, ..., K}
• Upper bound: H(X) ≤ log K; H(X) = log K ⇔ f_X(x) = 1/K for all x ∈ {1, ..., K}
• Measure of the uncertainty/randomness of X

Continuous RV X, differential entropy:

h(X) = −∫ f_X(x) log f_X(x) dx

• h(X) can be positive or negative. Example: if f_X(x) = Uniform(x; a, b), then h(X) = log(b − a).
• If f_X(x) = N(x; µ, σ²), then h(X) = ½ log(2πeσ²).
• If var(Y) = σ², then h(Y) ≤ ½ log(2πeσ²): for fixed variance, the Gaussian maximizes differential entropy.
Andre Martins (IST) Lecture 1 IST, Fall 2019 72 / 82
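These bounds are easy to see on concrete pmfs. A minimal sketch (the `entropy` helper and the example pmfs are illustrative, not from the slides; natural log, so entropies are in nats):

```python
import math

def entropy(p):
    """Entropy of a discrete pmf, in nats; 0*log(0) treated as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

K = 4
uniform = [1.0 / K] * K          # maximizes H: H = log K
skewed = [0.7, 0.1, 0.1, 0.1]    # 0 < H < log K
point = [1.0, 0.0, 0.0, 0.0]     # minimizes H: H = 0

print(entropy(uniform), entropy(skewed), entropy(point))

# Differential entropy of N(mu, sigma^2) is 0.5*log(2*pi*e*sigma^2);
# for sigma = 0.1 it is negative, illustrating that h(X) can be < 0.
h_gauss = 0.5 * math.log(2 * math.pi * math.e * 0.1 ** 2)
print(h_gauss)
```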
Kullback-Leibler divergence

Kullback-Leibler divergence (KLD) between two pmfs:

D(f_X‖g_X) = ∑_{x=1}^{K} f_X(x) log( f_X(x) / g_X(x) )

Positivity: D(f_X‖g_X) ≥ 0; D(f_X‖g_X) = 0 ⇔ f_X(x) = g_X(x) for all x ∈ {1, ..., K}

KLD between two pdfs:

D(f_X‖g_X) = ∫ f_X(x) log( f_X(x) / g_X(x) ) dx

Positivity: D(f_X‖g_X) ≥ 0; D(f_X‖g_X) = 0 ⇔ f_X(x) = g_X(x) almost everywhere
Andre Martins (IST) Lecture 1 IST, Fall 2019 73 / 82
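The discrete case is a one-liner. A sketch (the `kl` helper and the example pmfs are illustrative choices) that also shows KLD is not symmetric, so it is not a metric:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) between two discrete pmfs, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

f = [0.5, 0.3, 0.2]
g = [0.1, 0.4, 0.5]

# D(f||g) > 0 for f != g, D(f||f) = 0, and in general D(f||g) != D(g||f).
print(kl(f, g), kl(g, f), kl(f, f))
```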
Mutual information

Mutual information (MI) between two random variables:

I(X; Y) = D(f_{X,Y} ‖ f_X f_Y)

Positivity: I(X; Y) ≥ 0; I(X; Y) = 0 ⇔ X, Y are independent.

MI is a measure of the dependency between two random variables.
Andre Martins (IST) Lecture 1 IST, Fall 2019 74 / 82
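For discrete RVs the definition can be applied directly to a joint pmf. A sketch (the `mutual_info` helper and the two example joints are illustrative): one joint is the outer product of its marginals, so its MI is zero; the other is dependent, so its MI is strictly positive.

```python
import math

def mutual_info(joint):
    """I(X;Y) = D(f_{X,Y} || f_X f_Y) for a joint pmf given as a 2-D list."""
    px = [sum(row) for row in joint]            # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (columns)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pij in enumerate(row):
            if pij > 0:
                mi += pij * math.log(pij / (px[i] * py[j]))
    return mi

# Independent: joint = outer product of marginals (0.4, 0.6) x (0.3, 0.7).
indep = [[0.12, 0.28], [0.18, 0.42]]
# Dependent: mass concentrated on the diagonal.
dep = [[0.4, 0.1], [0.1, 0.4]]

print(mutual_info(indep), mutual_info(dep))
```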
Recommended Reading
• K. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press, 2012.
• L. Wasserman, “All of Statistics: A Concise Course in Statistical Inference”, Springer, 2004.
Andre Martins (IST) Lecture 1 IST, Fall 2019 75 / 82
Outline
1 Introduction
2 Class Administrativia
3 Recap
Linear Algebra
Probability Theory
Optimization
Andre Martins (IST) Lecture 1 IST, Fall 2019 76 / 82
Minimizing a function

• We are given a function f : Rⁿ → R.
• Goal: find x* that minimizes f.
• Global minimum: for any x ∈ Rⁿ, f(x*) ≤ f(x).
• Local minimum: there is a δ > 0 such that ‖x − x*‖ ≤ δ ⇒ f(x*) ≤ f(x).

Are these global minima?
• No: they may be local minima, saddle points, ...
Andre Martins (IST) Lecture 1 IST, Fall 2019 77 / 82
Iterative descent methods

Goal: find the minimum/minimizer of f : Rᵈ → R

• Proceed in small steps in the optimal direction until a stopping criterion is met.
• Gradient descent: updates of the form x^(t+1) ← x^(t) − η^(t) ∇f(x^(t))

Figure: Illustration of gradient descent. The blue circles correspond to the function values at different points, while the red lines correspond to steps taken in the negative gradient direction.
Andre Martins (IST) Lecture 1 IST, Fall 2019 78 / 82
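The update rule above can be sketched in a few lines. A minimal example on a one-dimensional convex quadratic (the objective f(x) = (x − 3)², the constant step size, and the iteration count are all illustrative choices, not from the slides):

```python
# Gradient descent on f(x) = (x - 3)^2, whose unique global
# minimizer is x* = 3 and whose gradient is f'(x) = 2(x - 3).
def grad(x):
    return 2 * (x - 3)

x = 0.0          # starting point x^(0)
eta = 0.1        # constant step size eta^(t)
for _ in range(200):
    x = x - eta * grad(x)   # x^(t+1) <- x^(t) - eta * grad f(x^(t))

print(x)
```

Because f is convex and the step size is small enough (each step contracts the error by the factor 1 − 2η = 0.8), the iterates converge to the global minimizer; on a non-convex f the same rule may stall at a local minimum or saddle point.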
Convex functions

Pro: guarantee of a global minimum ✓

Figure: Illustration of a convex function. The line segment between any two points on the graph lies entirely above the curve.
Andre Martins (IST) Lecture 1 IST, Fall 2019 79 / 82
Non-Convex functions

Con: no guarantee of a global minimum ✗

Figure: Illustration of a non-convex function. Note the line segment intersecting the curve.
Andre Martins (IST) Lecture 1 IST, Fall 2019 80 / 82
Thank you!
Questions?
Andre Martins (IST) Lecture 1 IST, Fall 2019 81 / 82
References I
Andre Martins (IST) Lecture 1 IST, Fall 2019 82 / 82