
Probability Theory Refresher

Mário A. T. Figueiredo

Instituto Superior Técnico & Instituto de Telecomunicações

Lisboa, Portugal

LxMLS 2016: Lisbon Machine Learning School

July 21, 2016


Probability theory

The study of probability has roots in games of chance

Great names of science: Cardano, Fermat, Pascal, Laplace, Kolmogorov, Bernoulli, Poisson, Cauchy, Boltzmann, de Finetti, ...

Natural tool to model uncertainty, information, knowledge, belief, observations, ...

...thus also learning, decision making, inference, science,...


What is probability?

Classical definition: P(A) = N_A / N, with N mutually exclusive, equally likely outcomes, N_A of which result in the occurrence of A. Laplace, 1814

Example: P(randomly drawn card is ♣) = 13/52.

Example: P(getting 1 in throwing a fair die) = 1/6.

Frequentist definition: P(A) = lim_{N→∞} N_A / N, the relative frequency of occurrence of A in an infinite number of trials (illustrated in the sketch after this slide).

Subjective probability: P(A) is a degree of belief. de Finetti, 1930s

...gives meaning to P(“it will rain tomorrow”).
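A minimal Python sketch (not part of the original slides; the die event, seed, and sample sizes are arbitrary illustrative choices) of the frequentist definition: the relative frequency N_A/N of A = "die shows 1" approaches P(A) = 1/6 as N grows.

```python
# Relative frequency of A = "fair die shows 1" over N trials.
import random

random.seed(0)
for n in [100, 10_000, 1_000_000]:
    n_a = sum(1 for _ in range(n) if random.randint(1, 6) == 1)
    print(f"N = {n:>9,}: N_A/N = {n_a / n:.4f}   (P(A) = {1/6:.4f})")
```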


Key concepts: Sample space and events

Sample space X = set of possible outcomes of a random experiment.

Examples:
- Tossing two coins: X = {HH, TH, HT, TT}
- Roulette: X = {1, 2, ..., 36}
- Draw a card from a shuffled deck: X = {A♣, 2♣, ..., Q♦, K♦}.

An event A is a subset of X: A ⊆ X (also written A ∈ 2^X).

Examples:
- "exactly one H in 2-coin toss": A = {TH, HT} ⊂ {HH, TH, HT, TT}.
- "odd number in the roulette": B = {1, 3, ..., 35} ⊂ {1, 2, ..., 36}.
- "drawn a ♥ card": C = {A♥, 2♥, ..., K♥} ⊂ {A♣, ..., K♦}.


Key concepts: Sample space and events

Sample space X = set of possible outcomes of a random experiment.

(More delicate) examples:
- Distance travelled by a tossed die: X = R+
- Location of the next rain drop on a given square tile: X = R²

Properly handling the continuous case requires deeper concepts:

- Let Σ be a collection of subsets of X: Σ ⊆ 2^X
- Σ is a σ-algebra if
  - A ∈ Σ ⇒ A^c ∈ Σ
  - A_1, A_2, ... ∈ Σ ⇒ ⋃_{i=1}^∞ A_i ∈ Σ
- Corollary: if Σ ⊆ 2^X is a σ-algebra, then ∅ ∈ Σ and X ∈ Σ
- Example in R^n: the collection of Lebesgue-measurable sets is a σ-algebra.


Kolmogorov’s Axioms for Probability

Probability is a function that maps events A into the interval [0, 1].

Kolmogorov's axioms (1933) for probability P: Σ → [0, 1]

- For any A, P(A) ≥ 0
- P(X) = 1
- If A_1, A_2, ... ⊆ X are disjoint events, then P(⋃_i A_i) = ∑_i P(A_i)

From these axioms, many results can be derived. Examples:

- P(∅) = 0
- C ⊂ D ⇒ P(C) ≤ P(D)
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
- P(A ∪ B) ≤ P(A) + P(B) (union bound; checked in the sketch below)
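A small brute-force check of these derived results (not from the slides) on the two-coin sample space with equally likely outcomes; the events A and B are illustrative choices.

```python
# Finite sample space of two coin tosses; each outcome has probability 1/4.
X = {"HH", "HT", "TH", "TT"}
P = lambda E: len(E) / len(X)            # uniform probability measure

A = {"HH", "HT"}                         # "first coin is H"
B = {"HH", "TH"}                         # "second coin is H"

assert P(set()) == 0                                  # P(empty set) = 0
assert P(A | B) == P(A) + P(B) - P(A & B)             # inclusion-exclusion
assert P(A | B) <= P(A) + P(B)                        # union bound
print("all derived identities hold on this space")
```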


Conditional Probability and Independence

If P(B) > 0, P(A|B) = P(A ∩ B) / P(B) (conditional probability of A, given B)

...satisfies all of Kolmogorov’s axioms:

- For any A ⊆ X, P(A|B) ≥ 0
- P(X|B) = 1
- If A_1, A_2, ... ⊆ X are disjoint, then P(⋃_i A_i | B) = ∑_i P(A_i|B)

Independence: A, B are independent (denoted A ⊥⊥ B) if and only if P(A ∩ B) = P(A) P(B).


Conditional Probability and Independence

If P(B) > 0, P(A|B) = P(A ∩ B) / P(B)

Events A, B are independent (A ⊥⊥ B) ⇔ P(A ∩ B) = P(A) P(B).

Relationship with conditional probabilities:

A ⊥⊥ B ⇔ P(A|B) = P(A)

Example: X = "52 cards", A = {3♥, 3♠, 3♦, 3♣}, and B = {A♥, 2♥, ..., K♥}; then, P(A) = 1/13, P(B) = 1/4

P(A ∩ B) = P({3♥}) = 1/52

P(A) P(B) = (1/13) × (1/4) = 1/52

P(A|B) = P("3" | "♥") = 1/13 = P(A)
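The same numbers can be verified by enumerating the deck; a Python sketch (the rank/suit encoding is an illustrative choice, not from the slides).

```python
# Exact card-deck check that rank "3" and suit "hearts" are independent.
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "spades", "diamonds", "clubs"]
deck = list(product(ranks, suits))

P = lambda E: Fraction(len(E), len(deck))
A = [c for c in deck if c[0] == "3"]          # the four 3s
B = [c for c in deck if c[1] == "hearts"]     # the thirteen hearts
AB = [c for c in A if c in B]                 # just the 3 of hearts

print(P(A), P(B), P(AB))                      # 1/13, 1/4, 1/52
assert P(AB) == P(A) * P(B)                   # A and B are independent
```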


Bayes' Theorem

Law of total probability: if A_1, ..., A_n is a partition of X,

P(B) = ∑_i P(B|A_i) P(A_i) = ∑_i P(B ∩ A_i)

Bayes' theorem: if {A_1, ..., A_n} is a partition of X,

P(A_i|B) = P(B ∩ A_i) / P(B) = P(B|A_i) P(A_i) / ∑_j P(B|A_j) P(A_j)
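A minimal sketch of the theorem on a three-event partition; the priors P(A_i) and likelihoods P(B|A_i) below are made-up illustrative numbers, not from the slides.

```python
# Bayes' theorem over a partition {A_1, A_2, A_3}.
priors = [0.5, 0.3, 0.2]          # P(A_i), summing to 1
likelihoods = [0.9, 0.5, 0.1]     # P(B | A_i)

p_b = sum(l * p for l, p in zip(likelihoods, priors))    # total probability
posteriors = [l * p / p_b for l, p in zip(likelihoods, priors)]

print(f"P(B) = {p_b:.3f}")
print("P(A_i | B) =", [round(q, 3) for q in posteriors])
assert abs(sum(posteriors) - 1) < 1e-12   # posteriors sum to one
```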


Random Variables

A (real) random variable (RV) is a function X: X → R

- Discrete RV: range of X is countable (e.g., N or {0, 1})
- Continuous RV: range of X is uncountable (e.g., R or [0, 1])

- Example: number of heads in tossing two coins, X = {HH, HT, TH, TT}; X(HH) = 2, X(HT) = X(TH) = 1, X(TT) = 0. Range of X = {0, 1, 2}.

- Example: distance traveled by a tossed coin; range of X = R+.


Random Variables: Distribution Function

Distribution function: F_X(x) = P({ω ∈ X: X(ω) ≤ x})

Example: number of heads in tossing 2 coins; range(X) = {0, 1, 2}.

Probability mass function (discrete RV): f_X(x) = P(X = x),

F_X(x) = ∑_{x_i ≤ x} f_X(x_i).
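A short Python sketch (not from the slides) computing this pmf and the distribution function for the two-coin example, directly from the sample space.

```python
# pmf and cdf of X = number of heads in two fair coin tosses.
from itertools import product

outcomes = list(product("HT", repeat=2))        # HH, HT, TH, TT
pmf = {k: 0.0 for k in (0, 1, 2)}
for w in outcomes:
    pmf[w.count("H")] += 1 / len(outcomes)

def F(x):
    """Distribution function F_X(x) = sum of pmf over x_i <= x."""
    return sum(p for xi, p in pmf.items() if xi <= x)

print(pmf)                          # {0: 0.25, 1: 0.5, 2: 0.25}
print(F(-1), F(0), F(1.5), F(2))    # 0.0 0.25 0.75 1.0
```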


Properties of Distribution Functions

F_X: R → [0, 1] is the distribution function of some r.v. X iff:

- it is non-decreasing: x_1 < x_2 ⇒ F_X(x_1) ≤ F_X(x_2);
- lim_{x→−∞} F_X(x) = 0;
- lim_{x→+∞} F_X(x) = 1;
- it is right semi-continuous: lim_{x→z⁺} F_X(x) = F_X(z).

Further properties:

- P(X = x) = f_X(x) = F_X(x) − lim_{z→x⁻} F_X(z);
- P(z < X ≤ y) = F_X(y) − F_X(z);
- P(X > x) = 1 − F_X(x).


Important Discrete Random Variables

Uniform: X ∈ {x_1, ..., x_K}, pmf f_X(x_i) = 1/K.

Bernoulli RV: X ∈ {0, 1}, pmf f_X(x) = p if x = 1, and 1 − p if x = 0.
Can be written compactly as f_X(x) = p^x (1 − p)^(1−x).

Binomial RV: X ∈ {0, 1, ..., n} (sum of n Bernoulli RVs),

f_X(x) = Binomial(x; n, p) = (n choose x) p^x (1 − p)^(n−x)

Binomial coefficient ("n choose x"): (n choose x) = n! / ((n − x)! x!)
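A sketch of these pmfs in Python; math.comb supplies the binomial coefficient, and the values of n and p are arbitrary illustrative choices.

```python
# Bernoulli and Binomial pmfs, plus a normalization check.
from math import comb

def bernoulli_pmf(x, p):
    return p**x * (1 - p)**(1 - x)          # compact form, x in {0, 1}

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
assert abs(sum(probs) - 1) < 1e-12          # pmf sums to one
print([round(q, 4) for q in probs])
```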


Other Important Discrete Random Variables

Geometric(p): X ∈ N, pmf f_X(x) = p (1 − p)^(x−1) (e.g., number of trials until the first success).

Poisson(λ): X ∈ N ∪ {0}, pmf f_X(x) = e^(−λ) λ^x / x!

Notice that ∑_{x=0}^∞ λ^x / x! = e^λ, thus ∑_{x=0}^∞ f_X(x) = 1.

"...probability of the number of independent occurrences in a fixed (time/space) interval, if these occurrences have a known average rate"
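A numerical sanity check (not from the slides) that both pmfs sum to 1; the infinite sums are truncated at x = 100, an arbitrary cutoff whose tail is negligible for these parameters.

```python
# Truncated sums of the Geometric(p) and Poisson(lambda) pmfs.
from math import exp

p, lam = 0.3, 4.0
geom = sum(p * (1 - p) ** (x - 1) for x in range(1, 100))

term, pois = exp(-lam), exp(-lam)     # x = 0 term of the Poisson pmf
for x in range(1, 100):
    term *= lam / x                   # builds e^(-lam) lam^x / x! iteratively
    pois += term

print(f"sum of Geometric pmf ~ {geom:.12f}")
print(f"sum of Poisson pmf   ~ {pois:.12f}")
```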


Random Variables: Distribution Function

Distribution function: F_X(x) = P({ω ∈ X: X(ω) ≤ x})

Example: continuous RV with uniform distribution on [a, b].

Probability density function (pdf, continuous RV): f_X(x)

F_X(x) = ∫_{−∞}^x f_X(u) du,   P(X ∈ [c, d]) = ∫_c^d f_X(x) dx,   P(X = x) = 0


Important Continuous Random Variables

Uniform: f_X(x) = Uniform(x; a, b) = 1/(b − a) if x ∈ [a, b], and 0 if x ∉ [a, b] (previous slide).

Gaussian: f_X(x) = N(x; µ, σ²) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

Exponential: f_X(x) = Exp(x; λ) = λ e^(−λx) if x ≥ 0, and 0 if x < 0
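As a sanity check, the Gaussian pdf can be integrated numerically; a sketch using a simple Riemann sum over [−10, 10] (interval and grid step are arbitrary choices).

```python
# Numerical check that N(x; 0, 1) integrates to ~1.
from math import exp, pi, sqrt

def gaussian_pdf(x, mu=0.0, var=1.0):
    return exp(-(x - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

dx = 1e-3
grid = [-10 + i * dx for i in range(int(20 / dx))]
integral = sum(gaussian_pdf(x) * dx for x in grid)
print(f"integral of N(x; 0, 1) over [-10, 10] ~ {integral:.6f}")
```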


Expectation of Random Variables

Expectation:

E(X) = ∑_i x_i f_X(x_i), if X ∈ {x_1, ..., x_K} ⊂ R (discrete)
E(X) = ∫_{−∞}^∞ x f_X(x) dx, if X continuous

Example: Bernoulli, f_X(x) = p^x (1 − p)^(1−x), for x ∈ {0, 1}: E(X) = 0 · (1 − p) + 1 · p = p.

Example: Binomial, f_X(x) = (n choose x) p^x (1 − p)^(n−x), for x ∈ {0, ..., n}: E(X) = n p.

Example: Gaussian, f_X(x) = N(x; µ, σ²): E(X) = µ.

Linearity of expectation: E(X + Y) = E(X) + E(Y); E(αX) = α E(X), α ∈ R
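A Monte Carlo sketch (not from the slides) of E(X) = np for the Binomial and of linearity; note that linearity needs no independence, so Y below is deliberately a function of X. Seed and sample sizes are arbitrary.

```python
# Monte Carlo estimates of E(X) and E(X + Y) for dependent X, Y.
import random

random.seed(0)
n, p, trials = 10, 0.3, 100_000
xs = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
ys = [n - x for x in xs]                      # Y depends on X

mean = lambda v: sum(v) / len(v)
print(f"E(X) ~ {mean(xs):.3f}  (np = {n * p})")
print(f"E(X+Y) ~ {mean([x + y for x, y in zip(xs, ys)]):.3f} "
      f"= E(X) + E(Y) ~ {mean(xs) + mean(ys):.3f}")
```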


Expectation of Functions of Random Variables

E(g(X)) = ∑_i g(x_i) f_X(x_i), X discrete, g(x_i) ∈ R
E(g(X)) = ∫_{−∞}^∞ g(x) f_X(x) dx, X continuous

Example: variance, var(X) = E((X − E(X))²) = E(X²) − E(X)²

Example: Bernoulli variance, E(X²) = E(X) = p, thus var(X) = p(1 − p).

Example: Gaussian variance, E((X − µ)²) = σ².

Probability as expectation of an indicator, 1_A(x) = 1 if x ∈ A, and 0 if x ∉ A:

P(X ∈ A) = ∫_A f_X(x) dx = ∫ 1_A(x) f_X(x) dx = E(1_A(X))
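A sketch of P(X ∈ A) = E(1_A(X)) by Monte Carlo, with X ~ N(0, 1) and A = (−1, 1] (both illustrative choices, not from the slides).

```python
# Estimate P(a < X <= b) as the sample mean of the indicator 1_A(X).
import random

random.seed(0)
a, b, trials = -1.0, 1.0, 200_000
indicator = lambda x: 1.0 if a < x <= b else 0.0
samples = [random.gauss(0.0, 1.0) for _ in range(trials)]
estimate = sum(indicator(x) for x in samples) / trials
print(f"P({a} < X <= {b}) ~ {estimate:.3f}  (exact ~ 0.683)")
```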


Two (or More) Random Variables

Joint pmf of two discrete RVs: f_{X,Y}(x, y) = P(X = x ∧ Y = y).

Extends trivially to more than two RVs.

Joint pdf of two continuous RVs: f_{X,Y}(x, y), such that

P((X, Y) ∈ A) = ∫∫_A f_{X,Y}(x, y) dx dy, A ∈ σ(R²)

Extends trivially to more than two RVs.

Marginalization:

f_Y(y) = ∑_x f_{X,Y}(x, y), if X is discrete
f_Y(y) = ∫_{−∞}^∞ f_{X,Y}(x, y) dx, if X continuous

Independence: X ⊥⊥ Y ⇔ f_{X,Y}(x, y) = f_X(x) f_Y(y), which implies, but is not implied by, E(X Y) = E(X) E(Y) (see the sketch below).
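A classic counterexample for the missing converse, sketched in Python: X uniform on {−1, 0, 1} and Y = X² satisfy E(XY) = E(X) E(Y), yet they are clearly not independent.

```python
# Uncorrelated but dependent: X uniform on {-1, 0, 1}, Y = X^2.
from fractions import Fraction

xs = [-1, 0, 1]
f = Fraction(1, 3)                              # f_X(x) = 1/3 for each x

E_X = sum(f * x for x in xs)                    # 0
E_Y = sum(f * x**2 for x in xs)                 # 2/3
E_XY = sum(f * x * x**2 for x in xs)            # 0
print(E_XY == E_X * E_Y)                        # True: uncorrelated

# But f_{X,Y}(0, 0) = 1/3 != f_X(0) f_Y(0) = (1/3)(1/3): not independent.
print(Fraction(1, 3) == f * Fraction(1, 3))     # False
```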


Conditionals and Bayes’ Theorem

Conditional pmf (discrete RVs):

f_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x ∧ Y = y) / P(Y = y) = f_{X,Y}(x, y) / f_Y(y).

Conditional pdf (continuous RVs): f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y) ...the meaning is technically delicate.

Bayes' theorem: f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / f_Y(y) (pdf or pmf).

Also valid in the mixed case (e.g., X continuous, Y discrete).


Joint, Marginal, and Conditional Probabilities: An Example

A pair of binary variables X, Y ∈ {0, 1}, with joint pmf (shown as a table
on the original slide; the entries below are recovered from the stated marginals):

    fX,Y(0, 0) = 1/5,   fX,Y(0, 1) = 2/5,
    fX,Y(1, 0) = 1/10,  fX,Y(1, 1) = 3/10.

Marginals: fX(0) = 1/5 + 2/5 = 3/5,    fX(1) = 1/10 + 3/10 = 4/10,
           fY(0) = 1/5 + 1/10 = 3/10,  fY(1) = 2/5 + 3/10 = 7/10.

Conditional probabilities follow by dividing by the marginals; for example,
fX|Y(0|0) = (1/5) / (3/10) = 2/3 (checked numerically below).
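The numbers above can be verified with exact rational arithmetic; a small sketch using the reconstructed table (Python's fractions module keeps every fraction exact):

    from fractions import Fraction as F

    # Joint pmf from the example, indexed as f[(x, y)].
    f = {(0, 0): F(1, 5), (0, 1): F(2, 5), (1, 0): F(1, 10), (1, 1): F(3, 10)}

    fX = {x: sum(p for (x_, _), p in f.items() if x_ == x) for x in (0, 1)}
    fY = {y: sum(p for (_, y_), p in f.items() if y_ == y) for y in (0, 1)}
    print(fX)   # {0: Fraction(3, 5), 1: Fraction(2, 5)}   (2/5 = 4/10)
    print(fY)   # {0: Fraction(3, 10), 1: Fraction(7, 10)}

    # Conditionals via fX|Y(x|y) = fX,Y(x, y) / fY(y), as on the previous slide.
    fX_given_Y0 = {x: f[(x, 0)] / fY[0] for x in (0, 1)}
    print(fX_given_Y0)   # {0: Fraction(2, 3), 1: Fraction(1, 3)}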


An Important Multivariate RV: Multinomial

Multinomial: X = (X1, ..., XK), Xi ∈ {0, ..., n}, such that Σ_i Xi = n, with pmf

    fX(x1, ..., xK) = ( n choose x1 x2 ··· xK ) p1^{x1} p2^{x2} ··· pK^{xK},  if Σ_i xi = n
    fX(x1, ..., xK) = 0,                                                      if Σ_i xi ≠ n

where the multinomial coefficient is

    ( n choose x1 x2 ··· xK ) = n! / (x1! x2! ··· xK!)

Parameters: p1, ..., pK ≥ 0, such that Σ_i pi = 1.

Generalizes the binomial from binary to K classes.

Example: tossing n independent fair dice, p1 = ··· = p6 = 1/6;
xi = number of outcomes with i dots. Of course, Σ_i xi = n.
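A short sketch evaluating this pmf for the dice example, once directly from the formula and once with SciPy (scipy.stats.multinomial), assuming SciPy is available:

    from math import factorial, prod
    from scipy.stats import multinomial

    n, p = 10, [1 / 6] * 6          # 10 fair dice
    x = [2, 1, 3, 0, 2, 2]          # counts per face; they sum to n

    # Multinomial coefficient n! / (x1! x2! ... xK!), then the product p_i^{x_i}.
    coef = factorial(n) // prod(factorial(xi) for xi in x)
    pmf_manual = coef * prod(pi ** xi for pi, xi in zip(p, x))

    print(pmf_manual)
    print(multinomial.pmf(x, n=n, p=p))   # same value via SciPy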


An Important Multivariate RV: Gaussian

Multivariate Gaussian: X ∈ R^n,

    fX(x) = N(x; µ, C) = (1 / √det(2π C)) exp( −(1/2) (x − µ)^T C^{−1} (x − µ) )

Parameters: vector µ ∈ R^n and matrix C ∈ R^{n×n}.
Expected value: E(X) = µ. Meaning of C: next slide.
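As a sanity check, the density formula can be transcribed directly into NumPy and compared against SciPy's implementation (µ, C, and x below are hypothetical, chosen only for illustration):

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 1.0])
    C = np.array([[2.0, 0.5],
                  [0.5, 1.0]])      # symmetric, positive definite
    x = np.array([0.3, 0.8])

    # Direct transcription of N(x; mu, C); solve() avoids forming C^{-1}.
    d = x - mu
    pdf_manual = np.exp(-0.5 * d @ np.linalg.solve(C, d)) \
                 / np.sqrt(np.linalg.det(2 * np.pi * C))

    print(pdf_manual)
    print(multivariate_normal.pdf(x, mean=mu, cov=C))   # agrees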


Covariance, Correlation, and all that...

Covariance between two RVs:

    cov(X, Y) = E[(X − E(X)) (Y − E(Y))] = E(X Y) − E(X) E(Y)

Relationship with variance: var(X) = cov(X, X).

Correlation:

    corr(X, Y) = ρ(X, Y) = cov(X, Y) / (√var(X) √var(Y)) ∈ [−1, 1]

As with expectations of products,

    X ⊥⊥ Y  ⇔  fX,Y(x, y) = fX(x) fY(y)  ⇒ (but not ⇐)  cov(X, Y) = 0.

Covariance matrix of a multivariate RV X ∈ R^n:

    cov(X) = E[(X − E(X)) (X − E(X))^T] = E(X X^T) − E(X) E(X)^T

Covariance of a Gaussian RV: fX(x) = N(x; µ, C) ⇒ cov(X) = C.
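The one-way implication (independence gives zero covariance, but not conversely) has a classic counterexample, Y = X² with X symmetric around zero; a quick Monte Carlo sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000)
    y = x ** 2                       # fully determined by X, hence dependent

    # cov(X, Y) = E(X^3) - E(X) E(X^2) = 0 for a symmetric X.
    print(np.cov(x, y)[0, 1])        # ~0: uncorrelated, yet not independent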


More on Expectations and Covariances

Let A ∈ R^{n×n} be a matrix and a ∈ R^n a vector.

  1. If E(X) = µ and Y = A X, then E(Y) = A µ;
  2. If E(X) = µ and Y = X − µ, then E(Y) = 0;
  3. If cov(X) = C and Y = A X, then cov(Y) = A C A^T;
  4. If cov(X) = C and Y = a^T X ∈ R, then var(Y) = a^T C a ≥ 0;
  5. If cov(X) = C and Y = C^{−1/2} X, then cov(Y) = I.

Combining the 2nd and 5th facts (centering, then whitening) is called
standardization. A numerical check of these identities is sketched below.
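A Monte Carlo sketch of facts 3 and 5 (C and A below are hypothetical; the equalities hold up to sampling error):

    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 3, 200_000
    C = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
    X = rng.multivariate_normal(mean=np.zeros(n), cov=C, size=N)  # rows = samples

    A = rng.standard_normal((n, n))
    print(np.allclose(np.cov((X @ A.T).T), A @ C @ A.T, atol=0.1))  # cov(AX) = A C A^T

    # Standardization: center (fact 2), then whiten (fact 5). Any square root
    # of C works; here the Cholesky factor L, with C = L L^T, so cov(L^{-1} X) = I.
    L = np.linalg.cholesky(C)
    Z = np.linalg.solve(L, (X - X.mean(axis=0)).T).T
    print(np.round(np.cov(Z.T), 2))   # ~ identity matrix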


Statistical Inference

Scenario: an observed RV Y depends on unknown variable(s) X.
Goal: given an observation Y = y, infer X.

Two main philosophies:
  Frequentist: X = x is fixed, but unknown;
  Bayesian: X is a RV with pdf/pmf fX(x) (the prior); prior ⇔ knowledge about X.

In both philosophies, a central object is fY|X(y|x), which goes by several
names: likelihood function, observation model, ...

This is not machine learning! fY,X(y, x) is assumed known.

In the Bayesian philosophy, all the knowledge about X is carried by

    fX|Y(x|y) = fY|X(y|x) fX(x) / fY(y) = fY,X(y, x) / fY(y)

...the posterior (or a posteriori) pdf/pmf.


Statistical Inference

The posterior pdf/pmf fX|Y(x|y) has all the information/knowledge
about X, given Y = y (conditionality principle).

How to make an optimal "guess" x̂ about X, given this information?

We first need to define "optimal": a loss function L(x̂, x) ≥ 0 measures the
"loss"/"cost" incurred by guessing x̂ when the truth is x.

The optimal Bayesian decision minimizes the expected loss:

    x̂_Bayes = arg min_x̂ E[L(x̂, X) | Y = y]

where

    E[L(x̂, X) | Y = y] = ∫ L(x̂, x) fX|Y(x|y) dx,   continuous X (estimation)
    E[L(x̂, X) | Y = y] = Σ_x L(x̂, x) fX|Y(x|y),    discrete X (classification)
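For discrete X this minimization is a tiny computation: stack the losses in a matrix and take the guess with the smallest posterior-weighted row sum. A sketch with made-up numbers (posterior and loss matrix are purely illustrative):

    import numpy as np

    posterior = np.array([0.5, 0.3, 0.2])   # fX|Y(x|y) over K = 3 classes
    L = np.array([[0.0, 1.0, 5.0],          # L[x_hat, x]: asymmetric costs
                  [1.0, 0.0, 1.0],
                  [2.0, 1.0, 0.0]])

    expected_loss = L @ posterior            # E[L(x_hat, X) | Y = y], per guess
    print(expected_loss, expected_loss.argmin())   # [1.3 0.7 1.3], decision = 1

    # Note: the MAP class (posterior.argmax() = 0) is *not* chosen here;
    # with the 0/1 loss of the next slide the two decisions would coincide.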


Classical Statistical Inference Criteria

Consider that X ∈ {1, ..., K} (discrete/classification problem).

Adopt the 0/1 loss: L(x̂, x) = 0 if x̂ = x, and 1 otherwise.

Optimal Bayesian decision:

    x̂_Bayes = arg min_x̂ Σ_{x=1}^{K} L(x̂, x) fX|Y(x|y)
             = arg min_x̂ (1 − fX|Y(x̂|y))
             = arg max_x̂ fX|Y(x̂|y) ≡ x̂_MAP

MAP = maximum a posteriori criterion.

The same criterion can be derived for continuous X.


Classical Statistical Inference Criteria

Consider the MAP criterion: x̂_MAP = arg max_x fX|Y(x|y).

From Bayes' law,

    x̂_MAP = arg max_x fY|X(y|x) fX(x) / fY(y) = arg max_x fY|X(y|x) fX(x)

...so we only need to know the posterior fX|Y(x|y) up to a normalization factor.

It is also common to write

    x̂_MAP = arg max_x ( log fY|X(y|x) + log fX(x) )

If the prior is flat, fX(x) = C, then

    x̂_MAP = arg max_x fY|X(y|x) ≡ x̂_ML

ML = maximum likelihood criterion.


Statistical Inference: Example

Observed: n i.i.d. (independent, identically distributed) Bernoulli RVs,
Y = (Y1, ..., Yn), with Yi ∈ {0, 1}.
Common pmf: fYi|X(y|x) = x^y (1 − x)^{1−y}, where x ∈ [0, 1].

Likelihood function:

    fY|X(y1, ..., yn | x) = Π_{i=1}^{n} x^{yi} (1 − x)^{1−yi}

Log-likelihood function:

    log fY|X(y1, ..., yn | x) = n log(1 − x) + log( x / (1 − x) ) Σ_{i=1}^{n} yi

Maximum likelihood:

    x̂_ML = arg max_x fY|X(y|x) = (1/n) Σ_{i=1}^{n} yi

Example: n = 10, observed y = (1, 1, 1, 0, 1, 0, 0, 1, 1, 1), so x̂_ML = 7/10.
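A quick numerical confirmation that a grid maximization of the log-likelihood matches the closed-form sample mean:

    import numpy as np

    y = np.array([1, 1, 1, 0, 1, 0, 0, 1, 1, 1])
    n = len(y)

    def loglik(x):
        # log fY|X(y | x) = n log(1 - x) + log(x / (1 - x)) * sum_i y_i
        return n * np.log(1 - x) + np.log(x / (1 - x)) * y.sum()

    grid = np.linspace(0.001, 0.999, 999)
    print(grid[np.argmax(loglik(grid))])   # ~0.7
    print(y.mean())                        # closed form: 7/10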


Statistical Inference: Example (Continuation)

Observed: n i.i.d. Bernoulli RVs. Likelihood:

    fY|X(y1, ..., yn | x) = Π_{i=1}^{n} x^{yi} (1 − x)^{1−yi} = x^{Σ_i yi} (1 − x)^{n − Σ_i yi}

How to express knowledge that (e.g.) X is around 1/2? A convenient
choice: a conjugate prior, i.e., form of the posterior = form of the prior.

  - In our case, the Beta pdf: fX(x) ∝ x^{α−1} (1 − x)^{β−1},  α, β > 0.
  - Posterior: fX|Y(x|y) ∝ x^{α−1+Σ_i yi} (1 − x)^{β−1+n−Σ_i yi}.
  - MAP: x̂_MAP = (α + Σ_i yi − 1) / (α + β + n − 2).
  - Example: α = 4, β = 4, n = 10, y = (1, 1, 1, 0, 1, 0, 0, 1, 1, 1),
    giving x̂_MAP = 0.625 (recall x̂_ML = 0.7).
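The closed-form MAP estimate can likewise be checked against a grid maximization of the (unnormalized) log-posterior:

    import numpy as np

    y = np.array([1, 1, 1, 0, 1, 0, 0, 1, 1, 1])
    n, s = len(y), y.sum()
    alpha, beta = 4, 4

    print((alpha + s - 1) / (alpha + beta + n - 2))   # closed form: 0.625

    # Grid check: log of x^{alpha-1+s} (1-x)^{beta-1+n-s} peaks at the same point.
    grid = np.linspace(0.001, 0.999, 999)
    log_post = (alpha - 1 + s) * np.log(grid) + (beta - 1 + n - s) * np.log(1 - grid)
    print(grid[np.argmax(log_post)])                  # ~0.625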


Another Classical Statistical Inference Criterion

Consider that X ∈ R (continuous/estimation problem).

Adopt the squared error loss: L(x̂, x) = (x̂ − x)².

Optimal Bayesian decision:

    x̂_Bayes = arg min_x̂ E[(x̂ − X)² | Y = y]
             = arg min_x̂ ( x̂² − 2 x̂ E[X | Y = y] )   (the E[X² | Y = y] term does not depend on x̂)
             = E[X | Y = y] ≡ x̂_MMSE

MMSE = minimum mean squared error criterion.

Does not apply to classification problems.


Back to the Bernoulli Example

Observed: n i.i.d. Bernoulli RVs. Likelihood:

    fY|X(y1, ..., yn | x) = Π_{i=1}^{n} x^{yi} (1 − x)^{1−yi} = x^{Σ_i yi} (1 − x)^{n − Σ_i yi}

  - With the conjugate Beta pdf prior: fX(x) ∝ x^{α−1} (1 − x)^{β−1},  α, β > 0.
  - Posterior: fX|Y(x|y) ∝ x^{α−1+Σ_i yi} (1 − x)^{β−1+n−Σ_i yi}.
  - MMSE: x̂_MMSE = (α + Σ_i yi) / (α + β + n).
  - Example: α = 4, β = 4, n = 10, y = (1, 1, 1, 0, 1, 0, 0, 1, 1, 1),
    giving x̂_MMSE ≈ 0.611 (recall that x̂_MAP = 0.625 and x̂_ML = 0.7).

A conjugate prior is equivalent to "virtual" counts;
this is often called smoothing in NLP and ML.
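The three estimators side by side, using the closed forms above:

    y_sum, n = 7, 10
    alpha, beta = 4, 4

    x_ml = y_sum / n
    x_map = (alpha + y_sum - 1) / (alpha + beta + n - 2)
    x_mmse = (alpha + y_sum) / (alpha + beta + n)
    print(x_ml, x_map, x_mmse)   # 0.7, 0.625, 0.6111...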


The Bernstein–von Mises Theorem

In the previous example we had n = 10 and y = (1, 1, 1, 0, 1, 0, 0, 1, 1, 1),
thus Σ_i yi = 7. With a Beta prior with α = 4 and β = 4, we had

    x̂_ML = 0.7,  x̂_MAP = (3 + Σ_i yi) / (6 + n) = 0.625,  x̂_MMSE = (4 + Σ_i yi) / (8 + n) ≈ 0.611

Now consider n = 100 and Σ_i yi = 70, with the same Beta(4, 4) prior:

    x̂_ML = 0.7,  x̂_MAP = 73/106 ≈ 0.689,  x̂_MMSE = 74/108 ≈ 0.685

...both Bayesian estimates are now much closer to the ML estimate.

This illustrates an important result in Bayesian inference, the
Bernstein–von Mises theorem: under (mild) conditions,

    lim_{n→∞} x̂_MAP = lim_{n→∞} x̂_MMSE = x̂_ML

Message: if you have a lot of data, priors don't matter much.
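A sketch of this convergence, holding the empirical fraction of ones fixed at 0.7 and letting n grow (the n = 10 and n = 100 rows reproduce the numbers above):

    alpha, beta = 4, 4
    for n in (10, 100, 1000, 10000):
        s = round(0.7 * n)
        x_map = (alpha + s - 1) / (alpha + beta + n - 2)
        x_mmse = (alpha + s) / (alpha + beta + n)
        print(n, round(x_map, 4), round(x_mmse, 4))   # both columns -> 0.7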


Important Inequalities

Cauchy–Schwarz inequality for RVs:

    E(|X Y|) ≤ √( E(X²) E(Y²) )

Recall that a real function g is convex if, for any x, y, and α ∈ [0, 1],

    g(α x + (1 − α) y) ≤ α g(x) + (1 − α) g(y)

Jensen's inequality: if g is a real convex function, then

    E(g(X)) ≥ g(E(X))

Examples: E(X)² ≤ E(X²) ⇒ var(X) = E(X²) − E(X)² ≥ 0;
E(log X) ≤ log E(X), for X a positive RV (log is concave).
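A Monte Carlo illustration of Jensen's inequality, for a convex g and for the concave log (the exponential RV is chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=1_000_000)   # a positive RV

    # Convex g(x) = x^2: E(X^2) >= E(X)^2.
    print(np.mean(x ** 2), np.mean(x) ** 2)          # ~8 vs ~4

    # log is concave, so the inequality flips: E(log X) <= log E(X).
    print(np.mean(np.log(x)), np.log(np.mean(x)))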


Entropy and all that...

Entropy of a discrete RV X ∈ {1, ..., K}:

    H(X) = − Σ_{x=1}^{K} fX(x) log fX(x)

Positivity: H(X) ≥ 0;
H(X) = 0 ⇔ fX(i) = 1 for exactly one i ∈ {1, ..., K}.

Upper bound: H(X) ≤ log K;
H(X) = log K ⇔ fX(x) = 1/K, for all x ∈ {1, ..., K}.

Entropy is a measure of the uncertainty/randomness of X.

For a continuous RV X, the differential entropy is

    h(X) = − ∫ fX(x) log fX(x) dx

h(X) can be positive or negative. Example: if fX(x) = Uniform(x; a, b),
then h(X) = log(b − a).

If fX(x) = N(x; µ, σ²), then h(X) = (1/2) log(2π e σ²).

If var(Y) = σ², then h(Y) ≤ (1/2) log(2π e σ²); the Gaussian maximizes
differential entropy for a given variance.
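A small sketch of the discrete bounds (entropy in nats, matching the natural log used above):

    import numpy as np

    def H(p):
        # Entropy of a pmf; terms with p(x) = 0 contribute 0 * log 0 = 0.
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log(p[nz]))

    K = 4
    print(H([1, 0, 0, 0]))                 # 0: no uncertainty
    print(H(np.ones(K) / K), np.log(K))    # uniform pmf attains log K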


Kullback-Leibler divergence

Kullback-Leibler divergence (KLD) between two pmfs:

    D(f_X ‖ g_X) = ∑_{x=1}^{K} f_X(x) log [ f_X(x) / g_X(x) ]

Positivity: D(f_X ‖ g_X) ≥ 0; D(f_X ‖ g_X) = 0 ⇔ f_X(x) = g_X(x) for all x ∈ {1, ..., K}.

KLD between two pdfs:

    D(f_X ‖ g_X) = ∫ f_X(x) log [ f_X(x) / g_X(x) ] dx

Positivity: D(f_X ‖ g_X) ≥ 0; D(f_X ‖ g_X) = 0 ⇔ f_X(x) = g_X(x) almost everywhere.
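A small numerical illustration of the discrete case (again an addition, using NumPy and natural logarithms; the name kl_divergence is just a placeholder):

    import numpy as np

    def kl_divergence(f, g):
        # D(f || g) = sum_x f(x) log(f(x)/g(x)), in nats.
        # Assumes g(x) > 0 wherever f(x) > 0; terms with f(x) = 0 contribute 0.
        f = np.asarray(f, dtype=float)
        g = np.asarray(g, dtype=float)
        nz = f > 0
        return np.sum(f[nz] * np.log(f[nz] / g[nz]))

    f = np.array([0.5, 0.3, 0.2])
    g = np.ones(3) / 3
    print(kl_divergence(f, g))  # > 0, since f differs from g
    print(kl_divergence(f, f))  # 0.0: the KLD vanishes iff the two pmfs coincide

Note that, in general, D(f ‖ g) ≠ D(g ‖ f), so the KLD is a divergence rather than a distance.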


Mutual information

Mutual information (MI) between two random variables:

    I(X; Y) = D(f_{X,Y} ‖ f_X f_Y)

Positivity: I(X; Y) ≥ 0; I(X; Y) = 0 ⇔ X and Y are independent.

MI is a measure of the dependency between two random variables.
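The same kind of check works for MI, computed directly as the KLD between a joint pmf and the product of its marginals (an illustrative sketch, not from the slides; natural logarithms again):

    import numpy as np

    def mutual_information(joint):
        # I(X;Y) = D(f_{X,Y} || f_X f_Y) for a joint pmf given as a 2-D array, in nats.
        joint = np.asarray(joint, dtype=float)
        fx = joint.sum(axis=1, keepdims=True)   # marginal pmf of X (rows)
        fy = joint.sum(axis=0, keepdims=True)   # marginal pmf of Y (columns)
        nz = joint > 0
        return np.sum(joint[nz] * np.log(joint[nz] / (fx * fy)[nz]))

    # Independent pair: the joint factorizes, so I(X;Y) = 0
    indep = np.outer([0.4, 0.6], [0.7, 0.3])
    print(mutual_information(indep))            # ≈ 0.0

    # Perfectly dependent pair (X = Y): I(X;Y) = H(X) = log 2
    dep = np.array([[0.5, 0.0], [0.0, 0.5]])
    print(mutual_information(dep))              # ≈ 0.693 = log 2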


Other Stuff

Not covered, but also very important for machine learning:

Exponential families

Basic inequalities (Markov, Chebyshev, etc.)

Stochastic processes (Markov chains, hidden Markov models, ...)


Recommended Reading (Probability and Statistics)

K. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press, 2012 (Chapter 2).

L. Wasserman, “All of Statistics: A Concise Course in Statistical Inference”, Springer, 2004.
