TRANSCRIPT
PROBABILITY THEORY CMPUT 296
Martha White
Winter, 2020
QUICK CHECK ON PRE-REQ KNOWLEDGE
• You should know how to take derivatives
• Will rarely, if ever, integrate
• I will teach you about Partial Derivatives
• I assume you know about vectors and dot products, hopefully also about matrices
• Need to have learned about probability; I will cover
  • Expected Value (Mean)
  • Variance
  • Random Variables
  • Probability Density Functions…
WHY DO WE NEED PROBABILITIES?
• We could just assume a deterministic world
• I see an input x, I can produce the output y = f(x)
• Example: in a game (e.g., chess), you take an action, and the outcome is deterministic
• But, even in a deterministic world, we have a problem: partial observability
• Outcomes look random because we don’t have enough information
• Example: Imagine a (high-tech) gumball machine, where f(x = has candy, battery charged) = output candy
• You can only see if it has candy
WHY DO WE NEED PROBABILITIES?
• But, even in a deterministic world, we have a problem: partial observability
• Outcomes look random because we don’t have enough information
• Example: Imagine a (high-tech) gumball machine, where f(x = has candy, battery charged) = output candy
• You can only see if it has candy
• From your perspective, when x = has candy, sometimes candy is outputted and sometimes not
• Looks stochastic (dependent on hidden battery charge)
SPACE OF OUTCOMES AND EVENTS
SAMPLE SPACES
e.g., F = {∅, {1, 2}, {3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}}
e.g., F = {∅, [0, 0.5], (0.5, 1.0], [0, 1]}
e.g., Outcome of Dice Roll
A FEW COMMENTS ON TERMINOLOGY
• A few new terms, including countable, closure
  • only a small amount of terminology is used; you can google these terms and learn on your own
  • notation sheet in notes
• Countable: integers, {0.1, 2.0, 3.6}, …
• Uncountable: real numbers, intervals, …
• Interchangeably I use (though it's somewhat loose)
  • discrete and countable
  • continuous and uncountable
(MEASURABLE) SPACE OF OUTCOMES AND EVENTS
an event space
WHY IS THIS THE DEFINITION?
Intuitively,
1. A collection of outcomes is an event (e.g., either a 1 or 6 was rolled)
2. If we can measure two events separately, then their union should also be a measurable event
3. If we can measure an event, then we should be able to measure that that event did not occur (the complement)
QUICK CHECK ON BACKGROUND
• The complement of a set
• The union of sets
• A set of sets
• Any other confusing notation?
SAMPLE SPACES
e.g., F = {∅, {1, 2}, {3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}}
e.g., F = {∅, [0, 0.5], (0.5, 1.0], [0, 1]}
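As a sanity check (a sketch, not from the slides), we can verify that the first example event space is closed under complement and union, matching the intuition on the next slide:

```python
# Check (not from the slides) that the example event space
# F = {∅, {1,2}, {3,4,5,6}, Ω} is closed under complement and union.
from itertools import combinations

omega = frozenset({1, 2, 3, 4, 5, 6})
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4, 5, 6}), omega}

# Complement closure: for every event A in F, Ω \ A is also in F.
assert all(omega - A in F for A in F)

# Union closure: for every pair of events in F, their union is also in F.
assert all(A | B in F for A, B in combinations(F, 2))
print("valid event space")
```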
AXIOMS OF PROBABILITY
1. (unit measure) P(Ω) = 1
2. (σ-additivity) Any countable sequence of disjoint events A_1, A_2, … ∈ F satisfies P(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i)
FINDING PROBABILITY DISTRIBUTIONS
Do you recognize this distribution?
PROBABILITY MASS FUNCTIONS
ARBITRARY PMFS
e.g., PMF for a fair die (table of values)
Ω = {1, 2, 3, 4, 5, 6}
p(ω) = 1/6 ∀ω ∈ Ω
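A minimal sketch of this table-of-values pmf, with the validity checks (nonnegative, sums to 1) made explicit:

```python
# Fair-die pmf as a table of values over Ω = {1, ..., 6}.
p = {outcome: 1 / 6 for outcome in range(1, 7)}

# A valid pmf is nonnegative and sums to 1 over Ω.
assert all(prob >= 0 for prob in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12

# The probability of an event is the sum over its outcomes,
# e.g. the event "rolled a 1 or a 6":
print(sum(p[w] for w in {1, 6}))  # 1/6 + 1/6 = 1/3
```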
EXERCISE: EXAMPLES OF EVENTS
• What is a possible event? What is its probability?
• What is the event space?
Ω = {1, 2, 3, 4, 5, 6}
p(ω) = 1/6 ∀ω ∈ Ω
EXERCISE: HOW ARE PMFS USEFUL AS A MODEL?
• Histograms!
• Imagine you recorded your commute times for a year, in minutes (365 recorded times)
• How do you get p(t)?
• How is p(t) useful?
EXERCISE: HOW ARE PMFS USEFUL AS A MODEL?
• Histograms!
• Imagine you recorded your commute times for a year, in minutes
• Get p(t): count the number of times t = 1, 2, 3, … occurs, then normalize the counts by the number of samples
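The count-and-normalize step can be sketched as follows; the commute times here are made-up stand-ins for the 365 recorded values:

```python
from collections import Counter

# Hypothetical recorded commute times, in minutes.
commute_minutes = [22, 25, 22, 30, 25, 22, 28, 25, 22, 30]

counts = Counter(commute_minutes)          # histogram: count each value of t
n = len(commute_minutes)
p = {t: c / n for t, c in counts.items()}  # normalize counts by # samples

assert abs(sum(p.values()) - 1.0) < 1e-12  # a valid empirical pmf
print(p[22])  # 22 minutes occurred 4 times out of 10 -> 0.4
```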
USEFUL PMFS
REMINDERS: JANUARY 9, 2020
• Assignment 1 is available on the website
• https://marthawhite.github.io/mlbasics/
• You should start reading the notes
• Chapters 1, 2 and 3 (about 30 pages)
• The notes are in-progress
• Some sections still say “Coming soon…”
• Avoid printing the full set of notes; at most, print the sections you are currently reading
A FEW OTHER NOTES ON POLICY
• If you have an issue (sickness, injury, family problem, etc.), and need an extension on an assignment, contact me about it before the assignment deadline.
• I cannot give extensions after.
• You cannot submit Thought Questions late
• only Assignments have a late policy that allows for late submission
• if you submit 1 hour late, we won’t penalize you
USEFUL PMFS
e.g., amount of mail received in a day; number of calls received by a call center in an hour
1. Can we use a table for this?
2. How do we know this is a valid pmf? How can you check?
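One way to check: the Poisson support {0, 1, 2, …} is countably infinite, so no finite table can list it, but we can still verify numerically that the pmf sums to 1. A sketch, with λ = 5 as an assumed rate:

```python
import math

# Poisson pmf: p(k) = λ^k e^{-λ} / k!, here with an assumed rate λ = 5.
lam = 5.0

def poisson_pmf(k: int) -> float:
    return lam**k * math.exp(-lam) / math.factorial(k)

# Truncate the (negligible) tail to check the pmf sums to 1.
total = sum(poisson_pmf(k) for k in range(100))
print(abs(total - 1.0) < 1e-9)  # True
```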
EXERCISE: CAN WE USE A POISSON FOR COMMUTE TIMES?
• Used a probability table (histogram) for minutes: count the number of times t = 1, 2, 3, … occurs, then normalize the counts by the number of samples
• Can we use a Poisson?
Why Poisson? Why not just a histogram?
PROBABILITY DENSITY FUNCTIONS
Who has never seen an integral?
e.g. normal distribution (Gaussian)
PMFS VS. PDFS
Example:
- Stopping time of a car, in interval [3, 15]. What is the probability of seeing a stopping time of exactly 3.141596? (How much mass or space does it take up in [3, 15]?)
- More reasonable to ask the probability of stopping between 3 and 3.5 seconds. How do we get that probability?
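We get that probability by integrating the density over the interval. A sketch, assuming (not from the slides) that the stopping time is uniform on [3, 15], so the density is 1/12 everywhere on that interval:

```python
# Assumed density: stopping time T uniform on [3, 15].
def stopping_pdf(t: float) -> float:
    return 1 / 12 if 3 <= t <= 15 else 0.0

# A single point like 3.141596 has zero width, hence probability zero.
# An interval gets the area under the density: here P(3 <= T <= 3.5),
# approximated by a simple Riemann sum.
n = 100_000
width = 0.5 / n
prob = sum(stopping_pdf(3 + (i + 0.5) * width) * width for i in range(n))
print(round(prob, 6))  # 0.5 * (1/12) ≈ 0.041667
```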
USEFUL PDFS
WHY CAN THE DENSITY BE ABOVE 1?
• p(x) can be big, because Δx is small
• P(A) MUST be less than or equal to 1
• p(x) can be bigger than 1
RANDOM VARIABLES
Age: 35 Likes sports: Yes Height: 1.85m Smokes: No Weight: 75kg Marital st.: Single IQ: 104 Occupation: Musician
Age: 26 Likes sports: Yes Height: 1.75m Smokes: No Weight: 79kg Marital st.: Divorced IQ: 103 Occupation: Athlete
Musician is a random variable (a function)
A is a new event, let's call it 1; Not-A is 0
Ω is {0, 1}
Can ask P(M = 0) and P(M = 1)
WE INSTINCTIVELY CREATE THIS TRANSFORMATION
Assume Ω is a set of people.
Compute the probability that a randomly selected person ω ∈ Ω has a cold.
Define event A = {ω ∈ Ω : Disease(ω) = cold}.
Disease is our new random variable, P(Disease = cold)
Disease is a function that maps the outcome space to a new outcome space {cold, not cold}
Disease is a function, which is neither a variable nor random. BUT, this term is still a good one, since we treat Disease as a variable and assume it can take on different values (randomly, according to some distribution)
RANDOM VARIABLE: FORMAL DEFINITION
Example: X : Ω → [0, ∞)
Ω is the set of (measured) people in a population,
with associated measurements such as height and weight
X(ω) = height
A = interval = [5′10″, 5′20″]
P(X ∈ A) = P(5′10″ ≤ X ≤ 5′20″) = P({ω : X(ω) ∈ A})
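The preimage idea above can be sketched directly; the people, heights (in inches), and uniform distribution below are all made up for illustration:

```python
# A random variable is a function X: Ω → ℝ; P(X ∈ A) is the probability
# of the preimage {ω : X(ω) ∈ A}. All data here are hypothetical.
heights = {"ann": 68, "bob": 71, "cam": 62, "dee": 70}
P = {person: 0.25 for person in heights}  # uniform distribution over Ω

def prob_height_in(lo: float, hi: float) -> float:
    preimage = {w for w, h in heights.items() if lo <= h <= hi}
    return sum(P[w] for w in preimage)

print(prob_height_in(68, 72))  # ann, bob, dee qualify -> 0.75
```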
3 MINUTE BREAK AND EXERCISE
• Let X be a random variable that corresponds to the ratio of hard-to-easy problems on an assignment. Assume it takes values in {0.1, 0.25, 0.7}.
• Is this discrete or continuous? Does it have a PMF or PDF?
• Further, where could the variability come from? i.e., why is this a random variable?
• Let X be the stopping time of a car, taking values in [3, 5] ∪ [7, 9]. Is this discrete or continuous?
• Think of an example of a discrete random variable (RV) and a continuous RV
WHAT IF WE HAVE MORE THAN TWO VARIABLES…
• So far, we have considered scalar random variables
• Axioms of probability defined abstractly, apply to vector random variables
Ω = ℝ², e.g., ω = [−0.5, 10]
Ω = [0, 1] × [2, 5], e.g., ω = [0.2, 3.5]
But, when defining probabilities, we will want to consider how the variables interact
TWO DISCRETE RANDOM VARIABLES
Random variables X and Y; outcome spaces 𝒳 and 𝒴
answer is to actually take the expected value (mean) of this gamma distribution, which we define below in Section 1.4. □
1.3 Multivariate random variables
Much of the above development extends to multivariate random variables (a vector of random variables), because the definition of outcome spaces and probabilities is general. The examples so far, however, have dealt with scalar random variables, because for multivariate random variables, we need to understand how variables interact. In this section, we discuss several new notions that only arise when there are multiple random variables, including joint distributions, conditional distributions, marginals and dependence between variables.
Let us start with a simpler example, with two discrete random variables X and Y with outcome spaces 𝒳 and 𝒴. There is a joint probability mass function p : 𝒳 × 𝒴 → [0, 1], and a corresponding joint probability distribution P, such that

p(x, y) = P(X = x, Y = y)

where the pmf needs to satisfy

∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) = 1.
For example, if 𝒳 = {young, old} and 𝒴 = {no arthritis, arthritis}, then the pmf could be the table of probabilities

          Y = 0    Y = 1
  X = 0    1/2     1/100
  X = 1    1/10    39/100
This fits within the definition of probability spaces, because Ω = 𝒳 × 𝒴 is a valid space, and ∑_{ω∈Ω} p(ω) = ∑_{(x,y)∈Ω} p(x, y) = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y). We consider the random variable Z = (X, Y) to be a multivariate random variable, of two dimensions.
Looking at the joint probabilities in the table, we can definitely see that the two random variables interact. The joint probability is small for young and having arthritis, for example. Further, there seems to be more magnitude in the rows corresponding to young, suggesting that the probabilities are also influenced by the proportion of people in the population that are old or young. In fact, one might ask if we can figure out this proportion just from this table.
The answer is a resounding yes, and leads us to marginal distributions and why we might care about marginal distributions. Given a joint distribution over random variables, one would hope that we could extract more specific probabilities, like the distribution over just one of those variables, which is called the marginal distribution. The marginal can be simply computed, by summing up over all values of the other variable
P(X = young) = p(young, no arthritis) + p(young, arthritis) = 51/100.
A young person either does or does not have arthritis, so summing up over these two possible cases factors out that variable. Therefore, using data collected for random variable
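The marginal computation can be sketched numerically using the joint table above (the 0/1 coding for young/old and no-arthritis/arthritis follows the table):

```python
# Joint pmf from the table (X: 0 = young, 1 = old;
# Y: 0 = no arthritis, 1 = arthritis).
joint = {
    (0, 0): 1 / 2,  (0, 1): 1 / 100,
    (1, 0): 1 / 10, (1, 1): 39 / 100,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12  # a valid joint pmf

# Marginal over X: sum out Y.
p_young = sum(joint[(0, y)] for y in (0, 1))
print(p_young)  # 1/2 + 1/100 = 0.51
```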
21
answer is to actually take the expected value (mean) of this gamma distribution, which wedefine below in Section 1.4. ⇤
1.3 Multivariate random variablesMuch of the above development extends to multivariate random variables—a vector of ran-dom variables—because the definition of outcome spaces and probabilities is general. Theexamples so far, however, have dealt with scalar random variables, because for multivariaterandom variables, we need to understand how variables interact. In this section, we discussseveral new notions that only arise when there are multiple random variables, includingjoint distributions, conditional distributions, marginals and dependence between variables.
Let us start with a simpler example, with two discrete random variables X and Y withoutcome spaces X and Y. There is a joint probability mass function p : X ◊ Y æ [0, 1], andcorresponding joint probability distribution P , such that
p(x, y) = P (X = x, Y = y)
where the pmf needs to satisfyÿ
xœX
ÿ
yœY
p(x, y) = 1.
For example, if X = {young, old} and Y = {no arthritis, arthritis}, then the pmf could bethe table of probabilities
Y0 1
X0 1/2 1/1001 1/10 39/100
This fits within the definition of probability spaces, because � = X ◊ Y is a valid space,and
qÊœ� p(Ê) =
q(x,y)œ� p(x, y) =
qxœX
qyœY
p(x, y). We consider the random variableZ = (X, Y ) to be a multivariate random variable, of two dimensions.
Looking at the joint probabilities in the table, we can definitely see that the two randomvariables interact. The joint probability is small for young and having arthritis, for example.Further, there seems to be more magnitude in the rows corresponding to young, suggestingthat the probabilities are also influenced by the proportion of people in the population thatare old or young. In fact, one might ask if we can figure out this proportion just from thistable.
The answer is a resounding yes, and leads us to marginal distributions and why wemight care about marginal distributions. Given a joint distribution over random variables,one would hope that we could extract more specific probabilities, like the distribution overjust one of those variables, which is called the marginal distribution. The marginal can besimply computed, by summing up over all values of the other variable
P(X = young) = p(young, no arthritis) + p(young, arthritis) = 51/100.
A young person either does or does not have arthritis, so summing up over these two possible cases factors out that variable. Therefore, using data collected for random variable
* these numbers are completely made up
about your commute time tomorrow. For this setting, your random variable X corresponds to the commute time, and you need to define probabilities for this random variable. This data could be considered to be discrete, taking values, in minutes, {4, 5, 6, . . . , 26}. You could then create histograms of this data (table of probability values), as shown in Figure 1.4, to reflect the likelihood of commute times.

The commute time, however, is not actually discrete, and so you would like to model it as a continuous RV. One reasonable choice is a gamma distribution. How, though, does one take the recorded data and determine the parameters α, β of the gamma distribution? Estimating these parameters is actually quite straightforward, though not as immediately obvious as estimating tables of probability values; we discuss how to do so in Chapter 3. The learned gamma distribution is also depicted in Figure 1.4.

Given the gamma distribution, one could now ask the question: what is the most likely commute time today? This corresponds to max_ω p(ω), which is called the mode of the distribution. Another natural question is the average or expected commute time. To obtain this, you need the expected value (mean) of this gamma distribution, which we define below in Section 1.4. □
SOME QUESTIONS WE MIGHT ASK NOW THAT WE HAVE TWO RANDOM VARIABLES
Are these two variables related? Or do they change completely independently of each other?
Given this joint distribution, can we determine just the distribution over arthritis? i.e., P(Y = 1)? (Marginal distribution)
If we knew something about one of the variables, say that the person is young, do we know the distribution over Y? (Conditional distribution)
EXAMPLE: MARGINAL AND CONDITIONAL DISTRIBUTION
P(Y = 1) = P(Y = 1, X = 0) + P(Y = 1, X = 1) = 40/100 What is P(Y = 0)?
P(X = 1) = 49/100. So what is P(X = 0)?
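These marginal computations can be checked with a short script (a sketch; the table encoding and `Fraction` arithmetic are our own, not from the slides):

```python
from fractions import Fraction as F

# Joint table: keys (x, y) with X in {0: young, 1: old}, Y in {0: no arthritis, 1: arthritis}
p = {(0, 0): F(1, 2), (0, 1): F(1, 100), (1, 0): F(1, 10), (1, 1): F(39, 100)}

# Marginalize by summing out the other variable
p_y = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}
p_x = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}

print(p_y[1])  # 2/5  (i.e., 40/100)
print(p_y[0])  # 3/5  (i.e., 60/100)
print(p_x[0])  # 51/100
```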
P(Y = 1 | X = 0) = ? Is it 1/100, the value the table gives for P(Y = 1, X = 0)?
No
CONDITIONAL DISTRIBUTIONS
p(y|x) = p(x, y) / p(x)
If p(x,y) is small, does this imply that p(y|x) is small?
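No: a joint probability can be small simply because p(x) is small, as the book example later in these slides illustrates. A toy numeric sketch (all numbers invented for illustration):

```python
from fractions import Fraction as F

# Hypothetical: fiction books (X = 0) make up only 5% of the shelf
p_x0 = F(5, 100)
# so the joint probability of (fiction, y pages) is small for any y ...
p_joint = F(2, 100)
# ... yet conditioned on the book being fiction, this y is quite likely:
# p(y | X = 0) = p(X = 0, y) / p(X = 0)
p_cond = p_joint / p_x0
print(p_cond)  # 2/5
```

A joint of 0.02 looks negligible, but dividing by the small marginal 0.05 gives a conditional probability of 0.4.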
EXERCISE: CONDITIONAL DISTRIBUTION
P(Y = 1 | X = 0) = ?
What is P(Y = 0 | X = 0)? Should P(Y = 1 | X = 0) + P(Y = 0 | X = 0) = 1?
p(y|x) = p(x, y) / p(x)
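For the arthritis table, the exercise works out as follows (a sketch; we encode the table ourselves and check that the conditional distribution normalizes):

```python
from fractions import Fraction as F

# Joint table from the slides: X in {0: young, 1: old}, Y in {0: no arthritis, 1: arthritis}
p = {(0, 0): F(1, 2), (0, 1): F(1, 100), (1, 0): F(1, 10), (1, 1): F(39, 100)}

p_x0 = p[(0, 0)] + p[(0, 1)]          # P(X = 0) = 51/100
p_y1_given_x0 = p[(0, 1)] / p_x0      # (1/100) / (51/100) = 1/51
p_y0_given_x0 = p[(0, 0)] / p_x0      # (1/2) / (51/100) = 50/51

print(p_y1_given_x0, p_y0_given_x0)   # 1/51 50/51
# A conditional pmf is itself a pmf, so it must sum to 1
print(p_y1_given_x0 + p_y0_given_x0)  # 1
```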
JOINT DISTRIBUTIONS FOR MANY VARIABLES

Z = (X, Y), we can determine the proportion of the population that is young and the proportion that is old.
In general, we can consider a d-dimensional random variable X = (X_1, X_2, ..., X_d) with vector-valued outcomes x = (x_1, x_2, ..., x_d), such that each x_i is chosen from some 𝒳_i. Then, for the discrete case, any function p : 𝒳_1 × 𝒳_2 × ... × 𝒳_d → [0, 1] is called a multidimensional probability mass function if

∑_{x_1 ∈ 𝒳_1} ∑_{x_2 ∈ 𝒳_2} ... ∑_{x_d ∈ 𝒳_d} p(x_1, x_2, ..., x_d) = 1,
or, for the continuous case, p : 𝒳_1 × 𝒳_2 × ... × 𝒳_d → [0, ∞) is a multidimensional probability density function if

∫_{𝒳_1} ∫_{𝒳_2} ... ∫_{𝒳_d} p(x_1, x_2, ..., x_d) dx_1 dx_2 ... dx_d = 1.
A marginal distribution is defined for a subset of X = (X_1, X_2, ..., X_d) by summing or integrating over the remaining variables. For the discrete case, the marginal distribution p(x_i) is defined as

p(x_i) = ∑_{x_1 ∈ 𝒳_1} ... ∑_{x_{i-1} ∈ 𝒳_{i-1}} ∑_{x_{i+1} ∈ 𝒳_{i+1}} ... ∑_{x_d ∈ 𝒳_d} p(x_1, ..., x_{i-1}, x_i, x_{i+1}, ..., x_d),
where the variable x_i is fixed to some value and we sum over all possible values of the other variables. Similarly, for the continuous case, the marginal distribution p(x_i) is defined as

p(x_i) = ∫_{𝒳_1} ... ∫_{𝒳_{i-1}} ∫_{𝒳_{i+1}} ... ∫_{𝒳_d} p(x_1, ..., x_{i-1}, x_i, x_{i+1}, ..., x_d) dx_1 ... dx_{i-1} dx_{i+1} ... dx_d.
Notice that we use p to define the density over x, but then we overload this terminology and also use p for the density only over x_i. To be more precise, we should define two separate functions (pdfs), say p_x for the density over the multivariate random variable and p_{x_i} for the marginal. It is common, however, to simply use p, and infer the random variable from context. In most cases, it is clear; if it is not, we will explicitly highlight the pdfs with additional subscripts.
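For discrete variables, the marginal sums above translate directly into code. A sketch for d = 3 binary variables, with a made-up joint pmf (the weighting scheme is invented purely so the table is non-uniform):

```python
from itertools import product
from fractions import Fraction as F

d = 3
outcomes = list(product((0, 1), repeat=d))  # the outcome space X_1 x X_2 x X_3

# Made-up joint pmf: weight each outcome by (1 + number of ones), then normalize
weights = {x: F(1 + sum(x)) for x in outcomes}
Z = sum(weights.values())
p = {x: w / Z for x, w in weights.items()}
assert sum(p.values()) == 1  # valid pmf

def marginal(p, i, xi):
    # p(x_i): fix coordinate i to xi and sum over all values of the other variables
    return sum(pr for x, pr in p.items() if x[i] == xi)

print(marginal(p, 0, 0) + marginal(p, 0, 1))  # 1, since a marginal is itself a pmf
```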
We can define common multivariate pmfs and pdfs that are extensions of the scalar pmfs and pdfs. The most useful extensions include tables of probability values, uniform distributions and Gaussian distributions. We define these concretely in the final section of this chapter, Section 1.5, for reference. Before we provide these definitions, however, it will be useful to understand how multiple variables interact, because this will influence the extension from univariate to multivariate. In particular, it will be useful to understand conditional distributions and dependence, which we discuss next.
1.3.1 Conditional distributions

Conditional probabilities define probabilities of one random variable, given information about the value of another. More formally, the conditional probability p(y|x) for two random variables X and Y is defined as

p(y|x) = p(x, y) / p(x)    (1.1)
MARGINAL DISTRIBUTIONS
Natural question: Why do you use p for p(xi) and for p(x1, …., xd)? They have different domains, they can’t be the same function!
DROPPING SUBSCRIPTS
Instead of: p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x)

We will write: p(y|x) = p(x, y) / p(x)
ANOTHER EXAMPLE FOR CONDITIONAL DISTRIBUTIONS
• Let X be a Bernoulli random variable (i.e., takes value 1 with probability alpha, and 0 otherwise)
• Let Y be a random variable in {10, 11, …, 1000}
• p(y | X = 0) and p(y | X = 1) are different distributions
• Two types of books: fiction (X=0) and non-fiction (X=1)
• Let Y correspond to number of pages
• Distribution over number of pages different for fiction and non-fiction books (e.g., average different)
EXAMPLE CONTINUED
• Two types of books: fiction (X=0) and non-fiction (X=1)
• Y corresponds to number of pages
• If most books are non-fiction, p(X = 0, y) is small even if y is a likely number of pages for a fiction book
• p(X = 0) accounts for the fact that the joint probability is small if p(X = 0) is small
• p(y | X = 0) = p(X = 0, y)/p(X = 0)
• p(X = 0, y) = probability that a book is fiction and has y pages (imagine randomly sampling a book)
• p(X = 0) = probability that a book is fiction
ANOTHER EXAMPLE
• Two types of books: fiction (X=0) and non-fiction (X=1)
• Let Y be a random variable over the reals, which corresponds to amount of money made
• p(y | X = 0) and p(y | X = 1) are different distributions
• e.g., even if both p(y | X = 0) and p(y | X = 1) are Gaussian, they likely have different means and variances
THINK-PAIR-SHARE (5 MINUTES)

• How might you use a given Poisson distribution that models commute times? (Hint: recall modes)
• How might you pick lambda for a Poisson distribution, to model commute times? (Hint: the mean of a Poisson is lambda)
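As a nudge toward both questions, the Poisson pmf p(k) = λ^k e^{−λ} / k! is easy to evaluate directly, and its mode (the most likely value) sits at floor(λ) for non-integer λ. A sketch with λ = 10.5 standing in for an average commute of 10.5 minutes (the value is invented for illustration):

```python
import math

lam = 10.5  # hypothetical average commute time in minutes (invented value)

def poisson_pmf(k, lam):
    # p(k) = lam^k * e^(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

# The most likely commute time is the mode of the distribution;
# for non-integer lam, the Poisson mode is floor(lam)
mode = max(range(60), key=lambda k: poisson_pmf(k, lam))
print(mode)  # 10
```

Since the mean of a Poisson is λ, a natural way to pick λ is to use the average of the recorded commute times.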
Review so far
• PMFs (discrete) and PDFs (continuous) • Joint probabilities • Marginals • Conditional probabilities
CHAIN RULE
Two variable example:

p(x, y) = p(x|y) p(y) = p(y|x) p(x)
p(x_1, ..., x_k) = p(x_k) ∏_{i=1}^{k-1} p(x_i | x_{i+1}, ..., x_k)
                 = p(x_1) ∏_{i=2}^{k} p(x_i | x_1, ..., x_{i-1})
CHAIN RULE
Three variable example
p(x, y, z) = p(y|x, z) p(x, z) = p(y|x, z) p(x|z) p(z)
           = p(x|y, z) p(y|z) p(z)
           = p(x|y, z) p(z|y) p(y)
. . .
p(x1, . . . , xk) = p(xk)k�1Y
i=1
p(xi|xi+1, . . . , xk)<latexit sha1_base64="LMJGMgMfsBXQ42t5fd2istmQ/is=">AAACNXicbVDLSgMxFM3UV62vqks3wSJUrGUigm4KRTcuXFSwD2jrkMmkbWhmMiQZaRnnp9z4H6504UIRt/6C6WNhqwcCh3PO5eYeN+RMadt+tVILi0vLK+nVzNr6xuZWdnunpkQkCa0SwYVsuFhRzgJa1Uxz2gglxb7Lad3tX478+j2ViongVg9D2vZxN2AdRrA2kpO9DvMDBxVgi3tCqwIcOP1DWIIj1ZBWKIXnxKyEkru4f4ySscHgg8nF7Agls4NONmcX7THgX4KmJAemqDjZ55YnSOTTQBOOlWoiO9TtGEvNCKdJphUpGmLSx13aNDTAPlXteHx1Ag+M4sGOkOYFGo7V3xMx9pUa+q5J+lj31Lw3Ev/zmpHunLdjFoSRpgGZLOpEHGoBRxVCj0lKNB8agolk5q+Q9LDERJuiM6YENH/yX1I7KSK7iG5Oc+WLaR1psAf2QR4gcAbK4ApUQBUQ8AhewDv4sJ6sN+vT+ppEU9Z0ZhfMwPr+Af93qJM=</latexit><latexit sha1_base64="LMJGMgMfsBXQ42t5fd2istmQ/is=">AAACNXicbVDLSgMxFM3UV62vqks3wSJUrGUigm4KRTcuXFSwD2jrkMmkbWhmMiQZaRnnp9z4H6504UIRt/6C6WNhqwcCh3PO5eYeN+RMadt+tVILi0vLK+nVzNr6xuZWdnunpkQkCa0SwYVsuFhRzgJa1Uxz2gglxb7Lad3tX478+j2ViongVg9D2vZxN2AdRrA2kpO9DvMDBxVgi3tCqwIcOP1DWIIj1ZBWKIXnxKyEkru4f4ySscHgg8nF7Agls4NONmcX7THgX4KmJAemqDjZ55YnSOTTQBOOlWoiO9TtGEvNCKdJphUpGmLSx13aNDTAPlXteHx1Ag+M4sGOkOYFGo7V3xMx9pUa+q5J+lj31Lw3Ev/zmpHunLdjFoSRpgGZLOpEHGoBRxVCj0lKNB8agolk5q+Q9LDERJuiM6YENH/yX1I7KSK7iG5Oc+WLaR1psAf2QR4gcAbK4ApUQBUQ8AhewDv4sJ6sN+vT+ppEU9Z0ZhfMwPr+Af93qJM=</latexit><latexit sha1_base64="LMJGMgMfsBXQ42t5fd2istmQ/is=">AAACNXicbVDLSgMxFM3UV62vqks3wSJUrGUigm4KRTcuXFSwD2jrkMmkbWhmMiQZaRnnp9z4H6504UIRt/6C6WNhqwcCh3PO5eYeN+RMadt+tVILi0vLK+nVzNr6xuZWdnunpkQkCa0SwYVsuFhRzgJa1Uxz2gglxb7Lad3tX478+j2ViongVg9D2vZxN2AdRrA2kpO9DvMDBxVgi3tCqwIcOP1DWIIj1ZBWKIXnxKyEkru4f4ySscHgg8nF7Agls4NONmcX7THgX4KmJAemqDjZ55YnSOTTQBOOlWoiO9TtGEvNCKdJphUpGmLSx13aNDTAPlXteHx1Ag+M4sGOkOYFGo7V3xMx9pUa+q5J+lj31Lw3Ev/zmpHunLdjFoSRpgGZLOpEHGoBRxVCj0lKNB8agolk5q+Q9LDERJuiM6YENH/yX1I7KSK7iG5Oc+WLaR1psAf2QR4gcAbK4ApUQBUQ8AhewDv4sJ6sN+vT+ppEU9Z0ZhfMwPr+Af93qJM=</latexit><latexit 
sha1_base64="LMJGMgMfsBXQ42t5fd2istmQ/is=">AAACNXicbVDLSgMxFM3UV62vqks3wSJUrGUigm4KRTcuXFSwD2jrkMmkbWhmMiQZaRnnp9z4H6504UIRt/6C6WNhqwcCh3PO5eYeN+RMadt+tVILi0vLK+nVzNr6xuZWdnunpkQkCa0SwYVsuFhRzgJa1Uxz2gglxb7Lad3tX478+j2ViongVg9D2vZxN2AdRrA2kpO9DvMDBxVgi3tCqwIcOP1DWIIj1ZBWKIXnxKyEkru4f4ySscHgg8nF7Agls4NONmcX7THgX4KmJAemqDjZ55YnSOTTQBOOlWoiO9TtGEvNCKdJphUpGmLSx13aNDTAPlXteHx1Ag+M4sGOkOYFGo7V3xMx9pUa+q5J+lj31Lw3Ev/zmpHunLdjFoSRpgGZLOpEHGoBRxVCj0lKNB8agolk5q+Q9LDERJuiM6YENH/yX1I7KSK7iG5Oc+WLaR1psAf2QR4gcAbK4ApUQBUQ8AhewDv4sJ6sN+vT+ppEU9Z0ZhfMwPr+Af93qJM=</latexit>
HOW DO WE GET BAYES RULE?
Recall chain rule: p(x, y) = p(x|y)p(y) = p(y|x)p(x)
Bayes rule:

p(y|x) = p(x|y) p(y) / p(x)
EXAMPLE: DRUG TEST
• Imagine p(DT = True | User = True) = 0.99
• Imagine p(User = True) = 0.005
• Imagine p(DT = True | User = False) = 0.01
• What is p(User = True | DT = True)?
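The answer follows directly from Bayes rule, expanding the denominator p(DT = True) with the law of total probability. A minimal sketch using the slide's numbers (the code itself is illustrative, not part of the deck):

```python
# Bayes rule for the drug-test example on the slide.
p_dt_given_user = 0.99      # p(DT = True | User = True)
p_user = 0.005              # p(User = True)
p_dt_given_nonuser = 0.01   # p(DT = True | User = False)

# Law of total probability gives the denominator p(DT = True).
p_dt = p_dt_given_user * p_user + p_dt_given_nonuser * (1 - p_user)

p_user_given_dt = p_dt_given_user * p_user / p_dt
print(round(p_user_given_dt, 3))  # 0.332
```

Despite the 99% test accuracy, a positive test means only about a 33% chance of being a user, because users are rare.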
REMINDERS: JANUARY 14, 2020
• Thought Questions 1 due on January 23 • Chapters 1-3 • You should be reading now • if you read the material before I lecture, it will (a)
help you understand a lot better and (b) be easier to write coherent thought questions
• Assignment 1 due on January 30 • Any questions?
INDEPENDENCE OF RANDOM VARIABLES
X and Y are independent if p(x, y) = p(x) p(y) (we drop subscripts, writing p(x, y) rather than pX,Y(x, y))

X and Y are conditionally independent given Z if p(x, y|z) = p(x|z) p(y|z)
CONDITIONAL INDEPENDENCE EXAMPLES EXAMPLE 7 IN THE NOTES
• Imagine you have a biased coin (does not flip 50% heads and 50% tails, but skewed towards one)
• Let Z = bias of a coin (say outcomes are 0.3, 0.5, 0.8 with associated probabilities 0.7, 0.2, 0.1)
• what other outcome space could we consider? • what kinds of distributions?
• Let X and Y be consecutive flips of the coin • Are X and Y independent? • Are X and Y conditionally independent, given Z?
**(Basic example about an important issue in ML: hidden variables)
CONDITIONAL INDEPENDENCE IS RELATIVE TO DISTRIBUTION YOU PICK, NOT OBJECTIVE FOR ALL DISTRIBUTIONS
• Explained on whiteboard • What are p(X) and p(X | Z)?
CONDITIONAL INDEPENDENCE EXAMPLES EXAMPLE 7 IN THE NOTES
• Are X and Y independent? (don’t know Z) • Are X and Y conditionally independent, given Z?
**(Basic example about an important issue in ML: hidden variables)
[Diagram: Z (the coin's bias) with arrows to X and Y, the two flips, each taking value 0 or 1]

z:    0.3  0.5  0.8    (bias)
p(z): 0.7  0.2  0.1    (probability of that bias)
Imagine you don't know Z and flip two 0s. Does that tell you anything about Z?
p(X, Y | Z) = p(X|Z) p(Y|Z)?

p(X, Y) = p(X) p(Y)?
EXPECTED VALUE (MEAN, AVERAGE)
E[X] = Σ_{x∈𝒳} x p(x)        X : discrete
E[X] = ∫_𝒳 x p(x) dx         X : continuous
Exercise: Biased coin with p(Heads) = 0.8
EXPECTATIONS WITH FUNCTIONS

1.2.5 Expectations and moments

Expectations of functions are defined as sums (or integrals) of function values weighted according to the probability mass (or density) function. Given a probability space (𝒳, B(𝒳), P_X), we consider a function f : 𝒳 → ℂ and define its expectation function as

E[f(X)] = Σ_{x∈𝒳} f(x) p(x)       X : discrete
E[f(X)] = ∫_𝒳 f(x) p(x) dx        X : continuous

Note that we use a capital X for the random variable (with f(X) the random variable transformed by f) and lower case x when it is an instance (e.g., p(x) is the probability of a specific outcome). The capital X in E[f(X)] also specifies that the summation (integration) is defined over p(x); this will become more important later when we consider multiple random variables. It can happen that E[f(X)] = ±∞; in such cases we say that the expectation does not exist or is not well-defined.⁴ For f(x) = x, we have a standard expectation E[X] = Σ x p(x), or the mean value of X. Using f(x) = x^k results in the k-th moment, f(x) = log 1/p(x) gives the well-known entropy function H(X), or differential entropy for continuous random variables, and f(x) = (x − E[X])² provides the variance of a random variable X, denoted by V[X]. Interestingly, the probability of some event A ⊆ 𝒳 ⁵ can also be expressed in the form of expectation; i.e.,

P(A) = E[1(X ∈ A)],

where

1(t) = 1 if t is true, 0 if t is false        (1.5)

is an indicator function. With this, it is possible to express the cumulative distribution function as F_X(t) = E[1(X ∈ (−∞, t])].

Function f(x) inside the expectation can also be complex-valued. For example, φ_X(t) = E[e^{itX}], where i is the imaginary unit, defines the characteristic function of X. The characteristic function is closely related to the inverse Fourier transform of p(x) and is useful in many forms of statistical inference. Several expectation functions are summarized in Table 1.1.

Given two random variables X and Y and a specific value x assigned to X, we define the conditional expectation as

E[f(Y)|x] = Σ_{y∈𝒴} f(y) p(y|x)       Y : discrete
E[f(Y)|x] = ∫_𝒴 f(y) p(y|x) dy        Y : continuous

⁴ There is sometimes disagreement on terminology, and some definitions allow the expected value to be infinite, which, for example, still allows the strong law of large numbers. In that setting, an expectation is not well-defined only if both left and right improper integrals are infinite. For our purposes, this is splitting hairs.
⁵ This notation is a bit loose; we should say A ∈ B(𝒳) instead of A ⊆ 𝒳. We used it to re-emphasize (through this footnote) that some subsets of continuous sets are not measurable.
On the slide, f : 𝒳 → ℝ and E[f(X)] = Σ_{x∈𝒳} f(x) p(x)

Exercise: Imagine you get 10 dollars if a Heads is flipped, lose 3 if Tails.
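With the payoff f(Heads) = 10, f(Tails) = −3 and the biased coin p(Heads) = 0.8 from the earlier exercise, E[f(X)] is a two-term sum; a quick sketch (illustrative, not from the slides):

```python
# Expected payoff: win $10 on heads, lose $3 on tails, with p(Heads) = 0.8.
p = {"H": 0.8, "T": 0.2}
f = {"H": 10.0, "T": -3.0}

# E[f(X)] = sum over outcomes of f(x) p(x).
expected_payoff = sum(p[x] * f[x] for x in p)
print(expected_payoff)  # 7.4
```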
CONDITIONAL EXPECTATIONS
E[Y | X = x] = Σ_{y∈𝒴} y p(y|x)       Y : discrete
E[Y | X = x] = ∫_𝒴 y p(y|x) dy        Y : continuous
Different expected value, depending on which x is observed
3 MINUTE BREAK
E[Y | X = x] = Σ_{y∈𝒴} y p(y|x)       Y : discrete
E[Y | X = x] = ∫_𝒴 y p(y|x) dy        Y : continuous
What is the E[Y | x] for the following?
- p(y | x) = Gaussian with N(mu = x, sigma^2 = 10)
- p(y | x) = Gaussian with N(mu = f(x), sigma^2 = 0.1)
- p(y | x) = Bernoulli with alpha = x
Turn to your neighbour and discuss a question that you have If you have nothing, consider the questions below
PROPERTIES OF EXPECTATIONS
• E[cX] = c E[X], for a constant c • E[X + Y] = E[X] + E[Y] (linearity of expectation) • If X and Y independent, then E[XY] = E[X] E[Y] • E[Y] = E[E[Y | X]], where outer expectation over X
• called Law of Total Expectation • (prove these on the whiteboard)
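A quick numeric check of the first three properties; the two independent fair dice here are an illustrative choice, not from the slides:

```python
# Linearity of expectation and the independence product rule, checked exactly
# for two independent fair six-sided dice X and Y.
vals = range(1, 7)
p = 1 / 6

e_x = sum(v * p for v in vals)  # E[X] = E[Y] = 3.5

# E[X + Y] and E[XY], summing over the 36 equally likely pairs.
e_sum = sum((a + b) * p * p for a in vals for b in vals)
e_prod = sum(a * b * p * p for a in vals for b in vals)

print(round(e_sum, 9))   # 7.0  = E[X] + E[Y]
print(round(e_prod, 9))  # 12.25 = E[X] * E[Y]
```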
VARIANCE
Variance(X) = E[(X − E[X])²]
            = E[X²] − E[X]²
Why? See if you can get this formula
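The identity can be checked numerically (a check, not the derivation the slide asks for) on the biased coin with p(Heads) = 0.8, X = 1 for heads:

```python
# Both forms of the variance for the biased coin p(Heads) = 0.8.
p = 0.8
e_x  = 1 * p + 0 * (1 - p)        # E[X]
e_x2 = 1 * p + 0 * (1 - p)        # E[X^2] (same here, since 1^2 = 1)

var_definition = p * (1 - e_x) ** 2 + (1 - p) * (0 - e_x) ** 2  # E[(X - E[X])^2]
var_shortcut   = e_x2 - e_x ** 2                                # E[X^2] - E[X]^2

print(round(var_definition, 9))  # 0.16
print(round(var_shortcut, 9))    # 0.16
```

Both agree with the Bernoulli formula p(1 − p) = 0.16.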
COVARIANCE
where f : 𝒴 → ℂ is some function. Again, using f(y) = y results in E[Y|x] = Σ y p(y|x) or E[Y|x] = ∫ y p(y|x) dy. We shall see later that under some conditions E[Y|x] is referred to as the regression function. These types of integrals are often seen and evaluated in Bayesian statistics.

For two random variables X and Y we also define

E[f(X, Y)] = Σ_{x∈𝒳} Σ_{y∈𝒴} f(x, y) p(x, y)        X, Y : discrete
E[f(X, Y)] = ∫_𝒳 ∫_𝒴 f(x, y) p(x, y) dx dy          X, Y : continuous

Expectations can also be defined over a single variable

E[f(X, y)] = Σ_{x∈𝒳} f(x, y) p(x)       X : discrete
E[f(X, y)] = ∫_𝒳 f(x, y) p(x) dx        X : continuous

where E[f(X, y)] is now a function of y. We define the covariance function as

Cov[X, Y] = E[(X − E[X])(Y − E[Y])]
          = E[XY] − E[X] E[Y],

with Cov[X, X] = V[X] being the variance of the random variable X. Similarly, we define a correlation function as

Corr[X, Y] = Cov[X, Y] / √(V[X] · V[Y]),

which is simply a covariance function normalized by the product of standard deviations. Both covariance and correlation functions have wide applicability in statistics, machine learning, signal processing and many other disciplines. Several important expectations for two random variables are listed in Table 1.2.

Example 6: Three tosses of a fair coin (yet again). Consider two random variables from Examples 3 and 5, and calculate the expectation and variance for both X and Y. Then calculate E[Y | X = 0].

We start by calculating E[X] = 0 · pX(0) + 1 · pX(1) = 1/2. Similarly,

E[Y] = Σ_{y=0}^{3} y · pY(y)
     = pY(1) + 2 pY(2) + 3 pY(3)
     = 3/2

The conditional expectation can be found as

E[Y | X = 0] = Σ_{y=0}^{3} y · pY|X(y|0)
             = pY|X(1|0) + 2 pY|X(2|0) + 3 pY|X(3|0)
             = 1
PROPERTIES OF VARIANCES
• V[c] = 0 for a constant c • V[c X] = c^2 V[X] • V[X + Y] = V[X] + V[Y] + 2 Cov[X,Y] • If X and Y are independent, V[X + Y] = V[X] + V[Y]
• i.e., Cov[X,Y] = 0
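These properties can be checked exactly on small discrete distributions; the fair dice below are an illustrative choice, not from the slides:

```python
# Checking V[cX] = c^2 V[X] and, for independent X and Y, V[X + Y] = V[X] + V[Y],
# exactly on two independent fair six-sided dice.
vals = list(range(1, 7))
p = 1 / 6

def var(values, probs):
    """Variance of a discrete distribution given outcomes and probabilities."""
    mean = sum(v * q for v, q in zip(values, probs))
    return sum(q * (v - mean) ** 2 for v, q in zip(values, probs))

v_x  = var(vals, [p] * 6)                       # V[X] = 35/12
v_3x = var([3 * v for v in vals], [p] * 6)      # V[3X]

# Distribution of the sum of two independent dice: 36 equally likely pairs.
sums = [a + b for a in vals for b in vals]
v_sum = var(sums, [p * p] * 36)

print(abs(v_3x - 9 * v_x) < 1e-9)    # True
print(abs(v_sum - 2 * v_x) < 1e-9)   # True
```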
INDEPENDENCE AND DECORRELATION
• Independent RVs have zero correlation • How can we tell? • Hint: use Cov[X,Y] = E[XY] - E[X]E[Y]
• Uncorrelated RVs (zero correlation) might be dependent
• Correlation (Pearson’s correlation) shows linear relationships; can miss nonlinear ones
• Example: X normal RV, Y = X^2 (whiteboard)
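The whiteboard example uses a normal X; the sketch below substitutes a symmetric three-point X so the check is exact (an illustrative substitution, any symmetric X works):

```python
# Zero correlation without independence: X uniform on {-1, 0, 1} and Y = X^2.
xs = [-1, 0, 1]
p = 1 / 3

e_x  = sum(p * x for x in xs)         # E[X]  = 0 by symmetry
e_y  = sum(p * x ** 2 for x in xs)    # E[Y]  = 2/3
e_xy = sum(p * x ** 3 for x in xs)    # E[XY] = E[X^3] = 0 by symmetry

cov = e_xy - e_x * e_y
print(cov)  # 0.0 -> uncorrelated

# Yet Y is a deterministic function of X, so they are clearly dependent:
# p(Y = 1 | X = 1) = 1 while p(Y = 1) = 2/3.
```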
ALTERNATIVES: MUTUAL INFORMATION (USING KL)

• KL-divergence measures how different two distributions are
KL(p||q) = Σ_{x∈𝒳} p(x) log [p(x)/q(x)]
         = Σ_{x∈𝒳} p(x) [log p(x) − log q(x)]
         = Σ_{x∈𝒳} p(x) log p(x) − Σ_{x∈𝒳} p(x) log q(x)

or

KL(p||q) = ∫_𝒳 p(x) log [p(x)/q(x)] dx
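A minimal sketch of the discrete formula, evaluated on two Bernoulli distributions (an illustrative choice, not from the slides):

```python
from math import log

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.2]   # Bernoulli(0.8)
q = [0.5, 0.5]   # Bernoulli(0.5)

print(kl(p, p))            # 0.0 -- KL is zero when the distributions match
print(round(kl(p, q), 3))  # 0.193
```

Note that kl(p, q) and kl(q, p) differ: KL is not symmetric, so it is not a distance in the metric sense.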
ALTERNATIVES: MUTUAL INFORMATION

• KL-divergence measures how different two distributions are
• Mutual Information between X and Y: • I(X; Y) = KL(p_{xy} || p_x p_y) • only zero when X and Y are independent • measure of price for encoding (X,Y) as independent
RVs, even when they are not
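Tying this back to Example 7: the two flips of the random-bias coin are marginally dependent, so their mutual information is positive. A sketch (illustrative; the joint is computed by marginalizing out Z):

```python
from math import log

# Mutual information I(X; Y) = KL(p_xy || p_x p_y) for the two coin flips
# of Example 7, whose shared bias Z makes them dependent.
biases, p_z = [0.3, 0.5, 0.8], [0.7, 0.2, 0.1]

# Joint distribution over (x, y), marginalizing out the bias Z.
joint = {(x, y): sum(pz * (z if x else 1 - z) * (z if y else 1 - z)
                     for z, pz in zip(biases, p_z))
         for x in [0, 1] for y in [0, 1]}
p_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in [0, 1]}

mi = sum(pxy * log(pxy / (p_x[x] * p_x[y]))
         for (x, y), pxy in joint.items())
print(mi > 0)  # True -- X and Y are not independent
```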
EXERCISE: MODELLING COMMUTE TIMES
• Let’s imagine we have 365 samples of commute times • Say you wanted to model commute time C as a
continuous RV (takes 7.6 minutes to get to work) • This means we have to specify (or learn) the pdf; how? • One option: pick distribution type (e.g., Gaussian), and
find the “best” parameters that match the data • What are the parameters to learn? • Is a Gaussian a good choice?
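One way the "pick a Gaussian and fit it" option looks in code: the maximum likelihood parameters are just the sample mean and sample variance. The commute-time data below is synthetic (drawn from a gamma, echoing the next slide), purely for illustration:

```python
import random

# Maximum likelihood Gaussian fit to 365 synthetic commute times (minutes).
random.seed(0)
commutes = [random.gammavariate(20, 0.4) for _ in range(365)]  # made-up data

# MLE for a Gaussian: sample mean and (uncorrected) sample variance.
mu = sum(commutes) / len(commutes)
sigma2 = sum((c - mu) ** 2 for c in commutes) / len(commutes)

print(mu > 0 and sigma2 > 0)  # True
```

So there are exactly two parameters to learn. Whether a Gaussian is a good choice is a separate question: it puts mass on negative commute times, which is one reason the next slide prefers a gamma.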
EXERCISE: MODELLING COMMUTE TIMES (CONT.)
• Note a better choice is actually a Gamma dist.
• A Gaussian distribution (or gamma) for commute times extrapolates between the recorded times in minutes
EXERCISE: CONDITIONAL PROBABILITIES
• Using conditional probabilities, we can incorporate other external information (features)
• Let y be the commute time, x the day of the year • Array of conditional probability values —> p(y | x)
• y = 1, 2, … and x = 1, 2, …, 365 • What other x could we use?
EXERCISE: ADDING IN AUXILIARY INFORMATION (1)
• Mean, variance for p(y | x) could depend on value of x • Example: X = 1 if slippery out, and X = 0 else
Gaussian denoted by N

p(y | X = 0) = N(μ0, σ0²)
p(y | X = 1) = N(μ1, σ1²)
EXERCISE: ADDING IN AUXILIARY INFORMATION (2)
• Can incorporate external information (features) by modeling parameter = function(features)
μ = Σ_{j=1}^{d} w_j x_j

p(y | x) = N(μ = Σ_{j=1}^{d} w_j x_j, σ²)

Gaussian denoted by N
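A sketch of sampling from such a conditional model, where the mean is a linear function of the features; the weights and feature values are made up for illustration, not from the lecture:

```python
import random

# Conditional model p(y | x) = N(mu = w . x, sigma^2) with hypothetical weights.
random.seed(1)
w = [2.0, 0.5]     # made-up "learned" weights
sigma = 1.0

def sample_y(x):
    """Draw y ~ N(dot(w, x), sigma^2) for feature vector x."""
    mu = sum(wj * xj for wj, xj in zip(w, x))
    return random.gauss(mu, sigma)

x = [1.0, 4.0]     # e.g. a bias feature and one external measurement
ys = [sample_y(x) for _ in range(10000)]
print(abs(sum(ys) / len(ys) - 4.0) < 0.1)  # sample mean is near mu = 4.0
```

Different x shift the mean of y, which is exactly the "parameter = function(features)" idea on the slide.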
MIXTURES OF DISTRIBUTIONS
MIXTURES OF GAUSSIANS
* Image from https://people.ucsc.edu/~ealdrich/Teaching/Econ114/LectureNotes/kde.html
MIXTURES CAN PRODUCE COMPLEX DISTRIBUTIONS
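A two-component Gaussian mixture density can be written directly from the definitions; the weights and parameters below are made up to produce a bimodal shape like the figure's:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Weighted sum of Gaussian densities: a mixture distribution."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# 30% of the mass around 0 and 70% around 5 gives a bimodal density:
w, mus, sigmas = [0.3, 0.7], [0.0, 5.0], [1.0, 1.0]
print(mixture_pdf(0.0, w, mus, sigmas) > mixture_pdf(2.5, w, mus, sigmas))  # True
```

The density is high near each component mean and dips in between, a shape no single Gaussian can produce.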
THINK-PAIR-SHARE (5 MINUTES)

• Notice that moving to continuous RVs puts more restrictions on the distributions we can define
• For discrete RVs, distributions are tables of probabilities and so are highly flexible
• we can define any possible distribution
• So, why not just discretize our variables? • Example: imagine you have an RV in range [-10,10] • You decide to discretize into chunks of size 0.01 • How many variables do you need to define the PMF? • What if you had instead modelled it as a Gaussian? Or a mixture of two Gaussians?
[Figure: histograms of observed commute times and number of red lights per trip; y-axis: occurrences]
THINK-PAIR-SHARE
• Example: imagine you have an RV in range [-10,10] • You decide to discretize into chunks of size 0.01 • How many variables do you need to define the PMF? • What if you had instead modelled it as a Gaussian? Or a mixture of two Gaussians?
• Additional question if you have time: imagine you have a 2-dim. RV (in [-10, 10] x [-10,10]). Now imagine you discretize to the same level. How many variables in this PMF? (i.e., this multidimensional array?)
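One way to count (this sketches an answer to the exercise, so skip it if you want to work it out first):

```python
# Parameters needed to discretize [-10, 10] into chunks of size 0.01,
# versus fitting parametric distributions.
bins_1d = round((10 - (-10)) / 0.01)
print(bins_1d)   # 2000 table entries (1999 free parameters, since they sum to 1)

gaussian_params = 2          # mean and variance
mixture_params = 2 * 2 + 1   # two means, two variances, one mixture weight

# The 2-d version squares the table size:
bins_2d = bins_1d ** 2
print(bins_2d)   # 4000000 entries in the 2-d PMF table
```

The table grows exponentially with dimension, while the parametric models stay tiny: the trade-off the exercise is pointing at.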
EXERCISE: SAMPLE AVERAGE IS AN UNBIASED ESTIMATOR
Obtain instances x1, ..., xn. What can we say about the sample average?

This sample is random, so we consider i.i.d. random variables X1, ..., Xn. This reflects that we could have seen a different set of instances xi.
E[(1/n) Σ_{i=1}^{n} X_i] = (1/n) Σ_{i=1}^{n} E[X_i]
                         = (1/n) Σ_{i=1}^{n} μ
                         = μ
For any one sample x1, ..., xn, it is unlikely that (1/n) Σ_{i=1}^{n} x_i = μ
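The derivation can also be illustrated by simulation: any single sample average fluctuates around μ, but averaging many independent sample averages recovers μ. The Bernoulli mean 0.39 below is an arbitrary illustrative choice:

```python
import random

# Unbiasedness of the sample average, by simulation.
random.seed(42)
mu, n = 0.39, 50   # true mean of a Bernoulli variable, and the sample size

def sample_average():
    """Average of n i.i.d. Bernoulli(mu) draws: one realization of the estimator."""
    return sum(1 if random.random() < mu else 0 for _ in range(n)) / n

one_avg = sample_average()                       # usually not exactly mu
many = [sample_average() for _ in range(20000)]  # many independent repetitions
grand_mean = sum(many) / len(many)

print(abs(grand_mean - mu) < 0.005)  # True: on average the estimator hits mu
```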