
  • Discrete Random Variables: Chapter 3 – Lecture 8

    Yiren Ding

    Shanghai Qibao Dwight High School

    March 22, 2016


  • Outline

    1 Introduction to Random Variables: Introduction; Random Variables

    2 Probability Mass Function: Example

    3 Introduction to Statistics: Mathematics vs Statistics; Classical vs Bayesian; Example


  • Introduction to Random Variables Introduction

    Introduction

    Consider a chance experiment whose outcomes are qualitative. For example, rolling a special die whose faces show not 1 to 6, but “King Bull,” “Red Baby,” “Closed Feather,” “Elizabeth Charge,” “Flying Sunmous,” and “Happy Sally.”

    Suppose we want to consider the event G of obtaining a strange monster in a single roll.

    So we define a random variable X as the function that shows how many seconds each person can last against God Erlang.

    It turns out that King Bull can last 86400 seconds and Red Baby can last 35690 seconds, while everyone else stands no chance against God Erlang, so they last 0 seconds.

    The event G is thus equivalent to X ≠ 0, and

    P(G) = P(X = 86400) + P(X = 35690).
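
    As a minimal Python sketch of this qualitative-to-numerical mapping (assuming, for illustration, that the six faces are equally likely):

      from fractions import Fraction

      # The qualitative sample space: six monster faces, assumed equally likely.
      faces = ["King Bull", "Red Baby", "Closed Feather",
               "Elizabeth Charge", "Flying Sunmous", "Happy Sally"]
      p = {face: Fraction(1, 6) for face in faces}

      # The random variable X: seconds each monster lasts against God Erlang.
      X = {face: 0 for face in faces}
      X["King Bull"] = 86400
      X["Red Baby"] = 35690

      # P(G) = P(X != 0) = P(X = 86400) + P(X = 35690)
      print(sum(p[face] for face in faces if X[face] != 0))  # 1/3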


  • Introduction to Random Variables Random Variables

    Random Variables

    A random variable is formally defined as a real-valued function on the sample space of a chance experiment.

    Conventionally, we use the capital letters X, Y, and Z.

    A random variable X assigns a numerical value X(ω) to each ω ∈ Ω. For example, a chance experiment is to roll a fair die twice. We define the random variable X as the function that assigns each outcome (i, j) a numerical value i + j, the sum of the two rolls.

    So X can take on any of the values 2, 3, ..., 12.

    So a random variable is a function that takes on its values by chance.

    A random variable X only gets its value after the experiment has been performed. Before the experiment, we can only describe the set of all possible values of X, called the range of X, usually denoted by I.
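
    A short Python sketch that tabulates the probabilities of this sum by enumerating the 36 equally likely outcomes:

      from fractions import Fraction
      from itertools import product

      # Sample space of rolling a fair die twice: all ordered pairs (i, j).
      # The random variable X assigns each outcome the value i + j.
      pmf = {}
      for i, j in product(range(1, 7), repeat=2):
          pmf[i + j] = pmf.get(i + j, 0) + Fraction(1, 36)

      print(sorted(pmf.items()))  # P(X = 2) = 1/36, ..., P(X = 7) = 1/6, ...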


  • Introduction to Random Variables Random Variables

    Examples of Random Variables

    1 your highest ACT score

    2 the number of aviation disasters in the world next year

    3 the total number of times you will fail your quizzes next semester

    4 the amount of rainfall that Shanghai will receive next year

    5 the time it takes until one of your Pokémon eggs hatches

    6 the duration of your next phone call

    7 the remaining lifetime of your laptop’s battery

    The first three examples are discrete random variables, since they take on a discrete (finite or countably infinite) number of values.

    Examples 4–7 are continuous random variables because they take on a continuum (an uncountably infinite number) of values.


  • Probability Mass Function

    Probability Mass Function

    Suppose we associate a discrete random variable X with a chance experiment ε having a well-defined probability space (Ω, P).

    For x ∈ I, the probability mass function of X is defined by

    P(X = x) = P({ω : X(ω) = x}).

    In words, P(X = x) is called the probability mass since it adds all individual probabilities p(ω) such that X(ω) = x.

    For example, a chance experiment is to roll a fair die twice. We define the random variable X as the function that assigns each outcome (i, j) a numerical value min(i, j), the smaller of the two rolls.

    The range of X is I = {1, 2, ..., 6}, and X = 1 refers to the outcomes {(1, 1), ..., (1, 6), (2, 1), ..., (6, 1)}, so P(X = 1) = 11/36, P(X = 2) = 9/36, P(X = 3) = 7/36, P(X = 4) = 5/36, P(X = 5) = 3/36, and P(X = 6) = 1/36.
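
    These masses follow the pattern P(X = x) = (13 − 2x)/36, which a quick Python enumeration confirms:

      from fractions import Fraction
      from itertools import product

      # Tabulate P(X = x) for X = min(i, j) over the 36 equally likely rolls.
      pmf = {}
      for i, j in product(range(1, 7), repeat=2):
          pmf[min(i, j)] = pmf.get(min(i, j), 0) + Fraction(1, 36)

      assert all(pmf[x] == Fraction(13 - 2 * x, 36) for x in range(1, 7))
      assert sum(pmf.values()) == 1  # a PMF must sum to 1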


  • Probability Mass Function Example

    Probability Distribution


  • Probability Mass Function Example

    Example 1.

    In your pocket you have three dimes (10 cents) and two quarters (25 cents). You grab at random two coins from your pocket. What is the probability mass function of the amount you grabbed?

    The sample space of this chance experiment is

    Ω = {(D,D), (D,Q), (Q,D), (Q,Q)}.

    The probabilities are p(D,D) = 3/5 × 2/4 = 3/10, p(Q,Q) = 2/5 × 1/4 = 1/10, p(Q,D) = 2/5 × 3/4 = 3/10, and p(D,Q) = 3/5 × 2/4 = 3/10.

    Let the random variable X denote the total number of cents you grabbed. The range of X is I = {20, 35, 50}.

    The probability mass function of X is given by P(X = 20) = 3/10, P(X = 35) = 3/10 + 3/10 = 3/5, and P(X = 50) = 1/10.
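
    As a sketch, the same PMF can be built in Python by walking over the ordered draws:

      from fractions import Fraction

      counts = {"D": 3, "Q": 2}   # three dimes, two quarters
      cents = {"D": 10, "Q": 25}

      pmf = {}
      for first in counts:
          for second in counts:
              left = counts[second] - (first == second)  # one coin already taken
              prob = Fraction(counts[first], 5) * Fraction(left, 4)
              total = cents[first] + cents[second]
              pmf[total] = pmf.get(total, 0) + prob

      print(pmf)  # {20: 3/10, 35: 3/5, 50: 1/10}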


  • Probability Mass Function Example

    Example 2.

    You have a well-shuffled ordinary deck of 52 cards. You remove the cards one at a time until you get an ace. Let the random variable X be the number of cards removed. What is the probability mass function of X?

    The range of the random variable X in this case is

    I = {1, 2, ..., 49}.

    Obviously P(X = 1) = 4/52, and using the Kindergarten Rule,

    P(X = i) = 48/52 × 47/51 × · · · × (48 − (i − 2))/(52 − (i − 2)) × 4/(52 − (i − 1)).

    Notice here it is not necessary to explicitly list the outcomes in the sample space. All we need is a well-defined probability measure that assigns probabilities to properly chosen events.
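
    A minimal Python sketch of this product formula, with a sanity check that the masses sum to 1:

      from fractions import Fraction

      def p(i):
          """P(X = i): the first i - 1 cards are non-aces, the i-th is an ace."""
          prob = Fraction(1)
          for k in range(i - 1):
              prob *= Fraction(48 - k, 52 - k)     # another non-ace drawn
          return prob * Fraction(4, 52 - (i - 1))  # then an ace

      pmf = {i: p(i) for i in range(1, 50)}
      print(pmf[1], pmf[2])  # 1/13 16/221
      assert sum(pmf.values()) == 1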


  • Probability Mass Function Example

    Probability Distribution


  • Introduction to Statistics Mathematics vs Statistics

    Mathematics vs Statistics

    Statistics and probability theory are distinct disciplines.

    Probability theory is a branch of mathematics. In mathematics, we reason from the general to the specific.

    Given a number of axioms and definitions, we can draw many conclusions in the form of theorems and propositions.

    This is called deductive reasoning.

    Statistics, on the other hand, uses inductive reasoning.

    Statisticians start with the specific, in the form of data, and work toward general conclusions.

    To do so, statisticians must select a method based on one of two schools of thought: classical and Bayesian.


  • Introduction to Statistics Classical vs Bayesian

    Classical vs Bayesian Statistics

    Classical statistics (the frequentist approach) treats a population parameter as a fixed, unknown constant, and we can make an initial claim about this parameter. We call this the null hypothesis H0, which is to be tested.

    The complement of H0 is called the alternative hypothesis H1.

    As a new piece of evidence is observed, given that H0 is true, the probability of obtaining a result at least as extreme as the observed evidence, known as the p-value, is found.

    Incidentally, if we let Ê be the event of obtaining a result at least as extreme as the evidence E, then the p-value is the excess probability P(Ê | H0).

    A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so we reject H0 (NOT necessarily in favor of H1).

    A large p-value (typically > 0.05) indicates weak evidence against the null hypothesis, so we fail to reject H0 (NOT necessarily against H1 either).


  • Introduction to Statistics Classical vs Bayesian

    Classical vs Bayesian Statistics

    Bayesian statistics takes a fundamentally different viewpoint.

    In the Bayesian approach, population parameters are not fixed constants but are treated as random variables with a prior probability distribution. The distribution is constantly updated in light of new information.

    Simply put, classical statistics focuses on P(Ê | H0), whereas the Bayesian approach makes use of P(H0 | E), which is oftentimes what people truly wish to know.

    The Bayesian approach, though less likely to be misunderstood, relies on a proper prior distribution, which can make it somewhat subjective.

    The classical approach is not entirely objective either, since the choice of whether to reject the null hypothesis at the 5% or the 1% significance level is itself subjective.


  • Introduction to Statistics Example

    Example 3 (Classical vs Bayesian).

    Imagine a multiple choice exam consisting of 50 questions, each of which has three possible answers. A student receives a passing grade if he/she correctly answers more than half of the questions. Take the case of a student who manages to answer 26 of the 50 questions correctly and claims not to have studied, but rather to have obtained 26 correct answers merely by guessing. How can you test this claim in both classical and Bayesian statistical approaches?

    Let H0 be the hypothesis that the student answered all questions by luck alone, and let Ê denote the event of correctly answering 26 or more (at least as extreme as 26) of the 50 questions by luck alone.

    Then this excess probability P(Ê | H0) is equal to 0.0049 (how did I find this?), which equals the p-value in classical statistics.

    Since the p-value is so small, we reject the null hypothesis H0 and conclude that the student is bluffing and in fact did prepare for the exam.
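
    To answer the “how did I find this?” question: under H0 the number of correct answers is binomial with n = 50 and p = 1/3, so the p-value is the upper tail from 26 onward. A quick Python sketch:

      from math import comb

      # P(26 or more correct out of 50 by pure guessing, each correct w.p. 1/3)
      p_value = sum(comb(50, k) * (1/3)**k * (2/3)**(50 - k)
                    for k in range(26, 51))
      print(round(p_value, 4))  # 0.0049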


  • Introduction to Statistics Example

    Example 3 – Bayesian approach

    However, one might have information about earlier achievements of the student, which is overlooked by the classical approach.

    The Bayesian approach specifies the prior probability distribution by looking at the student’s performance in earlier exams and homework.

    We proceed to calculate the probability that the student did not prepare for the exam given that he/she answered 26 of the 50 questions correctly.

    For simplicity’s sake, let us assume that there are two possibilities: either the student was totally unprepared (H0) or well prepared (H1).

    Assume based on earlier data that P(H0) = 0.2 and P(H1) = 0.8.

    Also assume that each question has a 1/3 chance of being answered correctly by guessing and a 0.7 chance if the student is well prepared.


  • Introduction to Statistics Example

    Example 3 – Bayesian approach

    Using Bayes’ Theorem in Odds Form, we find the posterior odds to be

    P(H0 | E) / P(H1 | E) = (0.2 / 0.8) × [(50 choose 26) (1/3)^26 (2/3)^24] / [(50 choose 26) (7/10)^26 (3/10)^24] ≈ 0.2204,

    which translates to a probability of 0.1806 that the student did not study given correctly answering 26 of the 50 questions.

    The Bayesian evidence against the hypothesis is not very strong! In general, the Bayesian approach is more careful than the classical approach, as the classical approach often exaggerates the strength of evidence against the hypothesis.

    This is because a low p-value P(Ê | H0) does not necessarily imply a low Bayes factor P(E | H0) : P(E | H1)!
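
    A minimal sketch of this odds computation (the binomial coefficients cancel, but are kept for clarity):

      from math import comb

      prior_odds = 0.2 / 0.8
      likelihood_H0 = comb(50, 26) * (1/3)**26 * (2/3)**24  # unprepared: guessing
      likelihood_H1 = comb(50, 26) * 0.7**26 * 0.3**24      # well prepared

      posterior_odds = prior_odds * likelihood_H0 / likelihood_H1
      print(round(posterior_odds, 4))                         # ~0.2204
      print(round(posterior_odds / (1 + posterior_odds), 4))  # ~0.1806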


  • Introduction to Statistics Example

    Example 4.

    A new treatment is tried out for a disease for which the historical success rate is 35%. The discrete uniform distribution on 0, 0.01, ..., 0.99, and 1 is taken as the prior for the success probability of the new treatment. The experimental design is to make exactly ten observations by treating ten patients. The experimental study yields seven successes and three failures. What is the posterior probability that the new treatment is more effective than the standard treatment? [a]

    [a] D.A. Berry, “Bayesian clinical trials,” Nature Reviews Drug Discovery 5 (2006): 27–36.

    We represent the unknown success probability of the new treatment by the random variable Θ, with the uniform probability distribution

    P0(Θ = θ) = p0(θ) = 1/101 for θ = 0, 0.01, ..., 0.99, 1.

    In order to update this prior distribution given the observed data E, we need to define the so-called likelihood function L(E | θ).


  • Introduction to Statistics Example

    Example 4 solution

    The evidence we are given is the situation of seven successes in the treatment of ten patients, so we have

    L(E | θ) = (10 choose 7) θ^7 (1 − θ)^3 for θ = 0, 0.01, ..., 0.99, 1.

    To find the posterior probability p(θ) = P(Θ = θ | E), we use the standard form of Bayes’ rule:

    P(Θ = θ | E) = L(E | θ) p0(θ) / ∑_θ L(E | θ) p0(θ) for θ = 0, 0.01, ..., 0.99, 1.

    Letting θi = i/100 for 0 ≤ i ≤ 100 and inserting L(E | θ) and p0(θ), we obtain

    p(θi) = θi^7 (1 − θi)^3 / ∑_{k=0}^{100} θk^7 (1 − θk)^3 for i = 0, 1, ..., 100.


  • Introduction to Statistics Example

    Example 4 solution

    In particular, the posterior probability of the new treatment being more effective than the standard treatment is given by

    ∑_{i=36}^{100} p(θi) = 0.9866.

    Incidentally, the posterior probability that the new treatment is not more effective than the standard treatment is ∑_{i=0}^{35} p(θi) = 0.0134.

    This value is not very different from the value of 0.0260 of the excess probability (the p-value) of obtaining seven or more successes in ten trials under the hypothesis that the new treatment causes no difference.

    The p-value is often misinterpreted; however, the Bayesian posterior probability is directly interpretable.

    By assuming a “non-informative” prior distribution such as the uniform distribution, the results of the trials carry essentially all the influence in the posterior distribution.
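
    A Python sketch of the whole computation; the factor (10 choose 7)/101 is common to every term, so it cancels in the normalization:

      from math import comb

      theta = [i / 100 for i in range(101)]

      # Unnormalized posterior weights theta^7 (1 - theta)^3, then normalize.
      w = [t**7 * (1 - t)**3 for t in theta]
      p = [wi / sum(w) for wi in w]

      print(round(sum(p[36:]), 4))  # posterior P(Theta > 0.35) = 0.9866
      print(round(sum(p[:36]), 4))  # complement: 0.0134

      # Classical comparison: p-value of seven or more successes in ten trials
      # under the hypothesis that the success probability is still 0.35.
      p_value = sum(comb(10, k) * 0.35**k * 0.65**(10 - k) for k in range(7, 11))
      print(round(p_value, 4))      # 0.0260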


  • Introduction to Statistics Example

    Example 4 prior distribution


  • Introduction to Statistics Example

    Example 4 posterior distribution


  • Introduction to Statistics Example

    Example 5.

    Two candidates A and B are contesting the election for governor in a given state. The candidate who wins the popular vote becomes governor. A random sample of the voting population is undertaken to find out the preference of the voters. The sample size of the poll is 1,000 and 517 of the polled voters favor candidate A. What can be said about the probability of candidate A winning the election?

    First let’s assume a prior distribution:

    p0(θ) = (θ − 0.29)/4.41 for θ = 0.30, ..., 0.50,
    p0(θ) = (0.71 − θ)/4.41 for θ = 0.51, ..., 0.70.

    Why in the world did I define it this way? What does it look like when you plot this probability distribution?
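
    One quick way to see why the constant is 4.41: it is exactly what makes this triangular prior sum to 1. A check in Python (the 0.476 here anticipates the prior majority probability computed on a later slide):

      from fractions import Fraction

      theta = [Fraction(a, 100) for a in range(30, 71)]
      p0 = {t: ((t - Fraction(29, 100)) if t <= Fraction(50, 100)
                else (Fraction(71, 100) - t)) / Fraction(441, 100)
            for t in theta}

      print(sum(p0.values()))                                     # 1
      print(sum(p0[t] for t in theta if t >= Fraction(51, 100)))  # 10/21 ~ 0.476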


  • Introduction to Statistics Example

    Example 5 solution – Triangular distribution (prior)


  • Introduction to Statistics Example

    Example 5 solution

    Hence the prior probability of candidate A getting the majority of the votes is p0(0.51) + · · · + p0(0.70) = 0.476.

    The evidence E is having 517 of the 1,000 polled voters favor candidate A.

    In light of this information, we wish to find the posterior probability p(0.51) + · · · + p(0.70), where p(θ) is the posterior probability that the fraction of the voting population in favor of candidate A equals θ.

    Using L(E | θ) = (1000 choose 517) θ^517 (1 − θ)^483, this posterior probability is

    p(θ) = θ^517 (1 − θ)^483 p0(θ) / ∑_{a=30}^{70} (a/100)^517 (1 − a/100)^483 p0(a/100).

    Using a CAS, we find the posterior probability of gaining the majority to be

    p(0.51) + · · · + p(0.70) = 0.7632.

    The posterior probability of a tie at the election is p(0.50) = 0.1558.
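
    A sketch of the same computation in Python; exact fractions are used because floating point underflows on products like 0.3^517 × 0.7^483, and the binomial coefficient cancels in the normalization:

      from fractions import Fraction

      def prior(a):  # a = 100*theta, for integer a = 30, ..., 70
          t = Fraction(a, 100)
          num = (t - Fraction(29, 100)) if a <= 50 else (Fraction(71, 100) - t)
          return num / Fraction(441, 100)

      w = {a: Fraction(a, 100)**517 * Fraction(100 - a, 100)**483 * prior(a)
           for a in range(30, 71)}
      total = sum(w.values())
      p = {a: w[a] / total for a in w}

      print(round(float(sum(p[a] for a in range(51, 71))), 4))  # 0.7632
      print(round(float(p[50]), 4))                             # 0.1558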


  • Introduction to Statistics Example

    Example 5 – Posterior distribution

