
Probability Theory (Undergraduate)

Yiqiao Yin
Columbia University

December 11, 2018

Abstract

This document is prepared for students in Probability Theory (Undergraduate) offered at Columbia University in the 2018 Fall semester. The course instructor is Professor Shaw-Hwa Lo, and the grader for this course is Xiaotong. The document serves students by providing lecture notes as well as homework and exam guidance. I am grateful to Professor Shaw-Hwa Lo for providing materials, and I thank Xiaotong for providing comments on this document. I am the TA for this class. Please email me at yy2502@columbia.edu if you have any questions.


Contents

1 Counting Method
  1.1 Introduction
  1.2 Permutation and Combination
2 Axioms of Probability
  2.1 Sample Space and Events
  2.2 Axioms of Probability
3 Conditional Probability and Independence
  3.1 Bayes' Formula
  3.2 Independent Events
4 Random Variables
  4.1 Random Variables
  4.2 Discrete Random Variables
  4.3 Expected Value
  4.4 Expectation of a Function of a Random Variable
  4.5 Variance
  4.6 The Bernoulli and Binomial Random Variables
  4.7 Poisson Random Variable
5 Continuous Random Variables
  5.1 Expectation and Variance of Continuous Random Variables
  5.2 Uniform Random Variable
  5.3 Normal Random Variables
  5.4 Exponential Random Variable
6 Jointly Distributed Random Variables
  6.1 Joint Distribution Functions
  6.2 Independent Random Variables
  6.3 Sums of Independent Random Variables
7 Properties of Expectation
  7.1 Introduction
  7.2 Expectation of Sums of Random Variables
  7.3 Moments of the Number of Events that Occur
  7.4 Covariance, Variance of Sums, and Correlations
  7.5 Conditional Expectation
  7.6 Moment Generating Functions
8 Limit Theorems
  8.1 Introduction
  8.2 Chebyshev's Inequality and the Weak Law of Large Numbers
  8.3 The Central Limit Theorem
  8.4 The Strong Law of Large Numbers
  8.5 Other Inequalities
9 Homework
10 Exam Review
  10.1 1st Midterm
  10.2 2nd Midterm
  10.3 Final Exam


1 Counting Method

1.1 Introduction

Let us start with an example about an experiment with multiple possible outcomes. Suppose an experiment can lead to n possible outcomes:

$a_1, a_2, \ldots, a_n$

If for each outcome $a_i$ there are m possible outcomes of a second experiment, then together there are nm possible outcomes. For example, how many outcomes are possible when tossing a coin twice? We can approach this question from the frequentist point of view, which is objective. One can start with a fair coin. Tossing the coin for the first time, one will observe either a head or a tail. Tossing the coin for the second time, one will again observe a head or a tail. Continuing this experiment of tossing the coin twice, one will observe one of the outcomes HH, HT, TH, TT, using "H" for heads and "T" for tails.

It is not always the case that an experiment can be repeated. We collect previous data, called a prior. As time moves on, we observe new data and use it as new information. We update our prior with this new information and arrive at a more refined analysis, called the posterior. This school of thought, called Bayesian, is usually considered more subjective, because a prior must be chosen before the analysis starts.

Frequentist and Bayesian are the two major schools of thought in the field of statistics. The frequentist school dominated the field in the 60's and 70's. Over the past 20 to 30 years, a good amount of work in the Bayesian approach has emerged.

For this course, we will mostly be dealing with repeatable experiments. We may observe different outcomes, but the experiments we will discuss can be replicated under the same conditions.

1.2 Permutation and Combination

Let us look at the following example. Consider the word "statistics". How many different letter arrangements are there? Separating the letters, we count three s's, three t's, two i's, one a, and one c. There are

$10! = 10 \times 9 \times \cdots \times 2 \times 1$

permutations of the ten letters, but we must account for the fact that switching two identical letters, for example two s's, leaves the word "statistics" unchanged. Hence, we need to divide by

$3!\,3!\,2!\,1!\,1!$

giving $\frac{10!}{3!\,3!\,2!\,1!\,1!} = 50400$ arrangements. In general, we have

$\frac{n!}{n_1! \cdots n_k!}$

different ways to arrange n objects, of which $n_1, n_2, \ldots, n_k$ are alike.

Definition 1.2.1. Suppose now that we have n objects. Reasoning similar to that we have just used for the letters example then shows that there are

$n(n-1)(n-2)\cdots(3)(2)(1) = n!$

different permutations of the n objects.
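As a quick sanity check, the multinomial count above can be computed directly. The following is a minimal Python sketch (the helper name is ours, not from the text) that counts arrangements of a word by dividing n! by the factorial of each letter's multiplicity:

```python
from collections import Counter
from math import factorial

def arrangements(word: str) -> int:
    """Number of distinct letter arrangements of `word`:
    n! divided by the factorial of each letter's multiplicity."""
    n = factorial(len(word))
    for count in Counter(word).values():
        n //= factorial(count)
    return n

print(arrangements("statistics"))  # 50400 = 10!/(3! 3! 2! 1! 1!)
```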


Definition 1.2.2. In general, the same reasoning shows that there are

$\frac{n!}{n_1!\,n_2! \cdots n_r!}$

different permutations of n objects, of which $n_1$ are alike, $n_2$ are alike, ..., etc.

Example 1.2.3. Let us look at another example. How many different groups of 3 can be selected from the 5 items A, B, C, D, and E? In this case, there are $\binom{5}{3} = 10$ different groups of 3.

Definition 1.2.4. We define $\binom{n}{r}$, for r ≤ n, by

$\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$

and say that $\binom{n}{r}$ represents the number of possible combinations of n objects taken r at a time.

Another interesting example is the following.

Example 1.2.5. A class of 20 has 12 boys and 8 girls. How many different groups consisting of 3 boys and 2 girls can be formed? There are $\binom{12}{3}\binom{8}{2} = 220 \times 28 = 6160$ such groups. What if 2 of the boys refuse to be in the same group together? Then we subtract the groups containing both of them, namely $\binom{2}{2}\binom{10}{1}\binom{8}{2} = 280$, leaving 5880 groups, as the sketch below verifies.
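A short Python check of both counts in Example 1.2.5, partly by formula and partly by brute-force enumeration (the enumeration approach is our own illustration):

```python
from itertools import combinations
from math import comb

boys, girls = range(12), range(12, 20)

# By formula: choose 3 of 12 boys and 2 of 8 girls.
print(comb(12, 3) * comb(8, 2))  # 6160

# By enumeration, excluding groups containing both feuding boys (0 and 1).
count = sum(
    1
    for bs in combinations(boys, 3)
    if not (0 in bs and 1 in bs)
    for _ in combinations(girls, 2)
)
print(count)  # 5880
```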

A useful formula is the following:

$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$

and the famous binomial formula is

$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}$

Proof. Consider the following:

$(x + y)^n = \underbrace{(x+y)(x+y)\cdots(x+y)}_{n \text{ times}}$

If n = 2: $(x+y)^2 = x^2 + xy + yx + y^2$ (each term contributes once). Expanding the product, each term of the form $x^k y^{n-k}$ arises by choosing x from k of the factors and y from the remaining n − k. There are "n choose k", i.e. $\binom{n}{k}$, such choices, which is why the expansion contains the term $\binom{n}{k} x^k y^{n-k}$.

Example 1.2.6. Ten balls marked 1 to 10 are put into 3 bags A, B, and C, with 3 in A, 3 in B, and 4 in C. How many ways are there? The bags are all assumed to be distinct. There are 10! possible orderings of all 10 balls, but rearranging the 3 balls within the first bag, the 3 within the second, or the 4 within the third does not change the assignment. Thus, the answer is

$\frac{10!}{3!\,3!\,4!} = 4200$

Let us now formally introduce the binomial theorem.

Theorem 1.2.7 (IMPORTANT). The binomial theorem states that

$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}$

Proof. When n = 1, we have

$x + y = \binom{1}{0} x^0 y^1 + \binom{1}{1} x^1 y^0 = y + x$

Assume the result holds for n − 1. Now, consider

$(x+y)^n = (x+y)(x+y)^{n-1} = (x+y) \sum_{k=0}^{n-1} \binom{n-1}{k} x^k y^{n-1-k} = \sum_{k=0}^{n-1} \binom{n-1}{k} x^{k+1} y^{n-1-k} + \sum_{k=0}^{n-1} \binom{n-1}{k} x^k y^{n-k}$

Letting i = k + 1 in the first sum and i = k in the second, we have

$(x+y)^n = \sum_{i=1}^{n} \binom{n-1}{i-1} x^i y^{n-i} + \sum_{i=0}^{n-1} \binom{n-1}{i} x^i y^{n-i} = x^n + \sum_{i=1}^{n-1} \left[ \binom{n-1}{i-1} + \binom{n-1}{i} \right] x^i y^{n-i} + y^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}$

and we are done.

Let us introduce the following propositions that may be helpful in this topic.

Proposition 1.2.8. There are $\binom{n-1}{r-1}$ distinct positive integer-valued vectors $(x_1, \ldots, x_r)$ satisfying the equation

$x_1 + x_2 + \cdots + x_r = n$, with $x_i > 0$ for $i = 1, \ldots, r$

Proposition 1.2.9. There are $\binom{n+r-1}{r-1}$ distinct nonnegative integer-valued vectors $(x_1, \ldots, x_r)$ satisfying

$x_1 + x_2 + \cdots + x_r = n$

Example 1.2.10. How many distinct nonnegative integer-valued solutions of $x_1 + x_2 = 3$ are possible?

Answer. There are $\binom{3+2-1}{2-1} = 4$ such solutions: (0,3), (1,2), (2,1), (3,0).


2 Axioms of Probability

This section introduces the concept of the probability of an event and then shows how probabilities can be computed in certain situations.

2.1 Sample Space and Events

The sample space is the set of all possible outcomes of an experiment. An event is a subset of the sample space, i.e., a set of outcomes. For example, tossing a fair coin once can result in a head or a tail, and an event can be {head} or {tail}. Tossing two fair coins together, the sample space consists of HH, HT, TH, TT.

Example 2.1.1. Two draws are made from a box with 3 balls, call them G, Y, and B.

1. Consider all possible arrangements with replacement. We have 3 × 3 = 9 outcomes. That is,

S = {(G,G), (G,Y), (G,B), (Y,G), (Y,Y), (Y,B), (B,G), (B,Y), (B,B)}

which can be arranged in the following matrix (rows index the first draw, columns the second):

1st\2nd | G     | Y     | B
G       | (G,G) | (G,Y) | (G,B)
Y       | (Y,G) | (Y,Y) | (Y,B)
B       | (B,G) | (B,Y) | (B,B)

2. Same question without replacement. Then we have 3 × 2 = 6 outcomes. In matrix form, we do not count the diagonal, because without replacement, once G is drawn it cannot appear a second time.

Definition 2.1.2. Event E and event F are said to be mutually exclusive if E ∩ F = ∅, that is, there is no element that simultaneously exists in E and F.

The operations of forming unions, intersections, and complements of events obey certain rules similar to the rules of algebra.

Proposition 2.1.3. We have the following rules:

1. Commutative laws: E ∪ F = F ∪ E, and EF = FE
2. Associative laws: (E ∪ F) ∪ G = E ∪ (F ∪ G), and (EF)G = E(FG)
3. Distributive laws: (E ∪ F)G = EG ∪ FG, and EF ∪ G = (E ∪ G)(F ∪ G)

Theorem 2.1.4. DeMorgan's laws state that

$\left( \bigcup_{i=1}^{n} E_i \right)^c = \bigcap_{i=1}^{n} E_i^c \qquad \left( \bigcap_{i=1}^{n} E_i \right)^c = \bigcup_{i=1}^{n} E_i^c$

Example 2.1.5. A famous special case is to consider events E and F; by DeMorgan's laws, we have

$(E \cup F)^c = E^c F^c \quad \text{and} \quad (EF)^c = E^c \cup F^c$


Proposition 2.1.6. Let us introduce the following proposition, called the inclusion-exclusion identity:

$P(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i=1}^{n} P(E_i) - \sum_{i_1 < i_2} P(E_{i_1} E_{i_2}) + \cdots + (-1)^{r+1} \sum_{i_1 < i_2 < \cdots < i_r} P(E_{i_1} E_{i_2} \cdots E_{i_r}) + \cdots + (-1)^{n+1} P(E_1 E_2 \cdots E_n)$

Example 2.1.7. This problem is from the text [1], page 34. An urn contains n balls, one of which is special. If k of these balls are withdrawn one at a time, with each selection being equally likely to be any of the balls that remain at the time, what is the probability that the special ball is chosen?

Solution. Since all of the balls are treated in an identical manner, it follows that the set of k balls selected is equally likely to be any of the $\binom{n}{k}$ sets of k balls. Therefore,

$P(\text{special ball is selected}) = \frac{\binom{1}{1}\binom{n-1}{k-1}}{\binom{n}{k}} = \frac{k}{n}$

We could also have obtained this result by letting $A_i$ denote the event that the special ball is the ith ball to be chosen, i = 1, ..., k. Then, since each one of the n balls is equally likely to be the ith ball chosen, it follows that $P(A_i) = 1/n$. Hence, because these events are clearly mutually exclusive, we have

$P(\text{special ball is selected}) = P\left( \bigcup_{i=1}^{k} A_i \right) = \sum_{i=1}^{k} P(A_i) = \frac{k}{n}$

We could also have argued that $P(A_i) = 1/n$ by noting that there are $n(n-1)\cdots(n-k+1) = n!/(n-k)!$ equally likely outcomes of the experiment, of which $(n-1)!/(n-k)!$ result in the special ball being the ith one chosen. From this reasoning, it follows that

$P(A_i) = \frac{(n-1)!/(n-k)!}{n!/(n-k)!} = \frac{1}{n}$
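The identity $P(\text{special ball selected}) = k/n$ can also be checked by simulation. A minimal Python sketch (our own, under the assumptions of Example 2.1.7):

```python
import random
from math import comb

n, k, trials = 10, 4, 200_000
hits = 0
for _ in range(trials):
    # Ball 0 is the special one; draw k balls without replacement.
    if 0 in random.sample(range(n), k):
        hits += 1

print(hits / trials)                                  # ~ 0.4
print(comb(1, 1) * comb(n - 1, k - 1) / comb(n, k))   # exactly k/n = 0.4
```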

Let us introduce some simple properties (notes from class).

1. If E ⊂ F, then P(E) ≤ P(F), since $F = E \cup (E^c \cap F)$ implies P(F) ≥ P(E).
2. P(E ∪ F) = P(E) + P(F) − P(EF); also $P(E \cup F) = P(E) + P(E^c F)$, where $P(E^c F) = P(F) - P(EF)$.
3. The inclusion-exclusion identity:

$P\left( \bigcup_{i=1}^{n} E_i \right) = \sum_{i=1}^{n} P(E_i) - \sum_{i<j} P(E_i E_j) + \cdots + (-1)^{k+1} \sum_{i_1 < i_2 < \cdots < i_k} P(E_{i_1} E_{i_2} \cdots E_{i_k}) + \cdots + (-1)^{n+1} P(E_1 E_2 \cdots E_n)$

Consider a sample space with equal chances for all outcomes. If S is such a sample space with |S| = n, then for each outcome w ∈ S we have P(w) = 1/n. For any event E ⊂ S with |E| = m, where m ≤ n, we have

$P(E) = \frac{|E|}{|S|} = \frac{m}{n}$

2.2 Axioms of Probability

Consider an experiment whose sample space is S. For each event E of the sample space S, we assume that a number P(E) is defined and satisfies the following three axioms.

Proposition 2.2.1. The three axioms of probability:

1. Axiom 1: 0 ≤ P(E) ≤ 1
2. Axiom 2: P(S) = 1
3. Axiom 3: For any sequence of mutually exclusive events $E_1, E_2, \ldots$ (that is, events for which $E_i E_j = \emptyset$ when $i \neq j$),

$P\left( \bigcup_{i=1}^{\infty} E_i \right) = \sum_{i=1}^{\infty} P(E_i)$

We refer to P(E) as the probability of the event E.

Example 2.2.2. If our experiment consists of tossing a coin and if we assume that a head is as likely to appear as a tail, then we have

$P(H) = \frac{1}{2} \quad \text{and} \quad P(T) = \frac{1}{2}$

However, if the coin were biased and we believed that a head were twice as likely to appear as a tail, we would have

$P(H) = \frac{2}{3} \quad \text{and} \quad P(T) = \frac{1}{3}$

Let us elaborate on the experiment a little in the following example.

Example 2.2.3. If a die is rolled and we suppose that all six sides are equally likely to appear, then we have P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. From Axiom 3, we can compute the probability of rolling an even number to be

$P(\{2, 4, 6\}) = P(2) + P(4) + P(6) = \frac{1}{2}$

Let us introduce more properties.

Proposition 2.2.4. $P(E^c) = 1 - P(E)$

Proposition 2.2.5. If E ⊂ F, then P(E) ≤ P(F).

Proposition 2.2.6. $P(E \cup F) = P(E) + P(F) - P(EF)$

Proposition 2.2.7.

$P(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i=1}^{n} P(E_i) - \sum_{i_1 < i_2} P(E_{i_1} E_{i_2}) + \cdots + (-1)^{r+1} \sum_{i_1 < \cdots < i_r} P(E_{i_1} E_{i_2} \cdots E_{i_r}) + \cdots + (-1)^{n+1} P(E_1 E_2 \cdots E_n)$

Example 2.2.8. A committee of 5 is to be selected from a group of 6 men and 9 women. If the selection is made randomly, what is the probability that the committee consists of 3 men and 2 women?

Answer. Because each of the $\binom{15}{5}$ possible committees is equally likely, the denominator of the fraction is $\binom{15}{5}$. Then we only need to find the numerator, which is the number of possible choices of the men times the number of possible choices of the women, i.e., $\binom{6}{3}\binom{9}{2}$. Hence, the final answer is

$\frac{\binom{6}{3}\binom{9}{2}}{\binom{15}{5}} = \frac{240}{1001}$
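A one-line Python check of Example 2.2.8 (a sketch using only the standard library):

```python
from fractions import Fraction
from math import comb

p = Fraction(comb(6, 3) * comb(9, 2), comb(15, 5))
print(p)  # 240/1001
```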


3 Conditional Probability and Independence

Suppose we roll two dice, and let E and F denote, respectively, the event that the sum of the dice is 8 and the event that the first die is a 3. The probability that E occurs given that F has occurred is called the conditional probability of E given F. This quantity is denoted by

P(E|F)

A general formula for P(E|F) that is valid for all events E and F is derived as follows: if the event F occurs, then in order for E to occur, it is necessary that the actual outcome be a point in both E and F, that is, in EF.

Definition 3.0.1. If P(F) > 0, then

$P(E|F) = \frac{P(EF)}{P(F)}$

Proposition 3.0.2. The multiplication rule:

$P(E_1 E_2 E_3 \cdots E_n) = P(E_1) P(E_2|E_1) P(E_3|E_1 E_2) \cdots P(E_n|E_1 \cdots E_{n-1})$

Example 3.0.3. Let us discuss an example. Toss a die twice (equivalently, toss two dice). There are 36 outcomes. Assume all outcomes are equally likely (fair dice), so there is a 1/36 chance for each outcome. Suppose that we observe that the first die is a 4. Given this information, what is the chance that the sum of the two dice is no bigger than 7?

In this case, the reduced sample space is (4,1), (4,2), (4,3), (4,4), (4,5), (4,6), of which (4,1), (4,2), (4,3) have sum no bigger than 7. The chance is therefore 3/6 = 1/2.

Suppose instead we observe that at least one of the two dice is a 4; then what is the probability? The answer is 6/11.

Remark 3.0.4. Let E, F be two events with P(F) > 0, and we have

$P(E|F) = \frac{P(EF)}{P(F)}$

Given that F has already occurred, the chance that E occurs is P(E|F).
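Both answers in Example 3.0.3 can be confirmed by enumerating the 36 equally likely outcomes. A small Python sketch:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 rolls

# Condition on the first die being a 4.
first_is_4 = [(a, b) for a, b in outcomes if a == 4]
print(sum(a + b <= 7 for a, b in first_is_4), "/", len(first_is_4))  # 3 / 6

# Condition on at least one die being a 4.
some_4 = [(a, b) for a, b in outcomes if 4 in (a, b)]
print(sum(a + b <= 7 for a, b in some_4), "/", len(some_4))  # 6 / 11
```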

3.1 Bayes' Formula

Let E and F be events. We may express E as

$E = EF \cup EF^c$

for, in order for an outcome to be in E, it must either be in both E and F, or be in E but not in F. As EF and $EF^c$ are clearly mutually exclusive, we have, by Axiom 3,

$P(E) = P(EF) + P(EF^c) = P(E|F)P(F) + P(E|F^c)P(F^c) = P(E|F)P(F) + P(E|F^c)[1 - P(F)]$

Definition 3.1.1. The odds of an event A are defined by

$\frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)}$

That is, the odds of an event A tell how much more likely it is that the event A occurs than that it does not occur. For instance, if P(A) = 2/3, then $P(A) = 2P(A^c)$, so the odds are 2. If the odds are equal to α, then it is common to say that the odds are "α to 1" in favor of the hypothesis.

Consider now a hypothesis H that is true with probability P(H), and suppose that new evidence E is introduced. Then the conditional probabilities, given the evidence E, that H is true and that H is not true are respectively given by

$P(H|E) = \frac{P(E|H)P(H)}{P(E)} \quad \text{and} \quad P(H^c|E) = \frac{P(E|H^c)P(H^c)}{P(E)}$

Therefore, the new odds after the evidence E has been introduced are

$\frac{P(H|E)}{P(H^c|E)} = \frac{P(H)}{P(H^c)} \cdot \frac{P(E|H)}{P(E|H^c)}$

We can further generalize: suppose that $F_1, \ldots, F_n$ are mutually exclusive events such that

$\bigcup_{i=1}^{n} F_i = S$

In other words, exactly one of the events $F_1, \ldots, F_n$ must occur. By writing

$E = \bigcup_{i=1}^{n} EF_i$

and using the fact that the events $EF_i$, for i = 1, ..., n, are mutually exclusive, we obtain

$P(E) = \sum_{i=1}^{n} P(EF_i) = \sum_{i=1}^{n} P(E|F_i)P(F_i)$

Let $F_1, \ldots, F_n$ be a set of mutually exclusive and exhaustive events (meaning that exactly one of these events must occur). Suppose now that E has occurred and we are interested in determining which one of the $F_j$ also occurred. Then we have:

Proposition 3.1.2 (IMPORTANT). We have the following proposition:

$P(F_j|E) = \frac{P(EF_j)}{P(E)} = \frac{P(E|F_j)P(F_j)}{\sum_{i=1}^{n} P(E|F_i)P(F_i)}$

which is known as Bayes' formula.

Example 3.1.3. Let us discuss an example from class. There is a fair deck of cards (a fair deck has 52 cards, and each deal is equally likely). Deal the cards to 4 players (each gets 13 cards), say E, W, N, and S. If North has 6 spades, what is the chance that East has 3 spades?

Use the reduced sample space: given that N has 6 spades and 7 non-spades, players E, W, and S share the other 7 spades among 13 × 3 = 39 cards. East receives 13 of these 39 cards, so the desired probability is

$\frac{\binom{7}{3}\binom{32}{10}}{\binom{39}{13}}$

3.2 Independent Events

From the idea of Bayes' rule, we can discuss the notion of independent events.

Definition 3.2.1. Consider two events E and F. They are independent if the equation

$P(EF) = P(E)P(F)$

holds. If they are not independent, we say they are dependent.

Proposition 3.2.2. If E and F are independent, then so are E and $F^c$.

Definition 3.2.3. Three events E, F, and G are said to be independent if

$P(EFG) = P(E)P(F)P(G)$
$P(EF) = P(E)P(F)$
$P(EG) = P(E)P(G)$
$P(FG) = P(F)P(G)$

Example 3.2.4. A famous example is the Gambler's Ruin problem. Please see [1], page 84.

Proposition 3.2.5. We have the following properties:

1. 0 ≤ P(E|F) ≤ 1
2. P(S|F) = 1
3. If $E_i$, for i = 1, 2, ..., are mutually exclusive events, then

$P\left( \bigcup_{i=1}^{\infty} E_i \,\Big|\, F \right) = \sum_{i=1}^{\infty} P(E_i|F)$

Let us discuss a birth problem, as it relates to many problems in conditional probability.

Example 3.2.6. A female chimp gave birth, and it is not certain which of two male chimps is the father. Before genetic analysis, it is believed that the probability that male number 1 is the father is p and the probability that male number 2 is the father is 1 − p. DNA obtained from the mother, male number 1, and male number 2 indicates that at one specific location of the genome, the mother has the gene pair (A,A), male number 1 has the gene pair (a,a), and male number 2 has the gene pair (A,a). If a DNA test shows that the baby chimp has the gene pair (A,a), what is the probability that male number 1 is the father?

Answer. Let $M_i$ be the event that male number i is the father, and let $B_{A,a}$ be the event that the baby chimp has the gene pair (A,a). Then $P(M_1|B_{A,a})$ is obtained as follows:

$P(M_1|B_{A,a}) = \frac{P(M_1 B_{A,a})}{P(B_{A,a})} = \frac{P(B_{A,a}|M_1)P(M_1)}{P(B_{A,a}|M_1)P(M_1) + P(B_{A,a}|M_2)P(M_2)} = \frac{1 \cdot p}{1 \cdot p + (1/2)(1-p)} = \frac{2p}{1+p}$

Now let us compare this result with p.

Figure 1: The graph of $\frac{2p}{1+p} - p$.

Hence, for 0 < p < 1 we arrive at the inequality

$\frac{2p}{1+p} > p$

We conclude that the information that the baby's gene pair is (A,a) increases the probability that male number 1 is the father.


4 Random Variables

In this section we discuss random variables. We will start with the definition of a random variable and begin our discussion with discrete random variables. We then discuss the expectation and variance (the first moment and the second central moment) of random variables. Afterwards, we will move on to discuss Bernoulli and binomial random variables (and the Poisson) as special case studies.

4.1 Random Variables

When we perform an experiment, we are oftentimes interested in some function of the outcome. This lets us generalize the situation and reason about what will occur in future experiments.

Example 4.1.1. Consider an experiment of tossing 3 fair coins. Let Y denote the number of heads. Then Y is a random variable taking one of the values 0, 1, 2, and 3, each with a certain probability:

P(Y = 0) = 1/8
P(Y = 1) = 3/8
P(Y = 2) = 3/8
P(Y = 3) = 1/8

We notice that since Y must be one of the values 0 through 3, we must have

$1 = P\left( \bigcup_{i=0}^{3} \{Y = i\} \right) = \sum_{i=0}^{3} P(Y = i)$

Example 4.1.2. Consider another example. Four balls are to be randomly selected, without replacement, from an urn that contains 20 balls numbered 1 through 20. (Up to here, we know there are $\binom{20}{4}$ possible outcomes.) Let X be the largest-numbered ball selected; then X is a random variable that takes on one of the values 4, 5, ..., 20. The probability that X takes on each of its possible values is

$P(X = i) = \frac{\binom{i-1}{3}}{\binom{20}{4}}, \quad \text{for } i = 4, \ldots, 20$

Suppose we want to determine P(X > 10). One way is to compute

$P(X > 10) = \sum_{i=11}^{20} P(X = i) = \sum_{i=11}^{20} \frac{\binom{i-1}{3}}{\binom{20}{4}}$

We can, alternatively, compute the probability of the complement of the above event and subtract it from 100%. We omit the computation here; one can refer to the text [1], page 113.
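As a sketch, the two routes to P(X > 10) in Example 4.1.2 can be compared numerically in Python (our own computation, not from the text):

```python
from math import comb

total = comb(20, 4)

# Direct sum over the upper tail.
direct = sum(comb(i - 1, 3) for i in range(11, 21)) / total

# Complement: all four balls drawn from {1, ..., 10}.
via_complement = 1 - comb(10, 4) / total

print(direct, via_complement)  # both ~ 0.9567
```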

All of these are motivating examples showing why it is useful to introduce the notion of a random variable instead of paying attention to a single event.


4.2 Discrete Random Variables

A random variable that can take on at most a countable number of possible values is said to be discrete. For a discrete random variable X, we define the probability mass function p(a) of X by

$p(a) = P(X = a)$

The probability mass function p(a) is positive for at most a countable number of values of a. That is, if X must assume one of the values $x_1, x_2, \ldots$, then

$p(x_i) \ge 0$ for $i = 1, 2, \ldots$, and $p(x) = 0$ for all other values of x.

Since X must take on one of the values $x_i$, we have

$\sum_{i=1}^{\infty} p(x_i) = 1$

Example 4.2.1. The probability mass function of a random variable X is given by $p(i) = c\lambda^i/i!$, i = 0, 1, 2, ..., where λ is some positive value. Find (a) P(X = 0) and (b) P(X > 2).

Answer. Since $\sum_{i=0}^{\infty} p(i) = 1$, we have

$c \sum_{i=0}^{\infty} \frac{\lambda^i}{i!} = 1$

which, using $e^x = \sum_{i=0}^{\infty} x^i/i!$, implies that

$c e^{\lambda} = 1 \quad \text{or} \quad c = e^{-\lambda}$

Thus, we have

1. $P(X = 0) = e^{-\lambda}\lambda^0/0! = e^{-\lambda}$
2. $P(X > 2) = 1 - P(X \le 2) = 1 - e^{-\lambda} - \lambda e^{-\lambda} - \frac{\lambda^2 e^{-\lambda}}{2}$

The cumulative distribution function F can be expressed in terms of p(a) by

$F(a) = \sum_{x \le a} p(x)$

If X is a discrete random variable whose possible values are $x_1, x_2, x_3, \ldots$, where $x_1 < x_2 < x_3 < \cdots$, then the distribution function F of X is a step function. That is, the value of F is constant in the intervals $(x_{i-1}, x_i)$ and then takes a step (or jump) of size $p(x_i)$ at $x_i$. For instance, if X has a probability mass function given by

$p(1) = \frac{1}{4}, \quad p(2) = \frac{1}{2}, \quad p(3) = \frac{1}{8}, \quad p(4) = \frac{1}{8}$

then its cumulative distribution function is

$F(a) = \begin{cases} 0 & a < 1 \\ \frac{1}{4} & 1 \le a < 2 \\ \frac{3}{4} & 2 \le a < 3 \\ \frac{7}{8} & 3 \le a < 4 \\ 1 & 4 \le a \end{cases}$
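The step-function shape of a discrete CDF is easy to see in code. Below is a minimal Python sketch (our own helper) that builds F from the probability mass function above:

```python
pmf = {1: 1/4, 2: 1/2, 3: 1/8, 4: 1/8}

def cdf(a: float) -> float:
    """F(a) = sum of p(x) over all x <= a."""
    return sum(p for x, p in pmf.items() if x <= a)

for a in [0.5, 1, 1.9, 2, 3.5, 4]:
    print(f"F({a}) = {cdf(a)}")  # 0, 0.25, 0.25, 0.75, 0.875, 1.0
```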


4.3 Expected Value

One of the most important concepts in probability theory is the expectation of a random variable. If X is a discrete random variable having probability mass function p(x), then the expectation, or expected value, of X, denoted by E[X], is defined by

$E[X] = \sum_{x: p(x) > 0} x\,p(x)$

The expected value of X is a weighted average of the possible values that X can take on, each value being weighted by the probability that X assumes it.

Example 4.3.1. Find E[X], where X is the outcome when we roll a fair die.

Answer. Since p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6, we obtain

$E[X] = 1\left(\frac{1}{6}\right) + 2\left(\frac{1}{6}\right) + 3\left(\frac{1}{6}\right) + 4\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 6\left(\frac{1}{6}\right) = \frac{7}{2}$

4.4 Expectation of a Function of a Random Variable

Suppose that we are given a discrete random variable along with its probability mass function, and we want to compute the expected value of some function of X, say g(X). We can determine E[g(X)] by using the definition of expected value.

Example 4.4.1. Let X denote a random variable that takes on any of the values −1, 0, and 1 with respective probabilities

$P(X = -1) = 0.2, \quad P(X = 0) = 0.5, \quad P(X = 1) = 0.3$

Compute $E[X^2]$.

Answer. Let $Y = X^2$. Then the probability mass function of Y is given by

$P(Y = 1) = P(X = -1) + P(X = 1) = 0.5$
$P(Y = 0) = P(X = 0) = 0.5$

Hence,

$E[X^2] = E[Y] = 1(0.5) + 0(0.5) = 0.5$
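A tiny Python sketch of the computation in Example 4.4.1 (the helper name is ours):

```python
pmf = {-1: 0.2, 0: 0.5, 1: 0.3}

def expect(g, pmf):
    """E[g(X)] = sum of g(x) * p(x) over the support."""
    return sum(g(x) * p for x, p in pmf.items())

print(expect(lambda x: x * x, pmf))  # E[X^2] = 0.5
print(expect(lambda x: x, pmf))      # E[X]   = 0.1
```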

Proposition 4.4.2. If X is a discrete random variable that takes on one of the values $x_i$, i ≥ 1, with respective probabilities $p(x_i)$, then, for any real-valued function g,

$E[g(X)] = \sum_{i} g(x_i) p(x_i)$

Proof. Please see the text [1], page 122, for the proof.

Proposition 4.4.3. If a and b are constants, then

$E[aX + b] = aE[X] + b$


Proof. We compute

$E[aX + b] = \sum_{x: p(x)>0} (ax + b) p(x) = a \sum_{x: p(x)>0} x\,p(x) + b \sum_{x: p(x)>0} p(x) = aE[X] + b$

The expected value of a random variable X, E[X], is also referred to as the mean or the first moment of X. The quantity $E[X^n]$, n ≥ 1, is called the nth moment of X. By Proposition 4.4.2, we note that

$E[X^n] = \sum_{x: p(x)>0} x^n p(x)$

4.5 Variance

Besides expectation, it is also important to measure variation.

Definition 4.5.1. If X is a random variable with mean µ, then the variance of X, denoted by Var(X), is defined by

$\text{Var}(X) = E[(X - \mu)^2]$

Alternatively, one can derive

$\text{Var}(X) = E[(X - \mu)^2] = \sum_{x} (x - \mu)^2 p(x) = \sum_{x} (x^2 - 2\mu x + \mu^2) p(x) = \sum_{x} x^2 p(x) - 2\mu \sum_{x} x\,p(x) + \mu^2 \sum_{x} p(x) = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2$

Example 4.5.2. Calculate Var(X) if X represents the outcome when a fair die is rolled.

Answer. We have already found E[X] = 7/2. Now, we find

$E[X^2] = 1^2\left(\frac{1}{6}\right) + 2^2\left(\frac{1}{6}\right) + 3^2\left(\frac{1}{6}\right) + 4^2\left(\frac{1}{6}\right) + 5^2\left(\frac{1}{6}\right) + 6^2\left(\frac{1}{6}\right) = \frac{91}{6}$

and thus we have variance

$\text{Var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{35}{12}$
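A quick numeric confirmation of Examples 4.3.1 and 4.5.2 in Python (our own sketch, using exact fractions):

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())
second_moment = sum(x * x * p for x, p in pmf.items())
var = second_moment - mean**2

print(mean, var)  # 7/2 35/12
```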


A useful identity is that for any constants a and b,

$\text{Var}(aX + b) = a^2\,\text{Var}(X)$

To prove this identity, let µ = E[X] and note that E[aX + b] = aµ + b. Therefore, we have

$\text{Var}(aX + b) = E[(aX + b - a\mu - b)^2] = E[a^2(X - \mu)^2] = a^2 E[(X - \mu)^2] = a^2\,\text{Var}(X)$

Remark 4.5.3. Please note the following.

1. Analogous to the mean being the center of gravity of a distribution of mass, the variance represents, in the terminology of mechanics, the moment of inertia.
2. The square root of Var(X) is called the standard deviation of X, and we denote it by SD(X). That is,

$\text{SD}(X) = \sqrt{\text{Var}(X)}$

Discrete random variables are often classified according to their probability mass functions. Later, we will deal with the probability density function for continuous random variables.

4.6 The Bernoulli and Binomial Random Variables

Suppose that a trial, or an experiment, whose outcome can be classified as either a success or a failure is performed. If we let X = 1 when the outcome is a success and X = 0 when it is a failure, then the probability mass function of X is given by

$p(0) = P(X = 0) = 1 - p \qquad p(1) = P(X = 1) = p$

where p, 0 ≤ p ≤ 1, is the probability that the trial is a success. A random variable X is said to be a Bernoulli random variable if its probability mass function is given by the above equations for some p ∈ (0, 1).

Suppose now that n independent trials, each of which results in a success with probability p or in a failure with probability 1 − p, are to be performed. If X represents the number of successes that occur in the n trials, then X is said to be a binomial random variable with parameters (n, p). Thus, a Bernoulli random variable is just a binomial random variable with parameters (1, p). The probability mass function of a binomial random variable having parameters (n, p) is given by

$p(i) = \binom{n}{i} p^i (1-p)^{n-i} \quad \text{for } i = 0, 1, \ldots, n$

Example 4.6.1. It is known that screws produced by a certain company can be defective with probability 0.01, independently of one another. The company sells the screws in packages of 10 and offers a money-back guarantee that at most 1 of the 10 screws is defective. What proportion of packages sold must the company replace?


Answer. If X is the number of defective screws in a package, then X is a binomial random variable with parameters (10, 0.01). Hence, the probability that a package will have to be replaced is

$1 - P(X = 0) - P(X = 1) = 1 - \binom{10}{0}(0.01)^0(0.99)^{10} - \binom{10}{1}(0.01)^1(0.99)^9 \approx 0.004$

Example 4.6.2. Consider a jury trial in which it takes 8 of the 12 jurors to convict the defendant; that is, in order for the defendant to be convicted, at least 8 of the jurors must vote him guilty. If we assume that jurors act independently and that, whether or not the defendant is guilty, each makes the right decision with probability θ, what is the probability that the jury renders a correct decision?

Answer. The problem, as stated, is incapable of an actual solution without further information. However, we can work out an expression to model this environment. The situation is binary: the defendant is either guilty or not guilty. A correct decision in the former case requires at least 8 right votes, and in the latter case at least 5 (so that fewer than 8 vote guilty). Hence, we have

if he is guilty: $\sum_{i=8}^{12} \binom{12}{i} \theta^i (1-\theta)^{12-i}$

if he is not guilty: $\sum_{i=5}^{12} \binom{12}{i} \theta^i (1-\theta)^{12-i}$

Hence, letting the probability that the defendant is guilty be α, we can write out the expression for rendering a correct decision:

$\alpha \sum_{i=8}^{12} \binom{12}{i} \theta^i (1-\theta)^{12-i} + (1-\alpha) \sum_{i=5}^{12} \binom{12}{i} \theta^i (1-\theta)^{12-i}$

We can examine the properties of a binomial random variable with parameters n and p. To begin, let us compute its expected value and variance. Note that

$E[X^k] = \sum_{i=0}^{n} i^k \binom{n}{i} p^i (1-p)^{n-i} = \sum_{i=1}^{n} i^k \binom{n}{i} p^i (1-p)^{n-i}$

Using the identity

$i \binom{n}{i} = n \binom{n-1}{i-1}$

gives

$E[X^k] = np \sum_{i=1}^{n} i^{k-1} \binom{n-1}{i-1} p^{i-1} (1-p)^{n-i} = np \sum_{j=0}^{n-1} (j+1)^{k-1} \binom{n-1}{j} p^j (1-p)^{n-1-j} \quad (\text{letting } j = i - 1)$

$= np\,E[(Y+1)^{k-1}]$

where Y is a binomial random variable with parameters (n − 1, p). Setting k = 1, we arrive at

$E[X] = np$

which gives us the expected number of successes that occur in n independent trials when each is a success with probability p. Setting k = 2 yields

$E[X^2] = np\,E[Y + 1] = np[(n-1)p + 1]$

Since E[X] = np, we obtain

$E[X] = np \qquad \text{Var}(X) = E[X^2] - (E[X])^2 = np(1-p)$
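These formulas can be verified by summing the binomial pmf directly. A short Python sketch (assuming nothing beyond the standard library):

```python
from math import comb

def binom_pmf(n, p):
    return [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]

n, p = 12, 0.3
pmf = binom_pmf(n, p)
mean = sum(i * q for i, q in enumerate(pmf))
var = sum(i * i * q for i, q in enumerate(pmf)) - mean**2

print(mean, n * p)           # 3.6  3.6
print(var, n * p * (1 - p))  # 2.52 2.52
```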

4.7 Poisson Random Variable

IMPORTANT. A random variable X that takes on one of the values 0, 1, 2, ... is said to be a Poisson random variable with parameter λ if, for some λ > 0,

$p(i) = P(X = i) = e^{-\lambda} \frac{\lambda^i}{i!} \quad \text{for } i = 0, 1, 2, \ldots$

When the Poisson distribution is used to approximate a binomial with parameters (n, p), the parameter is taken to be λ = np. We can check that the mass function sums to one:

$\sum_{i=0}^{\infty} p(i) = e^{-\lambda} \sum_{i=0}^{\infty} \frac{\lambda^i}{i!} = e^{-\lambda} e^{\lambda} = 1$

Some general applications that are well modeled by the Poisson distribution are:

1. The number of misprints on a page (or a group of pages) of a book.
2. The number of people in a community who survive to age 100.
3. The number of wrong telephone numbers that are dialed in a day.
4. The number of packages of dog biscuits sold in a particular store each day.
5. The number of vacancies occurring during a year in the federal judicial system.
6. The number of α-particles discharged in a fixed period of time from some radioactive material.

Example 4.7.1. Suppose the number of typographical errors on a single page of this book has a Poisson distribution with parameter λ = 1/2. Calculate the probability that there is at least one error on your page.

Answer. Letting X denote the number of errors on the page, we have

$P(X \ge 1) = 1 - P(X = 0) = 1 - e^{-1/2} \approx 0.393$

Example 4.7.2. Suppose that the probability that an item produced by a certain machine will be defective is 0.1. Find the probability that a sample of 10 items will contain at most 1 defective item.

Answer. The desired probability is

$\binom{10}{0}(0.1)^0(0.9)^{10} + \binom{10}{1}(0.1)^1(0.9)^9 \approx 0.7361$

while the Poisson approximation with λ = np = 1 yields the similar value $e^{-1} + e^{-1} \approx 0.7358$.
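A small Python comparison of the exact binomial answer and its Poisson approximation for Example 4.7.2:

```python
from math import comb, exp, factorial

n, p = 10, 0.1
lam = n * p

binom = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(2))
poisson = sum(exp(-lam) * lam**i / factorial(i) for i in range(2))

print(binom, poisson)  # ~0.7361  ~0.7358
```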


5 Continuous Random Variables

The previous chapter discussed discrete random variables, i.e., random variables whose set of possible values is either finite or countably infinite. There also exist random variables whose set of possible values is uncountable. Let X be such a random variable. We say that X is a continuous random variable if there exists a nonnegative function f, defined for all real x ∈ (−∞, ∞), having the property that for any set B of real numbers,

$P(X \in B) = \int_B f(x)\,dx$

The function f is called the probability density function of the random variable X.

In words, the above equation states that the probability that X will be in B may be obtained by integrating the probability density function over the set B. Since X must assume some value, f must satisfy

$1 = P(X \in (-\infty, \infty)) = \int_{-\infty}^{\infty} f(x)\,dx$

All probability statements about X can be answered in terms of f.

Example 5.0.1. Letting B = [a, b], we obtain

$P(a \le X \le b) = \int_a^b f(x)\,dx$

Example 5.0.2 (IMPORTANT). Suppose X is a continuous random variable whose probability density function is

$f(x) = \begin{cases} C(4x - 2x^2) & 0 < x < 2 \\ 0 & \text{else} \end{cases}$

1. What is the value of C?
2. Find P(X > 1).

Answer. We have the following:

1. Since f is a probability density function, we must have $\int_{-\infty}^{\infty} f(x)\,dx = 1$, so we solve $C \int_0^2 (4x - 2x^2)\,dx = 1$. After integrating, we have $C\left(2x^2 - \frac{2x^3}{3}\right)\Big|_{x=0}^{2} = 1$, which gives $C = \frac{3}{8}$.
2. $P(X > 1) = \int_1^{\infty} f(x)\,dx = \frac{3}{8} \int_1^2 (4x - 2x^2)\,dx = \frac{1}{2}$

Example 5.0.3. The amount of time, in hours, that a computer functions before breaking down is a continuous random variable with probability density function given by

$f(x) = \begin{cases} \lambda e^{-x/100} & x \ge 0 \\ 0 & x < 0 \end{cases}$

What is the probability that

1. a computer will function between 50 and 150 hours before breaking down?
2. it will function for fewer than 100 hours?

Answer. We solve the parts accordingly:

1. Since $1 = \int_{-\infty}^{\infty} f(x)\,dx = \lambda \int_0^{\infty} e^{-x/100}\,dx$, integrating gives $1 = -\lambda(100)e^{-x/100}\big|_0^{\infty} = 100\lambda$, so $\lambda = \frac{1}{100}$. Then we can proceed to find the probability

$P(50 < X < 150) = \int_{50}^{150} \frac{1}{100} e^{-x/100}\,dx = -e^{-x/100}\Big|_{50}^{150} = e^{-1/2} - e^{-3/2} \approx 0.383$

2. I will leave this part to you as an exercise.

5.1 Expectation and Variance of Continuous Random Variables

In the discrete setting, we defined the expected value of a discrete random variable X by

$E[X] = \sum_x x\,P(X = x)$

If X is a continuous random variable having probability density function f(x), then because

$f(x)\,dx \approx P(x \le X \le x + dx) \quad \text{for small } dx$

it is easy to see that the analogous definition is to define the expected value of X by

$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$

Example 5.1.1. Find E[X] when the density function of X is

$f(x) = \begin{cases} 2x & \text{if } 0 \le x \le 1 \\ 0 & \text{else} \end{cases}$

Answer. Solve the following:

$E[X] = \int x f(x)\,dx = \int_0^1 2x^2\,dx = \frac{2}{3}$

Proposition 5.1.2. If X is a continuous random variable with probability density function f(x), then, for any real-valued function g,

$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$

Example 5.1.3. The density function of X is given by

$f(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{else} \end{cases}$

Find $E[e^X]$.

Answer. Let $Y = e^X$. We start by determining $F_Y$, the cumulative distribution function of Y. For 1 ≤ x ≤ e,

$F_Y(x) = P(Y \le x) = P(e^X \le x) = P(X \le \log(x)) = \int_0^{\log(x)} f(y)\,dy = \log(x)$

By differentiating $F_Y(x)$, we conclude that the probability density function of Y is given by

$f_Y(x) = \frac{1}{x} \quad \text{for } 1 \le x \le e$

Hence,

$E[e^X] = E[Y] = \int_{-\infty}^{\infty} x f_Y(x)\,dx = \int_1^e dx = e - 1$

Lemma 5.1.4. For a nonnegative random variable Y,

$E[Y] = \int_0^{\infty} P(Y > y)\,dy$

Lemma 5.1.5. If a and b are constants, then

$E[aX + b] = aE[X] + b$

The variance of a continuous random variable is defined exactly as it is for a discrete random variable: if X is a random variable with expected value µ, then the variance of X is defined (for any type of random variable) by

$\text{Var}(X) = E[(X - \mu)^2]$

The alternative formula

$\text{Var}(X) = E[X^2] - (E[X])^2$

also holds.

Example 5.1.6. Recall the example above, with

$f(x) = \begin{cases} 2x & \text{if } 0 \le x \le 1 \\ 0 & \text{else} \end{cases}$

Find Var(X) for this random variable X.


Answer. First, we compute the second moment $E[X^2]$:

$E[X^2] = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_0^1 2x^3\,dx = \frac{1}{2}$

Hence, we obtain

$\text{Var}(X) = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}$

Note that the property $\text{Var}(aX + b) = a^2\,\text{Var}(X)$ also holds for continuous random variables.

5.2 Uniform Random Variable

A random variable is said to be uniformly distributed over the interval (0, 1) if its probability density function is given by

$f(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{else} \end{cases}$

Since this is a density function, the following properties hold: (1) f(x) ≥ 0 and (2) $\int f(x)\,dx = 1$.

In general, we say that X is a uniform random variable on the interval (α, β) if the probability density function of X is given by

$f(x) = \begin{cases} \frac{1}{\beta - \alpha} & \text{if } \alpha < x < \beta \\ 0 & \text{else} \end{cases}$

Since $F(a) = \int_{-\infty}^{a} f(x)\,dx$, it follows that

$F(a) = \begin{cases} 0 & a \le \alpha \\ \frac{a - \alpha}{\beta - \alpha} & \text{if } \alpha < a < \beta \\ 1 & a \ge \beta \end{cases}$

Example 5.2.1. Let X be uniformly distributed over (α, β). Find (a) E[X] and (b) Var(X).

Answer. We proceed accordingly:

1. Compute

$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{\alpha}^{\beta} \frac{x}{\beta - \alpha}\,dx = \frac{\beta^2 - \alpha^2}{2(\beta - \alpha)} = \frac{\beta + \alpha}{2}$

2. To find Var(X), first calculate $E[X^2]$:

$E[X^2] = \int_{\alpha}^{\beta} \frac{1}{\beta - \alpha} x^2\,dx = \frac{\beta^3 - \alpha^3}{3(\beta - \alpha)} = \frac{\beta^2 + \alpha\beta + \alpha^2}{3}$

Hence,

$\text{Var}(X) = \frac{\beta^2 + \alpha\beta + \alpha^2}{3} - \frac{(\alpha + \beta)^2}{4} = \frac{(\beta - \alpha)^2}{12}$

Example 5.2.2. If X is uniformly distributed over (0, 10), calculate the probability that X < 3.

Answer. Compute $P(X < 3) = \int_0^3 \frac{1}{10}\,dx = \frac{3}{10}$.

Example 5.2.3 (IMPORTANT). Buses arrive at a specified stop at 15-minute intervals starting at 7 AM. That is, they arrive at 7, 7:15, 7:30, 7:45, and so on. If a passenger arrives at the stop at a time that is uniformly distributed between 7 and 7:30, find the probability that he waits

1. less than 5 minutes for a bus;
2. more than 10 minutes for a bus.

Answer. Let X be the number of minutes past 7 at which the passenger arrives, so X is uniform on (0, 30). We proceed accordingly:

1. The passenger waits less than 5 minutes if he arrives within 5 minutes of the next bus, so we compute

$P(10 < X < 15) + P(25 < X < 30) = \int_{10}^{15} \frac{1}{30}\,dx + \int_{25}^{30} \frac{1}{30}\,dx = \frac{1}{3}$

2. The passenger waits more than 10 minutes if he arrives more than 10 minutes before the next bus, so we compute

$P(0 < X < 5) + P(15 < X < 20) = \frac{1}{3}$

5.3 Normal Random Variables

We say that X is a normal random variable, or simply that X is normally distributed, with parameters µ and $\sigma^2$ if the density of X is given by

$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/2\sigma^2}$

for −∞ < x < ∞. The density function is a bell-shaped curve that is symmetric about µ.

Example 5.3.1. Find E[X] and Var(X) when X is a normal random variable with parameters µ and $\sigma^2$.


Answer. Let us start by finding the mean and variance of the standard normal random variable Z = (X − µ)/σ. We have

$E[Z] = \int_{-\infty}^{\infty} x f_Z(x)\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x e^{-x^2/2}\,dx = -\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\Big|_{-\infty}^{\infty} = 0$

Thus, integrating by parts with $u = x$ and $dv = x e^{-x^2/2}\,dx$,

$\text{Var}(Z) = E[Z^2] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \left( -x e^{-x^2/2}\Big|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \right) = 1$

since $\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = 1$. Because X = µ + σZ, the preceding yields

$E[X] = \mu + \sigma E[Z] = \mu$

and

$\text{Var}(X) = \sigma^2\,\text{Var}(Z) = \sigma^2$

Conventionally, we denote the cumulative distribution function of a standard normal random variable by Φ(x). That is,

$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\,dy$

and values of Φ(x) are tabulated.

Figure 2: Area Φ(x), from page 190 in [1].

Example 5.3.2 (IMPORTANT). If X is a normal random variable with parameters µ = 3 and $\sigma^2 = 9$, find (a) P(2 < X < 5), (b) P(X > 0), and (c) P(|X − 3| > 6).

Answer. We proceed accordingly:

1. Compute

$P(2 < X < 5) = P\left( \frac{2-3}{3} < \frac{X-3}{3} < \frac{5-3}{3} \right) = \Phi\left(\frac{2}{3}\right) - \Phi\left(-\frac{1}{3}\right) \approx 0.3779$

2. Compute

$P(X > 0) = P\left( \frac{X-3}{3} > \frac{0-3}{3} \right) = P(Z > -1) = \Phi(1) \approx 0.8413$

3. Compute

$P(|X - 3| > 6) = P(X > 9) + P(X < -3) = P\left( \frac{X-3}{3} > \frac{9-3}{3} \right) + P\left( \frac{X-3}{3} < \frac{-3-3}{3} \right) = P(Z > 2) + P(Z < -2) \approx 0.0456$
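These three probabilities are easy to check with scipy's normal CDF (a sketch, assuming scipy is available; the same could be done with math.erf):

```python
from scipy.stats import norm

X = norm(loc=3, scale=3)  # mu = 3, sigma = 3

print(X.cdf(5) - X.cdf(2))         # ~0.378
print(1 - X.cdf(0))                # ~0.8413
print((1 - X.cdf(9)) + X.cdf(-3))  # ~0.0455
```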

Example 5.3.3. An expert witness in a paternity suit testifies that the length (in days) of human gestation is approximately normally distributed with parameters µ = 270 and $\sigma^2 = 100$. The defendant in the suit is able to prove that he was out of the country during a period that began 290 days before the birth of the child and ended 240 days before the birth. If the defendant was, in fact, the father of the child, what is the probability that the mother could have had the very long or very short gestation indicated by the testimony?

Answer. Let X denote the length of the gestation, and assume that the defendant is the father. Then the probability that the birth could occur within the indicated period is

$P(X > 290 \text{ or } X < 240) = P(X > 290) + P(X < 240) = P\left( \frac{X - 270}{10} > 2 \right) + P\left( \frac{X - 270}{10} < -3 \right) = 1 - \Phi(2) + 1 - \Phi(3) \approx 0.0241$

Remark 5.3.4. Please be aware that a problem can ask about "or" instead of "and". In that case, the properties we learned from set theory apply: you should check the intersection between the two events accordingly.


Example 5.3.5. If X, the gain from an investment, is a normal random variable with mean µ and variance $\sigma^2$, then, because the loss is equal to the negative of the gain, the value at risk (VAR) of such an investment is the value ν such that

$0.01 = P(-X > \nu)$

We compute the following:

$0.01 = P\left( \frac{-X + \mu}{\sigma} > \frac{\nu + \mu}{\sigma} \right) = 1 - \Phi\left( \frac{\nu + \mu}{\sigma} \right)$

From the table we know Φ(2.33) = 0.99, so $\frac{\nu + \mu}{\sigma} = 2.33$; that is, ν = VAR = 2.33σ − µ. Consequently, among a set of investments all of whose gains are normally distributed, the investment having the smallest VAR is the one having the largest value of µ − 2.33σ.

Theorem 5.3.6 (The DeMoivre–Laplace Theorem). If $S_n$ denotes the number of successes that occur when n independent trials, each resulting in a success with probability p, are performed, then, for any a < b,

$P\left( a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b \right) \to \Phi(b) - \Phi(a)$

as n → ∞.

Example 5.3.7. Let X be the number of times that a fair coin that is flipped 40 times lands on heads. Find the probability that X = 20. Use the normal approximation and then compare it with the exact solution.

Answer. To employ the normal approximation, note that because the binomial is a discrete integer-valued random variable, whereas the normal is a continuous random variable, it is best to write P(X = i) as P(i − 1/2 < X < i + 1/2) before applying the normal approximation (this is called the continuity correction). Hence, with np = 20 and np(1 − p) = 10, we compute

$P(X = 20) = P(19.5 < X < 20.5) = P\left( \frac{19.5 - 20}{\sqrt{10}} < \frac{X - 20}{\sqrt{10}} < \frac{20.5 - 20}{\sqrt{10}} \right) \approx P(-0.16 < Z < 0.16) = \Phi(0.16) - \Phi(-0.16) \approx 0.1272$
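A Python sketch comparing the exact binomial value with the continuity-corrected normal approximation (using math.erf for Φ):

```python
from math import comb, erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

exact = comb(40, 20) * 0.5**40
mu, sd = 20, sqrt(10)
approx = phi((20.5 - mu) / sd) - phi((19.5 - mu) / sd)

print(exact, approx)  # ~0.1254  ~0.1256 (the 0.1272 above uses table rounding)
```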

5.4 Exponential Random Variable

A continuous random variable whose probability density function is given, for some λ > 0, by

$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$

is said to be an exponential random variable with parameter λ. The cumulative distribution function F(a) of an exponential random variable is given by

$F(a) = P(X \le a) = \int_0^a \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\Big|_0^a = 1 - e^{-\lambda a} \quad \text{for } a \ge 0$

Note that $F(\infty) = \int_0^{\infty} \lambda e^{-\lambda x}\,dx = 1$.

Example 5.4.1 (IMPORTANT). Let X be an exponential random variable with parameter λ. Calculate (a) E[X] and (b) Var(X).

Answer. We solve the parts accordingly:

1. We use $E[X^n] = \int_0^{\infty} x^n \lambda e^{-\lambda x}\,dx$. Integrating by parts (with $dv = \lambda e^{-\lambda x}\,dx$ and $u = x^n$) yields

$E[X^n] = -x^n e^{-\lambda x}\Big|_0^{\infty} + \int_0^{\infty} e^{-\lambda x} n x^{n-1}\,dx = 0 + \frac{n}{\lambda} \int_0^{\infty} \lambda e^{-\lambda x} x^{n-1}\,dx = \frac{n}{\lambda} E[X^{n-1}]$

Letting n = 1 and n = 2 gives us

$E[X] = \frac{1}{\lambda} \quad \text{and} \quad E[X^2] = \frac{2}{\lambda} E[X] = \frac{2}{\lambda^2}$

2. We have variance

$\text{Var}(X) = \frac{2}{\lambda^2} - \left(\frac{1}{\lambda}\right)^2 = \frac{1}{\lambda^2}$

Example 5.4.2. Suppose that the number of miles that a car can run before its battery wears out is exponentially distributed with an average value of 10,000 miles. If a person desires to take a 5000-mile trip, what is the probability that he or she will be able to complete the trip without having to replace the car battery? What can be said when the distribution is not exponential? (Measuring in thousands of miles, the parameter is λ = 1/10.)

Answer. By the memoryless property of the exponential distribution, the remaining lifetime of the battery is exponential with parameter λ regardless of how long the battery has already been in use. The desired probability is therefore

$P(\text{remaining lifetime} > 5) = 1 - F(5) = e^{-5\lambda} = e^{-1/2} \approx 0.607$

However, if the lifetime distribution F is not exponential, then the relevant probability is

$P(\text{lifetime} > t + 5 \mid \text{lifetime} > t) = \frac{1 - F(t + 5)}{1 - F(t)}$

where t is the number of miles that the battery had been in use prior to the start of the trip. Therefore, if the distribution is not exponential, additional information is needed (namely, the value of t) before the desired probability can be calculated.


6 Jointly Distributed Random Variables

6.1 Joint Distribution Functions

The sections above dealt with probability distributions for a single random variable. However, we are often interested in probability statements concerning two or more random variables. In order to deal with such probabilities, we define, for any two random variables X and Y, the joint cumulative probability distribution function of X and Y by

$F(a, b) = P(X \le a, Y \le b) \quad \text{for } -\infty < a, b < \infty$

The distribution of X can be obtained from the joint distribution of X and Y as follows:

$F_X(a) = P(X \le a) = P(X \le a, Y < \infty) = P\left( \lim_{b \to \infty} \{X \le a, Y \le b\} \right) = \lim_{b \to \infty} P(X \le a, Y \le b) = \lim_{b \to \infty} F(a, b) = F(a, \infty)$

Note that in the preceding set of equalities, we have once again made use of the fact that probability is a continuous set function. Similarly, the cumulative distribution function of Y is given by

$F_Y(b) = P(Y \le b) = \lim_{a \to \infty} F(a, b) = F(\infty, b)$

In the case when X and Y are both discrete random variables, it is convenient to define the joint probability mass function of X and Y by

$p(x, y) = P(X = x, Y = y)$

The probability mass function of X can be obtained from p(x, y) by

$p_X(x) = P(X = x) = \sum_{y: p(x,y) > 0} p(x, y)$

Similarly, we have

$p_Y(y) = \sum_{x: p(x,y) > 0} p(x, y)$

We say that X and Y are jointly continuous if there exists a function f(x, y), defined for all real x and y, having the property that for every set C of pairs of real numbers (that is, C is a set in the two-dimensional plane),

$P((X, Y) \in C) = \iint_{(x,y) \in C} f(x, y)\,dx\,dy$

Example 6.1.1 (IMPORTANT). The joint density function of X and Y is given by

$f(x, y) = \begin{cases} 2e^{-x}e^{-2y} & \text{if } 0 < x < \infty,\ 0 < y < \infty \\ 0 & \text{otherwise} \end{cases}$

Compute (a) P(X > 1, Y < 1), (b) P(X < Y), and (c) P(X < a).

Answer. Please refer to the following:

(a) Compute

$P(X > 1, Y < 1) = \int_0^1 \int_1^{\infty} 2e^{-x}e^{-2y}\,dx\,dy = \int_0^1 2e^{-2y}\left( -e^{-x}\Big|_1^{\infty} \right)dy = e^{-1} \int_0^1 2e^{-2y}\,dy = e^{-1}(1 - e^{-2})$

(b) Compute

$P(X < Y) = \iint_{(x,y): x < y} 2e^{-x}e^{-2y}\,dx\,dy = \int_0^{\infty} 2e^{-2y}(1 - e^{-y})\,dy = \int_0^{\infty} 2e^{-2y}\,dy - \int_0^{\infty} 2e^{-3y}\,dy = 1 - \frac{2}{3} = \frac{1}{3}$

(c) Compute

$P(X < a) = \int_0^a \int_0^{\infty} 2e^{-2y}e^{-x}\,dy\,dx = \int_0^a e^{-x}\,dx = 1 - e^{-a}$

Example 6.1.2. The joint density of X and Y is given by

$f(x, y) = \begin{cases} e^{-(x+y)} & \text{if } 0 < x < \infty,\ 0 < y < \infty \\ 0 & \text{otherwise} \end{cases}$

Find the density function of the random variable X/Y.


Answer. Start by computing the distribution function of X/Y. For a > 0,

$F_{X/Y}(a) = P\left( \frac{X}{Y} \le a \right) = \iint_{x/y \le a} e^{-(x+y)}\,dx\,dy = \int_0^{\infty} \int_0^{ay} e^{-(x+y)}\,dx\,dy = \int_0^{\infty} (1 - e^{-ay}) e^{-y}\,dy = \left[ -e^{-y} + \frac{e^{-(a+1)y}}{a+1} \right]_0^{\infty} = 1 - \frac{1}{a+1}$

Differentiation shows that the density function of X/Y is given by $f_{X/Y}(a) = 1/(a+1)^2$ for 0 < a < ∞.

6.2 Independent Random Variables

The random variables X and Y are said to be independent if, for any two sets of real numbers A and B,

$P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$

In other words, X and Y are independent if, for all A and B, the events $E_A = \{X \in A\}$ and $F_B = \{Y \in B\}$ are independent.

It can be shown by using the three axioms of probability that the above equation will follow if and only if, for all a, b,

$P(X \le a, Y \le b) = P(X \le a)P(Y \le b)$

Hence, in terms of the joint distribution function F of X and Y, X and Y are independent if

$F(a, b) = F_X(a)F_Y(b) \quad \text{for all } a, b$

Proposition 6.2.1. The continuous (discrete) random variables X and Y are independent if and only if their joint probability density (mass) function can be expressed as

$f_{X,Y}(x, y) = h(x)g(y), \quad -\infty < x < \infty,\ -\infty < y < \infty$

Answer. Let us give the proof in the continuous case. First, note that independence implies that the joint density is the product of the marginal densities of X and Y, so the preceding factorization will hold when the random variables are independent. Now, suppose that

f_{X,Y}(x, y) = h(x)g(y)

Then

1 = ∫_{−∞}^∞ ∫_{−∞}^∞ f_{X,Y}(x, y) dx dy
  = ∫_{−∞}^∞ h(x) dx ∫_{−∞}^∞ g(y) dy
  = C_1 C_2


where C_1 = ∫_{−∞}^∞ h(x) dx and C_2 = ∫_{−∞}^∞ g(y) dy. Also,

f_X(x) = ∫_{−∞}^∞ f_{X,Y}(x, y) dy = C_2 h(x)
f_Y(y) = ∫_{−∞}^∞ f_{X,Y}(x, y) dx = C_1 g(y)

Since C_1 C_2 = 1, it follows that

f_{X,Y}(x, y) = f_X(x)f_Y(y)

Example 6.2.2. IMPORTANT Let X, Y, Z be independent and uniformly distributed over (0, 1). Compute P(X ≥ YZ).

Answer. Since

f_{X,Y,Z}(x, y, z) = f_X(x)f_Y(y)f_Z(z) = 1, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, 0 ≤ z ≤ 1

we have

P(X ≥ YZ) = ∫∫∫_{x ≥ yz} f_{X,Y,Z}(x, y, z) dx dy dz
          = ∫_0^1 ∫_0^1 ∫_{yz}^1 dx dy dz
          = ∫_0^1 ∫_0^1 (1 − yz) dy dz
          = ∫_0^1 (1 − z/2) dz
          = 3/4
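A one-line Monte Carlo check of this answer (an addition to the notes):

# P(X >= Y*Z) for independent Uniform(0,1) draws; should be close to 3/4
set.seed(1)
n <- 1e6
mean(runif(n) >= runif(n) * runif(n))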

6.3 Sums of Independent Random Variables

It is often important to be able to calculate the distribution of X + Y from the distributions of X and Y when X and Y are independent. Suppose that X and Y are independent, continuous random variables having probability density functions f_X and


f_Y. The cumulative distribution function of X + Y is obtained as follows:

F_{X+Y}(a) = P(X + Y ≤ a)
           = ∫∫_{x+y ≤ a} f_X(x)f_Y(y) dx dy
           = ∫_{−∞}^∞ ∫_{−∞}^{a−y} f_X(x)f_Y(y) dx dy
           = ∫_{−∞}^∞ [∫_{−∞}^{a−y} f_X(x) dx] f_Y(y) dy
           = ∫_{−∞}^∞ F_X(a − y)f_Y(y) dy

The cumulative distribution function F_{X+Y} is called the convolution of the distributions F_X and F_Y (the cumulative distribution functions of X and Y, respectively).

By differentiating the above equation, we find that the probability density function f_{X+Y} of X + Y is given by

f_{X+Y}(a) = d/da ∫_{−∞}^∞ F_X(a − y)f_Y(y) dy
           = ∫_{−∞}^∞ d/da [F_X(a − y)] f_Y(y) dy
           = ∫_{−∞}^∞ f_X(a − y)f_Y(y) dy

Let us explore the relationship between two particular random variables. Recall that a gamma random variable has a density of the form

f(y) = λe^{−λy}(λy)^{t−1} / Γ(t), 0 < y < ∞

An important property of this family of distributions is that, for a fixed value of λ, it is closed under convolutions.

Proposition 6.3.1. If X and Y are independent gamma random variables with respective parameters (s, λ) and (t, λ), then X + Y is a gamma random variable with parameters (s + t, λ).

Proof.

f_{X+Y}(a) = 1/(Γ(s)Γ(t)) ∫_0^a λe^{−λ(a−y)}[λ(a − y)]^{s−1} λe^{−λy}(λy)^{t−1} dy
           = Ke^{−λa} ∫_0^a (a − y)^{s−1} y^{t−1} dy
           = Ke^{−λa} a^{s+t−1} ∫_0^1 (1 − x)^{s−1} x^{t−1} dx, by letting x = y/a
           = Ce^{−λa} a^{s+t−1}

where C is a constant that does not depend on a. But, as the preceding is a density function and thus must integrate to 1, the value of C is determined, and we have

f_{X+Y}(a) = λe^{−λa}(λa)^{s+t−1} / Γ(s + t)

Hence, the result is proved.
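To see the closure property numerically, the following sketch (an addition to the notes, with illustrative parameters s = 2, t = 3, λ = 1.5) compares quantiles of a sum of independent gamma draws with those of the claimed Gamma(s + t, λ) distribution:

# Sum of independent Gamma(s, lambda) and Gamma(t, lambda) vs Gamma(s + t, lambda)
set.seed(1)
n <- 1e5; s <- 2; t <- 3; lambda <- 1.5
z <- rgamma(n, shape = s, rate = lambda) + rgamma(n, shape = t, rate = lambda)
probs <- c(0.1, 0.5, 0.9)
rbind(empirical   = quantile(z, probs),
      theoretical = qgamma(probs, shape = s + t, rate = lambda))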


Proposition 6.3.2. If X_i, i = 1, ..., n, are independent random variables that are normally distributed with respective parameters µ_i, σ_i^2, i = 1, ..., n, then ∑_{i=1}^n X_i is normally distributed with parameters ∑_{i=1}^n µ_i and ∑_{i=1}^n σ_i^2.

Proposition 6.3.3. If X and Y are independent Poisson random variables with respective parameters λ_1 and λ_2, then X + Y is Poisson with parameter λ_1 + λ_2. (A derivation using moment generating functions is given in Example 7.6.4 below.)
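A simulation sketch of this fact (an addition to the notes, with illustrative means λ_1 = 2 and λ_2 = 3) compares the empirical distribution of X + Y with the Poisson(λ_1 + λ_2) mass function:

# X + Y for independent Poissons, compared with Poisson(lambda1 + lambda2)
set.seed(1)
n <- 1e6; lambda1 <- 2; lambda2 <- 3
z <- rpois(n, lambda1) + rpois(n, lambda2)
emp <- as.numeric(table(factor(z, levels = 0:10))) / n
round(rbind(empirical = emp, theoretical = dpois(0:10, lambda1 + lambda2)), 4)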


7 Properties of Expectation

Go back to Table of Contents. Please click TOC

7.1 Introduction

In this section we develop and exploit additional properties of expected values. Recall that the expected value of the random variable X is given by

E[X] = ∑_x x p(x)

when X is a discrete random variable with probability mass function p(x), and by

E[X] = ∫_{−∞}^∞ x f(x) dx

when X is a continuous random variable with probability density function f(x).

7.2 Expectation of Sums of Random Variables

Let us begin by introducing one of the most important properties of the expectation of random variables.

Proposition 7.2.1. If X and Y have a joint probability mass function p(x, y), then

E[g(X, Y)] = ∑_y ∑_x g(x, y)p(x, y)

If X and Y have a joint probability density function f(x, y), then

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y)f(x, y) dx dy

Let us prove the above property.

Proof. Suppose we have random variables X and Y that are jointly continuous with joint density function f(x, y), and suppose first that g(X, Y) is a nonnegative random variable. Because g(X, Y) ≥ 0, we have

E[g(X, Y)] = ∫_0^∞ P(g(X, Y) > t) dt

We can write

P(g(X, Y) > t) = ∫∫_{(x,y): g(x,y)>t} f(x, y) dy dx

which shows that

E[g(X, Y)] = ∫_0^∞ ∫∫_{(x,y): g(x,y)>t} f(x, y) dy dx dt

Interchanging the order of integration gives

E[g(X, Y)] = ∫_x ∫_y ∫_{t=0}^{g(x,y)} f(x, y) dt dy dx
           = ∫_x ∫_y g(x, y)f(x, y) dy dx

Thus, the result is proven when g(X, Y) is a nonnegative random variable.


Such a property can be used in the following application.

Example 7.2.2. IMPORTANT An accident occurs at a point X that is uniformly distributed on a road of length L. At the time of the accident, an ambulance is at a location Y that is also uniformly distributed on the road. Assuming that X and Y are independent, find the expected distance between the ambulance and the point of the accident.

Answer. We want to compute E[|X − Y|]. The joint density function of X and Y is

f(x, y) = 1/L^2, 0 < x < L, 0 < y < L

and it follows from the property above that

E[|X − Y|] = (1/L^2) ∫_0^L ∫_0^L |x − y| dy dx

Now we can do the math:

∫_0^L |x − y| dy = ∫_0^x (x − y) dy + ∫_x^L (y − x) dy
                 = x^2/2 + L^2/2 − x^2/2 − x(L − x)
                 = L^2/2 + x^2 − xL

Therefore,

E[|X − Y|] = (1/L^2) ∫_0^L (L^2/2 + x^2 − xL) dx = L/3
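The answer E[|X − Y|] = L/3 is easy to confirm by simulation (an addition to the notes; L = 10 is an arbitrary illustrative length):

# Expected distance between two independent Uniform(0, L) points
set.seed(1)
n <- 1e6; L <- 10
mean(abs(runif(n, 0, L) - runif(n, 0, L)))   # close to L/3 = 3.33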

An important application of the above property is the following. Suppose E[X] and E[Y] are both finite and let g(X, Y) = X + Y. Then, in the continuous case,

E[X + Y] = ∫_{−∞}^∞ ∫_{−∞}^∞ (x + y)f(x, y) dx dy
         = ∫_{−∞}^∞ ∫_{−∞}^∞ x f(x, y) dy dx + ∫_{−∞}^∞ ∫_{−∞}^∞ y f(x, y) dx dy
         = E[X] + E[Y]

The same result holds in general; thus, whenever E[X] and E[Y] are finite,

E[X + Y] = E[X] + E[Y]

Example 7.2.3. Let X_1, ..., X_n be independent and identically distributed random variables having distribution function F and expected value µ. Such a sequence of random variables is said to constitute a sample from the distribution F. The quantity

X̄ = ∑_{i=1}^n X_i / n

is called the sample mean. Compute E[X̄].


Answer. Compute

E[X̄] = E[∑_{i=1}^n X_i / n]
     = (1/n) E[∑_{i=1}^n X_i]
     = (1/n) ∑_{i=1}^n E[X_i]
     = µ, since E[X_i] ≡ µ

We conclude that the expected value of the sample mean is µ, the mean of the distribution. When the distribution mean µ is unknown, the sample mean is often used in statistics to estimate it.

7.3 Moments of the Number of Events that Occur

Let us look at an example.

Example 7.3.1. Suppose that there are N distinct types of coupons and that, independently of past types collected, each new one obtained is type j with probability p_j, where ∑_{j=1}^N p_j = 1. Find the expected value and variance of the number of different types of coupons that appear among the first n collected.

Answer. We will find it more convenient to work with the number of uncollected types. Let Y equal the number of types of coupons collected, and let X = N − Y denote the number of uncollected types. With A_i defined as the event that there are no type i coupons in the collection, X is equal to the number of the events A_1, ..., A_N that occur. Because the types of the successive coupons collected are independent, and, with probability 1 − p_i each new coupon is not type i, we have

P(A_i) = (1 − p_i)^n

Hence, E[X] = ∑_{i=1}^N (1 − p_i)^n, from which it follows that

E[Y] = N − E[X] = N − ∑_{i=1}^N (1 − p_i)^n

Similarly, because each of the n coupons collected is neither a type i nor a type j coupon with probability 1 − p_i − p_j, we have

P(A_i A_j) = (1 − p_i − p_j)^n, i ≠ j

Thus,

E[X(X − 1)] = 2 ∑_{i<j} P(A_i A_j) = 2 ∑_{i<j} (1 − p_i − p_j)^n

or

E[X^2] = 2 ∑_{i<j} (1 − p_i − p_j)^n + E[X]


Hence, we obtain

var(Y) = var(X)
       = E[X^2] − (E[X])^2
       = 2 ∑_{i<j} (1 − p_i − p_j)^n + ∑_{i=1}^N (1 − p_i)^n − (∑_{i=1}^N (1 − p_i)^n)^2
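The formulas for E[Y] and var(Y) can be checked by simulation. The following sketch (an addition to the notes) uses the equal-probability special case p_j = 1/N, with illustrative values N = 10 and n = 20:

# Number of distinct coupon types among the first n collected
set.seed(1)
N <- 10; n <- 20; reps <- 1e4
p <- rep(1 / N, N)
y <- replicate(reps, length(unique(sample(N, n, replace = TRUE, prob = p))))

# Formulas from Example 7.3.1
EY   <- N - sum((1 - p)^n)
pij  <- outer(p, p, "+")                   # p_i + p_j for every pair (i, j)
S    <- sum((1 - pij[upper.tri(pij)])^n)   # sum over i < j of (1 - p_i - p_j)^n
VarY <- 2 * S + sum((1 - p)^n) - sum((1 - p)^n)^2

c(sim.mean = mean(y), formula.mean = EY)
c(sim.var  = var(y),  formula.var  = VarY)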

7.4 Covariance, Variance of Sums, and Correlations

The following proposition shows that the expectation of a product of independent random variables is equal to the product of their expectations.

Proposition 7.4.1. If X and Y are independent, then, for any functions h and g,

E[g(X)h(Y)] = E[g(X)]E[h(Y)]

Answer. Suppose that X and Y are jointly continuous with joint density f(x, y). Then we have

E[g(X)h(Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x)h(y)f(x, y) dx dy
            = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x)h(y)f_X(x)f_Y(y) dx dy
            = ∫_{−∞}^∞ h(y)f_Y(y) dy ∫_{−∞}^∞ g(x)f_X(x) dx
            = E[h(Y)]E[g(X)]

Definition 7.4.2. IMPORTANT The covariance between X and Y, denoted by cov(X, Y), is defined by

cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Upon expanding the right side of the preceding definition, we see that

cov(X, Y) = E[XY − E[X]Y − XE[Y] + E[Y]E[X]]
          = E[XY] − E[X]E[Y] − E[X]E[Y] + E[X]E[Y]
          = E[XY] − E[X]E[Y]

Proposition 7.4.3. Covariance satisfies the following properties:
• cov(X, Y) = cov(Y, X)
• cov(X, X) = var(X)
• cov(aX, Y) = a cov(X, Y)
• cov(∑_{i=1}^n X_i, ∑_{j=1}^m Y_j) = ∑_{i=1}^n ∑_{j=1}^m cov(X_i, Y_j)
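These properties can be illustrated numerically; the analogous identities hold exactly for sample covariances, as the following sketch (an addition to the notes) shows:

# Illustrating the covariance properties with simulated data
set.seed(1)
x <- rnorm(1e5)
y <- 0.5 * x + rnorm(1e5)
all.equal(cov(x, y), cov(y, x))          # symmetry
all.equal(cov(x, x), var(x))             # cov(X, X) = var(X)
all.equal(cov(3 * x, y), 3 * cov(x, y))  # cov(aX, Y) = a cov(X, Y)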


7.5 Conditional Expectation

If X and Y are jointly discrete random variables, then the conditional probability mass function of X, given that Y = y, is defined for all y such that P(Y = y) > 0, by

p_{X|Y}(x|y) = P(X = x|Y = y) = p(x, y)/p_Y(y)

It is therefore natural to define, in this case, the conditional expectation of X given that Y = y, for all values of y such that p_Y(y) > 0, by

E[X|Y = y] = ∑_x x P(X = x|Y = y) = ∑_x x p_{X|Y}(x|y)
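To make the definition concrete, the following sketch (an addition to the notes, using a small made-up joint mass function) computes E[X|Y = y] directly from a table of p(x, y):

# E[X | Y = y] from a discrete joint pmf p(x, y); the table values are illustrative
p <- matrix(c(0.10, 0.20,
              0.15, 0.25,
              0.05, 0.25),
            nrow = 3, byrow = TRUE,
            dimnames = list(x = 1:3, y = 1:2))
pY <- colSums(p)                   # marginal mass function of Y
condE <- colSums((1:3) * p) / pY   # E[X | Y = y] = sum_x x p(x, y) / p_Y(y)
condE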

7.6 Moment Generating Functions

IMPORTANT The moment generating function M(t) of the random variable X is defined for all real values of t by

M(t) = E[e^{tX}] = ∑_x e^{tx} p(x) if X is discrete with mass function p(x)
M(t) = E[e^{tX}] = ∫_{−∞}^∞ e^{tx} f(x) dx if X is continuous with density f(x)

We call M(t) the moment generating function because all of the moments of X can be obtained by successively differentiating M(t) and then evaluating the result at t = 0. For example,

M′(t) = d/dt E[e^{tX}] = E[d/dt (e^{tX})] = E[Xe^{tX}]

where we have assumed that the interchange of the differentiation and expectation operators is legitimate. That is, we have assumed that

d/dt [∑_x e^{tx} p(x)] = ∑_x d/dt [e^{tx} p(x)]

in the discrete case and

d/dt [∫ e^{tx} f(x) dx] = ∫ d/dt [e^{tx} f(x)] dx

in the continuous case. This assumption can almost always be justified and, indeed, is valid for all of the distributions considered in this book. Hence, evaluating the first derivative of the moment generating function at t = 0, we obtain

M′(0) = E[X]


Similarly,

M″(t) = d/dt M′(t) = d/dt E[Xe^{tX}] = E[d/dt (Xe^{tX})] = E[X^2 e^{tX}]

Thus, we have

M″(0) = E[X^2]

In general, the nth derivative of M(t) is given by

M^{(n)}(t) = E[X^n e^{tX}], n ≥ 1

implying that

M^{(n)}(0) = E[X^n], n ≥ 1

Example 7.6.1. IMPORTANT If X is a binomial random variable with parameters n and p, then

M(t) = E[e^{tX}]
     = ∑_{k=0}^n e^{tk} C(n, k) p^k (1 − p)^{n−k}
     = ∑_{k=0}^n C(n, k) (pe^t)^k (1 − p)^{n−k}
     = (pe^t + 1 − p)^n

where the last equality follows from the binomial theorem. Differentiation yields

M′(t) = n(pe^t + 1 − p)^{n−1} pe^t

Thus, we have

E[X] = M′(0) = np

Differentiating a second time yields

M″(t) = n(n − 1)(pe^t + 1 − p)^{n−2}(pe^t)^2 + n(pe^t + 1 − p)^{n−1} pe^t

so

E[X^2] = M″(0) = n(n − 1)p^2 + np

The variance of X is given by

var(X) = E[X^2] − (E[X])^2 = n(n − 1)p^2 + np − n^2 p^2 = np(1 − p)
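Since M′(0) and M″(0) give the first two moments, we can also recover them numerically with finite differences (a sketch added to the notes, with illustrative values n = 10 and p = 0.3):

# Numerical derivatives of the binomial mgf M(t) = (p e^t + 1 - p)^n at t = 0
n <- 10; p <- 0.3; h <- 1e-5
M <- function(t) (p * exp(t) + 1 - p)^n
M1 <- (M(h) - M(-h)) / (2 * h)           # central difference for M'(0)
M2 <- (M(h) - 2 * M(0) + M(-h)) / h^2    # central difference for M''(0)
c(numeric = M1, exact = n * p)                       # E[X] = np
c(numeric = M2, exact = n * (n - 1) * p^2 + n * p)   # E[X^2]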


Example 7.6.2. IMPORTANT If X is a Poisson random variable with parameter λ, then

M(t) = E[e^{tX}]
     = ∑_{n=0}^∞ e^{tn} e^{−λ} λ^n / n!
     = e^{−λ} ∑_{n=0}^∞ (λe^t)^n / n!
     = e^{−λ} e^{λe^t}
     = exp(λ(e^t − 1))

Differentiating yields

M′(t) = λe^t exp(λ(e^t − 1))
M″(t) = (λe^t)^2 exp(λ(e^t − 1)) + λe^t exp(λ(e^t − 1))

Thus,

E[X] = M′(0) = λ
E[X^2] = M″(0) = λ^2 + λ
var(X) = E[X^2] − (E[X])^2 = λ

Hence, both the mean and the variance of the Poisson random variable equal λ.

Example 7.6.3. Let us find the first and second moments of the exponential distribution with parameter λ.

M(t) = E[e^{tX}]
     = ∫_0^∞ e^{tx} λe^{−λx} dx
     = λ/(λ − t) for t < λ

We note from this derivation that for the exponential distribution, M(t) is defined only for values of t less than λ. Differentiation of M(t) yields

M′(t) = λ/(λ − t)^2, M″(t) = 2λ/(λ − t)^3

Hence,

E[X] = M′(0) = 1/λ, E[X^2] = M″(0) = 2/λ^2

The variance of X is given by

var(X) = E[X^2] − (E[X])^2 = 1/λ^2


Example 7.6.4. IMPORTANT Calculate the distribution of X + Y when X and Y are independent Poisson random variables with means λ_1 and λ_2, respectively.

Answer. We compute the following:

M_{X+Y}(t) = M_X(t)M_Y(t)
           = exp(λ_1(e^t − 1)) exp(λ_2(e^t − 1))
           = exp((λ_1 + λ_2)(e^t − 1))

Hence, X + Y is Poisson distributed with mean λ_1 + λ_2.

It is also possible to define the joint moment generating function of two or more random variables. This is done as follows: for any n random variables X_1, ..., X_n, the joint moment generating function, M(t_1, ..., t_n), is defined for all real values of t_1, ..., t_n by

M(t_1, ..., t_n) = E[e^{t_1 X_1 + ··· + t_n X_n}]

The individual moment generating functions can be obtained from M(t_1, ..., t_n) by letting all but one of the t_j's be 0. That is,

M_{X_i}(t) = E[e^{tX_i}] = M(0, ..., 0, t, 0, ..., 0)

where the t is in the ith place.

It can be proven that the joint moment generating function M(t_1, ..., t_n) uniquely determines the joint distribution of X_1, ..., X_n. This result can then be used to prove that the n random variables X_1, ..., X_n are independent if and only if

M(t_1, ..., t_n) = M_{X_1}(t_1) ··· M_{X_n}(t_n)

For the proof in one direction, if the n random variables are independent, then

M(t_1, ..., t_n) = E[e^{t_1 X_1 + ··· + t_n X_n}]
                 = E[e^{t_1 X_1} ··· e^{t_n X_n}]
                 = E[e^{t_1 X_1}] ··· E[e^{t_n X_n}], by independence
                 = M_{X_1}(t_1) ··· M_{X_n}(t_n)


8 Limit Theorems

Go back to Table of Contents. Please click TOC

8.1 Introduction

The most important theoretical results in probability theory are limit theorems. Of these, the most important are those classified either under the heading of laws of large numbers or under the heading of central limit theorems. Usually, theorems are considered to be laws of large numbers if they are concerned with stating conditions under which the average of a sequence of random variables converges (in some sense) to the expected average.

8.2 Chebyshev's Inequality and the Weak Law of Large Numbers

Let us start with Markov's Inequality. IMPORTANT

Proposition 8.2.1. If X is a random variable that takes only nonnegative values, then for any value a > 0,

P(X ≥ a) ≤ E[X]/a

Proof. For a > 0, let

I = 1 if X ≥ a; 0 otherwise

and note that, since X ≥ 0, we have I ≤ X/a. Taking expectations of the preceding inequality yields

E[I] ≤ E[X]/a

which, because E[I] = P(X ≥ a), proves the result.

Please see the following example.

# Package
library(quantmod)

# Get Data
getSymbols('FB')
data <- FB
head(data); tail(data)
plot(data[, 4], main = "Chart: Stock Price ($)")

# Define Return (daily return from closing prices)
head(data[, 4]); head(lag(data[, 4]))
return <- data[, 4] / lag(data[, 4]) - 1
summary(return)
plot(return, main = "Chart: Return of Stock Price")
hist(return, breaks = 100, main = "Histogram of Returns")

# Markov's Inequality with a = 0.02
a <- 0.02
p <- mean(na.omit(ifelse(return > a, 1, 0))); p
expected.value <- mean(na.omit(return)); expected.value
expected.value / a

# Summarize in table
Summary <- cbind(
  Probability.of.Event = p,
  Expectation.over.Arbitrary.Value = expected.value / a
); Summary

# Define Function
Markov.Inequality <- function(a = 0.1) {

  # Get Data
  getSymbols('FB')
  data <- FB
  plot(data[, 4], main = "Chart: Stock Price ($)")

  # Define Return
  return <- data[, 4] / lag(data[, 4]) - 1
  plot(return, main = "Chart: Return of Stock Price")
  hist(return, breaks = 100, main = "Histogram of Returns")

  # Markov's Inequality
  p <- mean(na.omit(ifelse(return > a, 1, 0)))
  expected.value <- mean(na.omit(return))

  # Summarize in table
  Summary <- cbind(
    Probability.of.Event = p,
    Expectation.over.Arbitrary.Value = expected.value / a
  )

  # Output
  return(Summary)
}

# Run
lapply(c(0.01, 0.05, 0.1, 0.15, 0.2), Markov.Inequality)

However, this will not give us the correct answer. Who can spot the problem? Please review the following.

Example 8.2.2.

# The first version does not satisfy the inequality
# Can anybody spot the mistake?
# Ans: Markov's inequality applies only to nonnegative random variables,
# but daily returns can be negative; below we restrict attention to the
# positive returns before applying the inequality.

# Define Function
Markov.Inequality <- function(a = 0.1) {

  # Get Data
  getSymbols('AAPL')
  data <- AAPL
  plot(data[, 4], main = "Chart: Stock Price ($)")

  # Define Return
  return <- data[, 4] / lag(data[, 4]) - 1
  plot(return, main = "Chart: Return of Stock Price")
  hist(return, breaks = 100, main = "Histogram of Returns")

  # Markov's Inequality, applied to the positive returns only
  number.of.pos.obs <- sum(na.omit(ifelse(return > 0, 1, 0)))
  number.of.event <- sum(na.omit(ifelse(return > a, 1, 0)))
  p <- number.of.event / number.of.pos.obs
  expected.value <- mean(na.omit(return[ifelse(return > 0, 1, 0) == 1, ]))

  # Summarize in table
  Summary <- cbind(
    Value.of.Interest = a,
    Probability.of.Event = p,
    Expectation.over.Arbitrary.Value = expected.value / a
  )

  # Output
  return(Summary)
}

# Run
Report <- matrix(unlist(lapply(c(0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2),
                               Markov.Inequality)), nrow = 3)
Report <- t(Report)
colnames(Report) <- c("Value.of.Interest", "Prob.of.Event", "Exp.Over.Arbi.Value")
Report

Proposition 8.2.3. IMPORTANT Chebyshev's Inequality. If X is a random variable with finite mean µ and variance σ^2, then for any value k > 0,

P(|X − µ| ≥ k) ≤ σ^2/k^2

Proof. Since (X − µ)^2 is a nonnegative random variable, we can apply Markov's Inequality (with a = k^2) to obtain

P((X − µ)^2 ≥ k^2) ≤ E[(X − µ)^2]/k^2

But since (X − µ)^2 ≥ k^2 if and only if |X − µ| ≥ k, the above is equivalent to

P(|X − µ| ≥ k) ≤ E[(X − µ)^2]/k^2 = σ^2/k^2

and we are done.

# Get Data
getSymbols('AAPL')

# Define Function
Chebyshev.Inequality <- function(k = 0.1) {

  # Check: Chebyshev's inequality requires k to be nonnegative
  if (k < 0) {
    return(print(paste(
      "Error Message: Chebyshev Inequality requires k to be nonnegative.",
      "Please check the value of k."
    )))
  } else {

    # Get Data
    data <- AAPL
    plot(data[, 4], main = "Chart: Stock Price ($)")

    # Define Return
    return <- data[, 4] / lag(data[, 4]) - 1
    plot(return, main = "Chart: Return of Stock Price")
    hist(return, breaks = 100, main = "Histogram of Returns")

    # Chebyshev's Inequality
    mu <- mean(na.omit(return))
    p <- mean(na.omit(as.numeric(abs(return - mu) > k)))
    sigma <- var(na.omit(return))

    # Summarize in table
    Summary <- cbind(
      Value.of.Interest = k,
      Probability.of.Event = p,
      Variance.over.Arbitrary.Value.Square = sigma / k^2
    )

    # Output
    return(Summary)
  }
}

# Run
Report <- matrix(unlist(lapply(c(0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2),
                               Chebyshev.Inequality)), nrow = 3)
Report <- t(Report)
colnames(Report) <- c("Value.of.Interest", "Prob.of.Event", "Var.over.Arbi.Value.Sq")
Report

# What happens if k is negative?
Chebyshev.Inequality(-0.1)

Example 8.2.4. If X is uniformly distributed over the interval (0, 10), what can we say, using Chebyshev's Inequality, about the probability that X is at distance greater than 4 from the value 5?

Answer. Let us work this out a step at a time.
• First, we compute the expectation: E(X) = (a + b)/2 = 10/2 = 5;
• Second, we compute the variance: var(X) = (b − a)^2/12 = 100/12 = 25/3;
• Finally, we bound the probability using Chebyshev's Inequality:

P(|X − 5| > 4) ≤ (25/3)/16 ≈ 0.52

Note that the exact probability is P(X < 1) + P(X > 9) = 0.2, so the bound is valid but far from tight.
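Comparing the bound with the exact value in R (an addition to the notes):

# Chebyshev bound vs the exact probability for X ~ Uniform(0, 10)
bound <- (25 / 3) / 16                          # sigma^2 / k^2 with k = 4
exact <- punif(1, 0, 10) + 1 - punif(9, 0, 10)  # P(X < 1) + P(X > 9)
c(bound = bound, exact = exact)                 # about 0.52 vs 0.20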

Theorem 8.2.5. Weak Law of Large Numbers. Let X_1, X_2, ... be a sequence of independent and identically distributed random variables, each having finite mean E[X_i] = µ. Then, for any ε > 0,

P(|(X_1 + ··· + X_n)/n − µ| ≥ ε) → 0 as n → ∞

Proof. We shall prove this theorem only under the additional assumption that the random variables have a finite variance σ^2. Now, since

E[(X_1 + ··· + X_n)/n] = µ and var((X_1 + ··· + X_n)/n) = σ^2/n

it follows from Chebyshev's Inequality that

P(|(X_1 + ··· + X_n)/n − µ| ≥ ε) ≤ σ^2/(nε^2) → 0 as n → ∞
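The weak law is easy to visualize with a running sample mean (a simulation sketch added to the notes, using Exp(1) draws so that µ = 1):

# Running sample means of iid Exp(1) draws settle near mu = 1
set.seed(1)
x <- rexp(1e5, rate = 1)
running.mean <- cumsum(x) / seq_along(x)
running.mean[c(10, 100, 1000, 100000)]   # approaches 1 as n grows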

8.3 The Central Limit Theorem

The Central Limit Theorem is one of the most remarkable results in probability theory. Loosely put, it states that the sum of a large number of independent random variables has a distribution that is approximately normal. Hence, it not only provides a simple method for computing approximate probabilities for sums of independent random variables, but also helps explain the remarkable fact that the empirical frequencies of so many natural populations exhibit bell-shaped (that is, normal) curves.

Theorem 8.3.1. The Central Limit Theorem. Let X_1, X_2, ... be a sequence of independent and identically distributed random variables, each having mean µ and variance σ^2. Then the distribution of

(X_1 + ··· + X_n − nµ)/(σ√n)

tends to the standard normal as n → ∞. That is, for −∞ < a < ∞,

P((X_1 + ··· + X_n − nµ)/(σ√n) ≤ a) → (1/√(2π)) ∫_{−∞}^a e^{−x^2/2} dx as n → ∞


Lemma 8.3.2. Let Z_1, Z_2, ... be a sequence of random variables having distribution functions F_{Z_n} and moment generating functions M_{Z_n}, n ≥ 1, and let Z be a random variable having distribution function F_Z and moment generating function M_Z. If M_{Z_n}(t) → M_Z(t) for all t, then F_{Z_n}(t) → F_Z(t) for all t at which F_Z(t) is continuous.

If we let Z be a standard normal random variable, then, since M_Z(t) = e^{t^2/2}, it follows from the above lemma that if M_{Z_n}(t) → e^{t^2/2} as n → ∞, then F_{Z_n}(t) → Φ(t) as n → ∞.

Now let us produce the following proof.

Proof. Suppose µ = 0 and σ^2 = 1. We prove the theorem under the assumption that the moment generating function of the X_i, M(t), exists and is finite. Now, the moment generating function of X_i/√n is given by

E[exp(tX_i/√n)] = M(t/√n)

Thus, the moment generating function of ∑_{i=1}^n X_i/√n is given by [M(t/√n)]^n. Let

L(t) = log M(t)

and note that

L(0) = 0
L′(0) = M′(0)/M(0) = µ = 0
L″(0) = (M(0)M″(0) − [M′(0)]^2)/[M(0)]^2 = E[X^2] = 1

Now, to prove the theorem, we must show that [M(t/√n)]^n → e^{t^2/2} as n → ∞, or, equivalently, that nL(t/√n) → t^2/2 as n → ∞. To show this, note that

lim_{n→∞} L(t/√n)/n^{−1}
  = lim_{n→∞} [−L′(t/√n) n^{−3/2} t] / [−2n^{−2}], by L'Hopital's Rule
  = lim_{n→∞} [L′(t/√n) t / (2n^{−1/2})]
  = lim_{n→∞} [−L″(t/√n) n^{−3/2} t^2] / [−2n^{−3/2}], again by L'Hopital's Rule
  = lim_{n→∞} [L″(t/√n) t^2/2]
  = t^2/2

Thus, the central limit theorem is proven when µ = 0 and σ^2 = 1. The result now follows in the general case by considering the standardized random variables X_i* = (X_i − µ)/σ and applying the preceding result, since E[X_i*] = 0 and var(X_i*) = 1.


Theorem 8.3.3. Central Limit Theorem for Independent Random Variables. Let X_1, X_2, ... be a sequence of independent random variables having respective means and variances µ_i = E[X_i], σ_i^2 = var(X_i). If (a) the X_i are uniformly bounded; that is, if for some M, P(|X_i| < M) = 1 for all i, and (b) ∑_{i=1}^∞ σ_i^2 = ∞, then we have

P(∑_{i=1}^n (X_i − µ_i) / √(∑_{i=1}^n σ_i^2) ≤ a) → Φ(a) as n → ∞

8.4 The Strong Law of Large Numbers

The strong law of large numbers is probably the best-known result in probability theory. It states that the average of a sequence of independent random variables having a common distribution will, with probability 1, converge to the mean of that distribution.

Theorem 8.4.1. Let X_1, X_2, ... be a sequence of independent and identically distributed random variables, each having a finite mean µ = E[X_i]. Then, with probability 1,

(X_1 + ··· + X_n)/n → µ as n → ∞

Remark 8.4.2. Here the convergence is almost sure (that is, with probability 1), which is stronger than convergence in probability; what we mean is

P(lim_{n→∞} (X_1 + ··· + X_n)/n = µ) = 1

8.5 Other Inequalities

We are sometimes confronted with situations in which we are interested in obtaining an upper bound for a probability of the form P(X − µ ≥ a), where a is some positive value and only the mean µ = E[X] and variance σ^2 = var(X) of the distribution of X are known. Naturally, since X − µ ≥ a > 0 implies that |X − µ| ≥ a, it follows from Chebyshev's inequality that

P(X − µ ≥ a) ≤ P(|X − µ| ≥ a) ≤ σ^2/a^2 when a > 0

Proposition 8.5.1. One-sided Chebyshev Inequality. If X is a random variable with mean 0 and finite variance σ^2, then, for any a > 0,

P(X ≥ a) ≤ σ^2/(σ^2 + a^2)

Answer. Let b > 0 and note that

X ≥ a is equivalent to X + b ≥ a + b

Hence,

P(X ≥ a) = P(X + b ≥ a + b) ≤ P((X + b)^2 ≥ (a + b)^2)

where the inequality is obtained by noting that since a + b > 0, X + b ≥ a + b implies (X + b)^2 ≥ (a + b)^2. Upon applying Markov's inequality, the preceding yields that

P(X ≥ a) ≤ E[(X + b)^2]/(a + b)^2 = (σ^2 + b^2)/(a + b)^2

Letting b = σ^2/a [which is easily seen to be the value of b that minimizes (σ^2 + b^2)/(a + b)^2] gives the desired result.


Proposition 8.5.2. If E[X] = µ and var(X) = σ^2, then, for a > 0,

P(X ≥ µ + a) ≤ σ^2/(σ^2 + a^2)
P(X ≤ µ − a) ≤ σ^2/(σ^2 + a^2)

Proposition 8.5.3. Chernoff Bounds.

P(X ≥ a) ≤ e^{−ta}M(t) for all t > 0
P(X ≤ a) ≤ e^{−ta}M(t) for all t < 0

Since the Chernoff bounds hold for all t in either the positive or the negative range, we obtain the best bound on P(X ≥ a) by using the t that minimizes e^{−ta}M(t).

Proposition 8.5.4. Jensen's Inequality. If f(x) is a convex function, then

E[f(X)] ≥ f(E[X])

provided that the expectations exist and are finite.
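For instance, with the convex function f(x) = x^2, Jensen's inequality says E[X^2] ≥ (E[X])^2, which is just the statement that variances are nonnegative; a quick numerical illustration (an addition to the notes):

# Jensen's inequality with f(x) = x^2 on Exp(2) draws
set.seed(1)
x <- rexp(1e5, rate = 2)
c(E.fX = mean(x^2), f.EX = mean(x)^2)   # the first entry dominates the second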

9 Homework

Go back to Table of Contents. Please click TOC

This section attaches all the homework solutions offered by the instructor. Please see the Exam Review for more guidance.


GU4203 - Introduction to Probability

Homework 1 Solutions

1 Problems

Question 1 - a) As there are two places for letters (of which there are 26) and five places for numbers (of which there are 10), it follows that there are 26^2 · 10^5 = 67,600,000 possible license plates.

b) In the case when no letter or number can be repeated, we know that (e.g.) after picking the first letter, we have 26 − 1 = 25 possible choices of letter for the second. By applying this same reasoning for the choice of numbers, we see that there are 26 · 25 choices for the letters and 10 · 9 · 8 · 7 · 6 choices for the numbers. Therefore the total possible number of license plates is the product of these numbers, which is 19,656,000.

Question 7 - a) In this case, as we only care about the possible ways of ordering 6 people, the total number of ways is 6! = 720.

b) In this case, we first realise that we can either have the boys sitting first or the girls sitting first (2 choices). Then, within each group of boys and girls, there are 3! = 6 choices of ordering them. Therefore, the total number of ways of ordering the boys and girls in this scenario is 2 · 6 · 6 = 72.

c) In this case, we note that the first boy can be in any of the first four positions (4 choices). Then, as we have 3! possible ways of ordering the boys once the position of the first boy has been fixed (and similarly so for the girls), there are 4 · 3! · 3! = 144 possible choices.

d) We can use similar reasoning to part b) to obtain the answer of 72; we can either have BGBGBG or GBGBGB, and then we have the 3! possible ways of ordering the boys/girls separately.

Question 10 - a) As there are no restrictions on how to seat people, there are 8! = 40,320 possible choices of seating plans.

b) If persons A and B must sit next to each other, then there are two choices of who sits first, and seven choices of where the first person sits (giving 14 in total). Then as for the remaining six persons we have no restrictions on where they sit (so there are 6!), it follows that in total, there are 14 · 6! = 10,080 possible seating plans.

c) We can use similar reasoning to Q7d in order to deduce that the total number of seating plans is 2 · 4! · 4! = 1152.


d) We can use similar reasoning to Q7c in order to deduce that the total number of seating plans is 5 · 4! · 4! = 2880.

e) In this case, we have 4! possible ways of ordering the married couples, and for each married couple, there are 2 ways of ordering how they sit. Therefore, the total number of seating plans is 4! · 2^4 = 384.

Question 13 - If we have 20 people and everyone shakes hands with everyone else, then we have C(20, 2) = 190 total handshakes. This is because, for a single handshake, we need to choose 2 people from the 20 to shake hands with each other.

Question 19 - a) To simplify exposition, if a person refuses to serve with someone else, we call them "naughty". If two of the men refuse to serve together, then we need to consider the total number of possibilities when either i) 0 of them are on the committee, or ii) 1 of them is on the committee. In the first case, we are selecting 3 women from 8 and 3 men from the 4 who are not naughty, giving C(8, 3)C(4, 3) combinations. In the second case, we are selecting 3 women from 8, 2 men from the 4 men who are not naughty, and 1 man from the 2 naughty men, giving C(8, 3)C(4, 2)C(2, 1) combinations. Therefore in total there are

C(8, 3)C(4, 3) + C(8, 3)C(4, 2)C(2, 1) = 896

possible committees.

b) Using a similar argument to the above (now with two of the women refusing to serve together), we see that there are

C(6, 3)C(6, 3) + C(6, 3)C(6, 2)C(2, 1) = 1000

possible committees.

c) In this case, if neither of the naughty people are on the committee, there are C(7, 3)C(5, 3) choices of committee. Now, if the naughty man is on the committee, there are C(7, 3)C(5, 2) choices; if the naughty woman is on the committee, there are C(7, 2)C(5, 3) choices. Therefore there are, in total,

C(7, 3)C(5, 3) + C(7, 2)C(5, 3) + C(7, 3)C(5, 2) = 910

possible committees.

Question 21 - As the hint says, any path consists of 7 total moves, 4 of which are to the right and 3 of which are up. The choice of when to make the 4 right moves (or alternatively the 3 up moves) uniquely determines a path, meaning that there are C(7, 4) = C(7, 3) = 35 total paths.

Question 22 - We break the problem up into two parts; we first consider paths from A to the circled point, and then paths from the circled point to B. As for the first there are C(4, 2) possible paths, and for the second C(3, 1), it follows that there are C(4, 2)C(3, 1) = 18 paths in total which go through the circled point.


Question 30 - First consider the case when we are only concerned about seating the French and English delegates together. In this case, we have 9 choices of where the first one sits, and 2 choices of the order in which they sit. For the remaining delegates there are 8! possible choices of where they sit, giving 18 · 8! choices in total.

Now, to get the desired number of seating arrangements, it is enough to subtract the total number of seating arrangements when both the French and English, and the Russian and US delegates, are sitting next to each other. To calculate this, suppose the chairs are labelled 1 through to 10, and the seating numbers of the countries are F, E, R and U respectively. Then in order to determine the seating positions of the pairs FE and RU, it is enough to consider only the smallest number of the pair who have the highest numbers, and the largest number of the pair with the smallest numbers. If you have trouble seeing this, draw a diagram. This corresponds to 8 · 7 possible choices (as the numbers we are picking from are 2 up to 9). We then have 2 choices each of the ordering within a pair, giving 2^2 · 8 · 7 possible choices for these four delegates. As we don't care about the placement of the remaining 6 delegates, we have 6! possible choices of ordering for them. In total, this means we have 2^2 · 8 · 7 · 6! = 2^2 · 8! possible seating arrangements.

Therefore the final answer is that there are 18 · 8! − 2^2 · 8! = 14 · 8! = 564,480 possible seating arrangements.

Question 31 - For the first part, we can identify that this is the same problem as asking for the total number of non-negative integer solutions to the equation x_1 + x_2 + x_3 + x_4 = 8, and so the total number of divisions is C(8 + 4 − 1, 4 − 1) = C(11, 3) = 165. For the second part, we are now after the total number of positive integer solutions, and so the total number of divisions is C(8 − 1, 4 − 1) = C(7, 3) = 35.

2 Theoretical Exercises

Question 5 - Firstly, we want to determine the number of 0-1 vectors (x_1, ..., x_n) such that ∑_{i=1}^n x_i = j. As to do so we simply need to select j of the n entries to be equal to 1 and the rest 0, it follows that there are C(n, j). Therefore, as

∑_{i=1}^n x_i ≥ k ⟺ ∑_{i=1}^n x_i ∈ {k, k + 1, ..., n},

it follows that the total number of vectors which satisfy the criterion is ∑_{i=k}^n C(n, i).

Question 8 - Using the hint provided, there are C(n + m, r) groups of size r in total. Furthermore, the number of groups which have i men (and therefore r − i women) is C(n, i)C(m, r − i) for i = 0, 1, ..., r. Therefore, if we do not care about the number of men in the group, we can sum over the i to get the total number of possible groups, and so

∑_{i=0}^r C(n, i)C(m, r − i) = C(n + m, r).


Question 9 - This is a special case of the above formula: by setting n = m, we see that

C(2n, n) = ∑_{i=0}^n C(n, i)C(n, n − i) = ∑_{i=0}^n C(n, i)C(n, i) = ∑_{i=0}^n C(n, i)^2.

Question 11 - Using the hint provided, we first consider the set [n] := {1, ..., n}. We want to calculate the number of subsets S of size k which have i as their highest number. If i is the highest number contained in S, then we must have that S ⊆ [i]. Therefore, as i is contained in S, and we have k − 1 remaining choices of numbers from [i − 1], there are C(i − 1, k − 1) choices in total.

To conclude, it is enough to realize that if we have a subset S ⊆ [n] of size k, then the highest number contained in S could be any of k through to n, and so

C(n, k) = ∑_{i=k}^n C(i − 1, k − 1).

Question 13 - This is an immediate consequence of using the binomial formula to expand 0 = (1 − 1)^n. Although this seems like a cute result and nothing more, it does have one useful interpretation - it tells us that the total number of subsets of even size is equal to the total number of subsets of odd size. (Why?)


GU4203 - Introduction to Probability

Homework 2 Solutions

1 Problems

Question 3 - We can describe the events as follows:

E ∩ F = {(1, 2), (1, 4), (1, 6), (2, 1), (4, 1), (6, 1)}
E ∪ F = {(x, y) : x + y is odd or at least one of x, y is 1}
F ∩ G = {(1, 4), (4, 1)}
E ∩ F^c = {(x, y) : x + y is odd and both x and y are > 1}
E ∩ F ∩ G = F ∩ G (as G ⊂ E).

Question 6 - a) The sample space is Ω := {(1, g), (1, f), (1, s), (0, g), (0, f), (0, s)}.

b) A is given by {(1, s), (0, s)}.

c) B is given by {(0, g), (0, f), (0, s)}.

d) A ∪ B^c is given by {(1, s), (0, s), (1, f), (1, g)}.

Question 11 - Let A = {smokes cigarettes} and B = {smokes cigars}, so P(A) = 0.28, P(B) = 0.07 and P(A ∩ B) = 0.05. Then for each part of the question, we are interested in calculating the following probabilities:

a) 1 − P(A ∪ B) = 1 − (P(A) + P(B) − P(A ∩ B)) = 0.7 after substituting the above values in;

b) P(A^c ∩ B) = P(B) − P(A ∩ B) = 0.02 after substituting the above values in.

Question 15 - As all poker hands are assumed to be equally likely, we are only really concerned with calculating the possible number of hands with the desired property, as then we can divide by C(52, 5) in order to get the probability. Therefore, I will explain only how to count the possible number of hands for each part of the question:

a) In this case, we have 4 possible choices of suit, and then 5 choices from 13 cards of the same suit. This gives 4C(13, 5) possible choices in total.

b) Firstly, let us focus on the pair. For the pair, we have 13 choices of number, and then C(4, 2) choices of suit, giving us 13 · C(4, 2) possible pairs in total. For the remaining three cards, we need


to pick 3 denominations from the remaining 12 (of which there are C(12, 3)), and we can choose any suit for these three cards (4 for each card), giving us 4^3 · C(12, 3) possible choices for the remaining three cards. Therefore the total number of hands with one pair is 13 · 4^3 · C(4, 2) · C(12, 3).

c) We begin by focusing on the two pairs. We need to choose 2 denominations from 13 for the two pairs, of which there are C(13, 2); for each pair, we then have C(4, 2) choices of the suit in each case. As we require the final denomination to be different from the first two, there are 52 − 2 · 4 = 44 cards from which we can pick the last card. Therefore the total number of hands with two pairs is 44 · C(13, 2)C(4, 2)^2.

d) We begin by focusing on the three of a kind. We need to choose 1 denomination from 13 and 3 suits from 4, giving 13 · C(4, 3) possible combinations of a three of a kind. For the remaining two cards, we need to select 2 different denominations from 12, and then choose the suit of each card, giving us 4^2 · C(12, 2) possible choices of the remaining two cards. Therefore, the total number of three of a kind hands is 13 · 4^2 · C(12, 2)C(4, 3).

e) We have 13 choices of the card of which we have four in our hand, and then C(48, 1) possible choices for the last card in our hand, giving 13C(48, 1) possible hands in total.

The final numerical probabilities are then as follows: a) 0.198%, b) 42.3%, c) 4.75%, d) 2.11%, e) 0.024%.

Question 25 - As the hint suggests, we want to compute the probability of the event E_n where a 5 occurs on the n-th roll, yet neither a 5 nor a 7 occurs before then. We begin by noting that the only ways of rolling two dice to sum to 5 or 7 are as follows:

5 = 4 + 1 = 3 + 2 = 2 + 3 = 1 + 4;
7 = 6 + 1 = 5 + 2 = 4 + 3 = 3 + 4 = 2 + 5 = 1 + 6.

Therefore the probability that, on a single roll of a pair of dice, neither a 5 nor a 7 occurs is 26/36, and the probability that a 5 occurs is 4/36. As consecutive rolls of the dice are independent, it follows that P(E_n) = (26/36)^{n−1} · 4/36. The desired probability is then given by

P(5 occurs before a 7) = ∑_{n=1}^∞ P(5 occurs before a 7, first obtain a 5 or 7 on the n-th roll)
                       = ∑_{n=1}^∞ P(E_n) = ∑_{n=1}^∞ (26/36)^{n−1} · (4/36)
                       = 2/5 (after using the formula for geometric progressions).

5(after using the formula for geometric progressions).

Question 27 - We can model this problem by letting A and B draw all the balls from theurn, and then compute the probability that A draws the first red ball from the urn. If wewere to label the order in which balls are drawn from 1 to 10, this means that we want tocompute the probability that the first red ball drawn appears in an odd numbered position.

2

2018 Fall Probability Theory [Yiqiao Yin] §9

Page 61

Page 62: Probability Theory (Undergraduate) - WordPress.com · 2018-12-11 · 2018 Fall Probability Theory [Yiqiao Yin] §1 1 Counting Method Go back to Table of Contents. Please click TOC

GU4203 HM2 Solutions

To compute the probability, we begin by noting that there are 10! possible ways in whichthe balls could be drawn from the urn, and that all of these ways are equally likely. Now, inorder for A to select a red ball first, the following can occur

The first red ball appears in the 1st position - there are 3 choices of red ball to beginwith and 9! for the remaining nine balls whose order we do not worry about, giving3 · 9! choices in total;

The first red ball appears in the 3rd position - there are 7 ·6 choices of white ball for thefirst two positions, 3 choices for the first red ball, and then 7! choices for the remainingseven balls whose order we do not worry about, giving 7 · 6 · 3 · 7! choices in total;

The first red ball appears in the 5th position - there are 7 · 6 · 5 · 4 choices of white ballfor the first four positions, 3 choices for the first red ball, and then 5! choices for theremaining five balls whose order we do not worry about, giving 7 · 6 · 5 · 4 · 3 · 5! choicesin total;

The first red appears in the 7th position - there are 7 · 6 · 5 · 4 · 3 · 2 choices of whiteball for the first six positions, 3 choices for the first red ball, and 3! choices for theremaining three balls whose order we do not worry about, giving 7 · 6 · 5 · 4 · 3 · 2 · 3 · 3!choices in total.

By summing over the number of choices in each of the four scenarios here, and diving by10!, we eventually find that the probability is equal to 7/12.

Question 33 - We begin by stating our assumptions - we assume that all of the elk areequally likely to be captured during both occasions, and that whether a elk is captured ornot the first time is independent of whether the same elk is captured or not the second time.Now, we note that we have

(204

)possible combinations of elk who are captured the second

time around. If 2 of the captured elk are tagged, we need to select 2 elk from the 5 originallycaptured (giving

(52

)) and 2 elk from the remaining untagged 15 (giving

(152

)). Under the

assumptions stated, the desired probability is given by(52

)(152

)/(204

)= 70/323 = 21.7%.

Question 42 - If two dice are thrown n times in succession, the probability that a double six never occurs is (35/36)^n, as successive rolls are independent and the probability that a double six is not rolled on one occasion is 1 − 1/36. Therefore, the probability that a double six is rolled at least once is 1 − (35/36)^n. If we want this probability to be at least 1/2, then

1 − (35/36)^n ≥ 1/2 ⟺ (35/36)^n ≤ 1/2 ⟺ n ≥ log(1/2)/log(35/36),

meaning the smallest number n necessary is 25.

Question 53 - We use the Inclusion-Exclusion principle. Let A_i be the event that the i-th couple sits next to each other; we are therefore interested in the probability 1 − P(A_1 ∪ A_2 ∪ A_3 ∪ A_4). Now, in total there are 8! possible arrangements of the 4 couples, all of which we can consider to be equally likely. As the couples are interchangeable, the Inclusion-Exclusion formula simplifies to

P(A_1 ∪ A_2 ∪ A_3 ∪ A_4) = 4P(A_1) − 6P(A_1 ∩ A_2) + 4P(A_1 ∩ A_2 ∩ A_3) − P(A_1 ∩ A_2 ∩ A_3 ∩ A_4).


The probabilities in the above formula are then given by:

• P(A_1) = 2 · 7!/8! - We have 7 choices of the location of the first partner (and two choices for the order they sit in), and 6! choices of the positions of the remaining partners.

• P(A_1 ∩ A_2) = 2^2 · 6!/8! - We have 6! choices of the location for the two pairs of partners and the remaining 4 individuals. We then have 2 choices for the order a couple sits in for both couples (giving 2^2 choices). Remember that the above procedure determines the positions of the third and fourth couples.

• P(A_1 ∩ A_2 ∩ A_3) = 2^3 · 5!/8! - We have 5! choices of location for the three pairs of partners and the remaining 2 individuals. We then have 2^3 choices in total for the order in which the first, second and third couples sit.

• P(A_1 ∩ A_2 ∩ A_3 ∩ A_4) = 2^4 · 4!/8! - At this point, we are only concerned with how we can order four couples next to each other (4! in total), and how we can arrange the partners in each couple (2 per married couple, giving 2^4 in total).

Substituting these into the above formula, and then subtracting it from 1, gives a final answer of 12/35 = 34.3%.

Question 54 - We use the Inclusion-Exclusion principle. Let S be the event that a bridge hand is void of a spade, and similarly define events C, D and H for clubs, diamonds and hearts respectively. We want to compute the probability P(S ∪ C ∪ D ∪ H). Now, as all of the suits are equally likely, we have e.g. P(C ∩ D) = P(S ∩ H), and so the Inclusion-Exclusion formula simplifies down to

P(S ∪ C ∪ D ∪ H) = 4P(S) − 6P(S ∩ H) + 4P(S ∩ H ∩ D).

Note that P(S ∩ C ∩ D ∩ H) = 0, as a hand cannot be devoid of all four suits. Now, as all bridge hands are equally likely (giving C(52, 13) in total), we see that

• P(S) = C(39, 13)/C(52, 13), as we need to choose 13 cards from the 39 cards which are not spades;

• P(S ∩ H) = C(26, 13)/C(52, 13), as we need to choose 13 cards from the 26 cards which are not spades or hearts;

• P(S ∩ H ∩ D) = C(13, 13)/C(52, 13), as we need to choose 13 cards from the 13 cards which are not spades, hearts or diamonds.

Substituting these into the above formula then gives a probability of approximately 5.1%.

2 Theoretical Exercises

Question 5 - We want to find a disjoint collection of F_i such that ∪_{i=1}^m F_i = ∪_{i=1}^m E_i for all m ≥ 1, given a (potentially countably infinite) sequence of events E_i. Note that the m = 1 case tells us that F_1 := E_1. For the m = 2 case, note that we can write

F_1 ∪ F_2 = E_1 ∪ E_2 = E_1 ∪ (E_2 ∩ E_1^c)


so both the left and right hand side are disjoint unions; as F_1 = E_1, we therefore can choose F_2 := E_2 ∩ E_1^c. If we keep repeating this, we begin to see a pattern forming, from which we decide to choose

F_i := E_i ∩ (⋂_{j=1}^{i−1} E_j^c).

To prove that this has the desired properties, first note that if i < j then F_i ∩ F_j ⊆ E_i ∩ E_i^c = ∅, so the F_i are pairwise disjoint. Then in order to show that ∪_{i=1}^m F_i = ∪_{i=1}^m E_i for all m ≥ 1, we do so by induction. The m = 1 case is immediate. Then if the statement is true for m = n, we see that it is true for m = n + 1 as

⋃_{i=1}^{n+1} F_i = (⋃_{i=1}^n F_i) ∪ F_{n+1} = (⋃_{i=1}^n E_i) ∪ (E_{n+1} ∩ ⋂_{i=1}^n E_i^c)
                 = (⋃_{i=1}^n E_i ∪ E_{n+1}) ∩ (⋃_{i=1}^n E_i ∪ (⋃_{i=1}^n E_i)^c)
                 = ⋃_{i=1}^n E_i ∪ E_{n+1} = ⋃_{i=1}^{n+1} E_i,

where we have used de Morgan's laws and the distributivity properties of unions/intersections.

Question 11 - Bonferroni's inequality follows as a simple consequence of the fact that probabilities are bounded above by 1, and then some rearranging:

1 ≥ P(E ∪ F) = P(E) + P(F) − P(E ∩ F).

Question 12 - As the event of interest is a disjoint union of the events E ∩ F^c and F ∩ E^c, we get that the desired probability is

P(E ∩ F^c) + P(F ∩ E^c) = P(E) − P(E ∩ F) + P(F) − P(E ∩ F)
                        = P(E) + P(F) − 2P(E ∩ F).


GU4203 - Introduction to Probability

Homework 3 Solutions

1 Problems

Question 1 - Let A be the event that at least one die rolls a six and B be the event that the two dice rolled are different. Then P(B) = 5/6 (as we simply need the second die to be one of the five possible values different from that obtained by the first die) and

P(A ∩ B) = P(1st die = 6, 2nd die ≠ 6) + P(1st die ≠ 6, 2nd die = 6) = 2 · (1/6) · (5/6).

Therefore the desired probability is P(A|B) = P(A ∩ B)/P(B) = 1/3.

Question 4 - Let S be the sum of the values of the two dice, and A be the event that at least one of the dice lands on a 6. To compute the conditional probabilities, as the dice rolls are equally likely, it suffices to calculate the proportion

(number of dice rolls which sum to S = i and contain one six) / (number of dice rolls which sum to S = i).

These can be calculated simply by writing out the possible dice roll combinations which sum to S = i, and then counting the total number and the number which contain at least one six. The desired probabilities are then given as follows:

P(A|S = i) = 0 for 2 ≤ i ≤ 6;
P(A|S = 7) = 1/3;
P(A|S = 8) = 2/5;
P(A|S = 9) = 1/2;
P(A|S = 10) = 2/3;
P(A|S = i) = 1 for 11 ≤ i ≤ 12.

Question 5 - On the first pick, we have a probability of 6/15 of picking a white ball. Then for the second pick, we have 5 white balls and 9 black balls, so the probability of picking a white ball now is 5/14. For the third pick, we have 4 white balls and 9 black balls, so the probability of picking a black ball is 9/13. Finally, for the last pick, we have 4 white balls and 8 black balls, so the probability of picking a black ball is 8/12. Multiplying these together gives the desired probability, which is equal to 6/91 after some simplification.


Question 6 - Given that the sample drawn contains exactly 3 white balls (and so only 1 black ball), the black ball is equally likely to be in any of the 4 positions. In other words, conditional on the sample drawn, the probability that the i-th ball drawn is white equals the probability that it is black, and so the answer (in both cases) is 1/2.

Question 7 - Here we suppose that the two children are male or female with equal probability and independently of each other. Now, given what we know, the probability of interest is

P(one boy, one girl | at least one boy) = P(one boy, one girl)/P(at least one boy) = (1/2)/(3/4) = 2/3.

Question 10 - This can either be done by using the definition of conditional probability, or (as we do) by employing a symmetry argument. This allows us to say that it is equivalent to consider the conditional probability as if the second and third draws from the deck were actually the first and second, and the first draw as being the third after two spades were drawn. If you do not believe this immediately, let A_i be the event that the i-th draw from the deck is a spade, and note that

P(A_1|A_2, A_3) = P(A_1 ∩ A_2 ∩ A_3)/P(A_2 ∩ A_3) = P(A_1 ∩ A_2 ∩ A_3)/P(A_1 ∩ A_2) = P(A_3|A_1, A_2).

The latter probability is then given by 11/50, as there are 11 spades remaining from 50 cards.

Question 14 - a) The probability of the first ball selected being black is 5/12. Afterwards, there are 7 black balls and 7 white balls, meaning the probability of the second ball selected being black is 7/14. Then there are 9 black balls and 7 white balls, meaning the probability of the third ball selected being white is 7/16. Finally, there are 9 white balls and 9 black balls, so the probability of the fourth and last ball selected being white is 9/18. Multiplying these four probabilities gives the desired result of 35/768.

b) There are two ways of approaching this problem. Letting W represent a white ball, and B a black ball, we can recognize that the probability we are interested in is equal to

P(WWBB) + P(WBWB) + P(BWBW) + P(BWWB) + P(WBBW) + P(BBWW).

We can then either compute each of these probabilities by hand and note that they are all equal to 35/768, or argue by symmetry that they are all equal, and so by part a) they are all equal to 35/768. In either case, we see that the desired probability is 210/768 = 0.273.

Question 15 - Let E be the event that a pregnant woman has an ectopic pregnancy, and S be the event that she is a smoker. Extracting information from the question, we see that P(E|S) = 2P(E|S^c) and P(S) = 0.32. Then by Bayes' theorem, we find that

P(S|E) = P(E|S)P(S) / [P(E|S)P(S) + P(E|S^c)P(S^c)]
       = 2P(S) / [2P(S) + 1 − P(S)]
       = 0.64/1.32 = 32/66 = 0.4848.


Question 19 - a) Let A be the event that a person attends the party, W be the event that this person is a woman, and M = W^c be the event that this person is a man. Then by Bayes' theorem,
\[
P(W \mid A) = \frac{P(A \mid W)P(W)}{P(A \mid W)P(W) + P(A \mid M)P(M)} = \frac{0.48 \cdot 0.38}{0.48 \cdot 0.38 + 0.37 \cdot 0.62} \approx 0.443.
\]
As the question asks us to report a percentage, the answer is that 44.3% of the attendees at the party were women.

b) By the law of total probability, we have that
\[
P(A) = P(A \mid W)P(W) + P(A \mid M)P(M) = 0.48 \cdot 0.38 + 0.37 \cdot 0.62 \approx 0.412,
\]
and so 41.2% of the class attended the party.


GU4203 - Introduction to Probability

Homework 4 Solutions

1 Problems

Question 23 - a) Let R be the event that a red ball is transferred from urn I to urn II, and W be the event that it is a white ball instead. Then we know that P(R) = 2/3 and P(W) = 1/3. Let A be the event that the ball selected from urn II is white. Then by the law of total probability, we have that
\[
P(A) = P(A \mid W)P(W) + P(A \mid R)P(R) = \frac{2}{3} \cdot \frac{1}{3} + \frac{1}{3} \cdot \frac{2}{3} = \frac{4}{9}.
\]

b) By Bayes' theorem, we see that
\[
P(W \mid A) = \frac{P(A \mid W)P(W)}{P(A)} = \frac{\frac{2}{3} \cdot \frac{1}{3}}{\frac{4}{9}} = \frac{1}{2}.
\]

Question 27 - Honestly, this question is worded rather badly. On a brief philosophical note, either method could be used to estimate this quantity; the point is that one is better than the other. I could roll a die and tell you that the face-up value is an "estimate" of the average number of workers in a car; of course, as the two processes are completely independent, this would be a useless estimate.

Anyway, the second method is the better way of estimating this quantity. As we are interested in the average number of workers per car, we want to sample from the cars, since the number of workers inside is a property of the car. The first method is flawed because it tends to over-sample cars carrying many workers: if we sample multiple people from the same crowded car, we will over-estimate the average number of workers per car.

Question 32 - Let E be the event that the eldest child in the family is chosen, and let F_j be the event that the family selected has j children. Then by Bayes' theorem, we have that
\[
P(F_j \mid E) = \frac{P(E \mid F_j)P(F_j)}{\sum_{i=1}^{4} P(E \mid F_i)P(F_i)} = \frac{p_j/j}{\sum_{i=1}^{4} p_i/i}.
\]
Using this formula then gives the answers a) 0.24, b) 0.18; nothing changes when repeating the calculation for the case where the randomly selected child is the youngest, so the answers are the same again.
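As a quick numerical check, here is a minimal Python sketch of this computation. The family-size probabilities p_1 = 0.1, p_2 = 0.25, p_3 = 0.35, p_4 = 0.3 are an assumption inferred from the stated answers, so verify them against your copy of the textbook.

```python
# Hedged sketch: the p_j values below are assumed (inferred from the stated
# answers 0.24 and 0.18), not quoted from the textbook directly.
p = {1: 0.10, 2: 0.25, 3: 0.35, 4: 0.30}

# P(F_j | E) = (p_j / j) / sum_i (p_i / i): a family has j children with
# probability p_j, and a given child of such a family is the eldest with
# probability 1/j.
denom = sum(p_j / j for j, p_j in p.items())
posterior = {j: (p_j / j) / denom for j, p_j in p.items()}

print(round(posterior[1], 2))  # a) 0.24
print(round(posterior[4], 2))  # b) 0.18
```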


Question 43 - By Bayes' theorem, the probability P(two-headed coin | heads) is given by
\[
\frac{P(\text{heads} \mid \text{two-headed})P(\text{two-headed})}{P(\text{heads} \mid \text{two-headed})P(\text{two-headed}) + P(\text{heads} \mid \text{fair})P(\text{fair}) + P(\text{heads} \mid \text{biased})P(\text{biased})},
\]
so substituting in the various quantities gives us that the solution is
\[
\frac{\frac{1}{3} \cdot 1}{\frac{1}{3} \cdot 1 + \frac{1}{3} \cdot \frac{1}{2} + \frac{1}{3} \cdot \frac{3}{4}} = \frac{4}{9}.
\]

Question 44 - This problem is similar to the Monty Hall problem. The jailer's reasoning is faulty: by revealing that one of his fellow prisoners is to be set free, the probability that the other fellow prisoner is to be executed rises to 2/3, while his own probability stays the same at 1/3.

Question 45 - By Bayes' theorem,
\[
P(\text{fifth coin} \mid H) = \frac{P(H \mid \text{fifth coin})P(\text{fifth coin})}{\sum_{i=1}^{10} P(H \mid i\text{-th coin})P(i\text{-th coin})} = \frac{\frac{5}{10} \cdot \frac{1}{10}}{\sum_{i=1}^{10} \frac{i}{10} \cdot \frac{1}{10}} = \frac{1}{11}.
\]

Question 49 - Let C be the event that the patient has cancer, and let E be the event that the test indicates an elevated PSA level. Writing p := P(C), we obtain by Bayes' theorem that
\[
P(C \mid E) = \frac{P(E \mid C)P(C)}{P(E \mid C)P(C) + P(E \mid C^c)P(C^c)} = \frac{0.268p}{0.268p + 0.135(1-p)},
\]
\[
P(C \mid E^c) = \frac{P(E^c \mid C)P(C)}{P(E^c \mid C)P(C) + P(E^c \mid C^c)P(C^c)} = \frac{0.732p}{0.732p + 0.865(1-p)}.
\]
Therefore if p = 0.7, we have a) 0.8224, b) 0.6638; if p = 0.3, we have a) 0.4597, b) 0.2661.
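Since both answers depend on the prior p, a small helper function makes them easy to evaluate; the following is a minimal sketch of the calculation above.

```python
def psa_posteriors(p):
    """Return (P(C|E), P(C|E^c)) for prior P(C) = p, using the rates
    P(E|C) = 0.268 and P(E|C^c) = 0.135 from the question."""
    e_c, e_nc = 0.268, 0.135
    post_e = e_c * p / (e_c * p + e_nc * (1 - p))
    post_ne = (1 - e_c) * p / ((1 - e_c) * p + (1 - e_nc) * (1 - p))
    return post_e, post_ne

print(psa_posteriors(0.7))  # approximately (0.8224, 0.6638)
print(psa_posteriors(0.3))  # approximately (0.4597, 0.2661)
```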

Question 51 - a) Let R be the event that the worker receives a job offer. Then by the law of total probability, we have that
\[
P(R) = P(R \mid \text{strong})P(\text{strong}) + P(R \mid \text{moderate})P(\text{moderate}) + P(R \mid \text{weak})P(\text{weak}) = 0.8 \cdot 0.7 + 0.4 \cdot 0.2 + 0.1 \cdot 0.1 = 0.65.
\]

b), c) These problems boil down to using Bayes' theorem twice, noting that
\[
P(A \mid R) = \frac{P(R \mid A)P(A)}{P(R)}, \qquad P(A \mid R^c) = \frac{(1 - P(R \mid A))P(A)}{1 - P(R)}
\]
for A ∈ {strong, moderate, weak}. The desired probabilities are then, in the order in which the textbook asks for them, 56/65, 8/65, 1/65, 14/35, 12/35, 9/35.

Question 55 - Let x be the number of sophomore girls present. Then as class and sex each take only two values, it is in fact sufficient to check that the events of being a boy (B) and being a first-year (F) are independent. As
\[
P(B, F) = \frac{4}{16 + x}, \qquad P(B) = \frac{10}{16 + x}, \qquad P(F) = \frac{10}{16 + x},
\]


it follows that B and F are independent if and only if
\[
\frac{4}{16 + x} = \frac{100}{(16 + x)^2} \iff 4 = \frac{100}{16 + x} \iff x = 9.
\]

Question 64 - Strategy a) gives a probability p of getting the correct answer. For strategy b), we have by the law of total probability that
\[
P(\text{win}) = P(\text{win} \mid \text{both correct})p^2 + P(\text{win} \mid \text{only one correct}) \cdot 2p(1-p) + P(\text{win} \mid \text{neither correct})(1-p)^2 = p^2 + p(1-p) = p,
\]
where P(win | only one correct) = 1/2, since the coin flip then picks the correct spouse's answer half the time. Therefore both strategies have the same chance of being successful. Here's a fun thing to think about - is there a better strategy than either a) or b)?

Question 70 - Let C be the event that the queen is a carrier, and A be the event that the three princes do not have the disease. Then by Bayes' theorem,
\[
P(C \mid A) = \frac{P(A \mid C)P(C)}{P(A \mid C)P(C) + P(A \mid C^c)P(C^c)} = \frac{\frac{1}{8} \cdot \frac{1}{2}}{\frac{1}{8} \cdot \frac{1}{2} + 1 \cdot \frac{1}{2}} = \frac{1}{9}.
\]
Now suppose there is a fourth prince. If the queen is a carrier (which has probability 1/9 given our knowledge), then there is a 1/2 probability of the prince having haemophilia; if the queen is not a carrier, then there is zero chance of the prince having haemophilia. Therefore by the law of total probability, the probability that the fourth prince has haemophilia is 1/18.


GU4203 - Introduction to Probability

Homework 5 Solutions

1 Problems

Question 4 - Firstly, note that X can only take values in {1, ..., 6}, as the largest value of X occurs when the five men hold ranks 1-5 and the women hold ranks 6-10, meaning that X = 6. This argument also explains why
\[
P(X = 6) = \frac{5}{10} \cdot \frac{4}{9} \cdot \frac{3}{8} \cdot \frac{2}{7} \cdot \frac{1}{6} = \frac{1}{252},
\]
as we need the first five ranks to be filled by men, one after another. Now, note that if X = i, then men take all the positions 1, ..., i-1 and a woman takes the i-th rank. Therefore, by the usual counting arguments we find that P(X = 1) = 1/2, and for i = 2, ..., 5,
\[
P(X = i) = \left(\prod_{j=0}^{i-2} \frac{5-j}{10-j}\right) \cdot \frac{5}{10-i+1}.
\]
Evaluating this expression tells us that P(X = 2) = 5/18, P(X = 3) = 5/36, P(X = 4) = 10/168, P(X = 5) = 5/252.

Question 10 - By the definition of conditional probability, we have for i ∈ {1, 2, 3} that
\[
P(X = i \mid X > 0) = \frac{P(X = i, X > 0)}{P(X > 0)} = \frac{P(X = i)}{P(X = 1) + P(X = 2) + P(X = 3)}.
\]
Substituting in the p(i) then tells us that P(X = 1 | X > 0) = 39/55, P(X = 2 | X > 0) = 3/11, P(X = 3 | X > 0) = 1/55.

Question 17 - a) Recall that for a random variable X with distribution function F, P(X = i) = F(i) − lim_{x↑i} F(x). If you find this notation confusing, then as F is increasing, it is sufficient to consider the limit
\[
P(X = i) = F(i) - \lim_{n \to \infty} F\left(i - \frac{1}{n}\right)
\]


instead. Applying this result tells us that
\[
P(X = 1) = F(1) - \lim_{x \uparrow 1} F(x) = \frac{1}{2} - \frac{1}{4} = \frac{1}{4},
\]
\[
P(X = 2) = F(2) - \lim_{x \uparrow 2} F(x) = \frac{11}{12} - \frac{3}{4} = \frac{1}{6},
\]
\[
P(X = 3) = F(3) - \lim_{x \uparrow 3} F(x) = 1 - \frac{11}{12} = \frac{1}{12}.
\]

b) We use that
\[
P\left(\frac{1}{2} < X < \frac{3}{2}\right) = P\left(X < \frac{3}{2}\right) - P\left(X \le \frac{1}{2}\right) = \lim_{x \uparrow 3/2} F(x) - F\left(\frac{1}{2}\right) = \left(\frac{1}{2} + \frac{1}{8}\right) - \frac{1}{8} = \frac{1}{2}.
\]

Question 20 - a) Note that we leave with some winnings (i.e. X > 0) if and only if we either

• win on the first roulette spin (which occurs with probability 18/38), or
• lose on the first roulette spin and then win on the two subsequent roulette spins (as the roulette spins are independent, this occurs with probability (20/38) · (18/38)²).

The probability that X > 0 is the sum of these probabilities, which is approximately 0.5918.

b) Really, this depends on what you consider to be a winning strategy. If you're simply concerned with playing the game only once, and you care only about whether you win or not (and not about the amount of money you win or lose), then you can look at the above probability, note that it is larger than 0.5, and conclude that as you're more likely to win than not, you believe it is a "winning strategy". However, if you care about how much you lose, then you should really consider the expectation E[X]; in the next part of the question, we see that this is negative, and therefore we would not consider it to be a "winning strategy".

The take-home message of this question is that you need to make a judgement about what makes a "winning strategy", and then use some property of the distribution of X to help you decide whether the strategy you use is actually a "winning" one or not. (This is a very brief and basic introduction to a branch of statistics known as decision theory.)

c) Note that the probability we computed in a) is also P(X = 1). If we do not win, then we must lose either

• $1 - this occurs only if we lose the first game of roulette and then win exactly one of the two subsequent games, which occurs with probability (20/38) · (2 · (18/38) · (20/38));
• $3 - this occurs only if we lose all three games of roulette, which occurs with probability (20/38)³.


The expectation of X is therefore given by
\[
E[X] = P(X = 1) - P(X = -1) - 3P(X = -3) = \frac{18}{38} + \frac{20}{38}\left(\frac{18}{38}\right)^2 - \frac{20}{38}\left(2 \cdot \frac{18}{38} \cdot \frac{20}{38}\right) - 3\left(\frac{20}{38}\right)^3 \approx -0.108.
\]

Question 21 - a) Intuitively, we would expect E[X] to be larger, as the student is more likely to have come from a bus carrying a large number of students, whereas the bus driver is selected at random and so there is no bias towards buses with a larger than average number of students.

b) We have that
\[
E[X] = 40 \cdot \frac{40}{148} + 33 \cdot \frac{33}{148} + 25 \cdot \frac{25}{148} + 50 \cdot \frac{50}{148} \approx 39.28,
\]
\[
E[Y] = \frac{40 + 33 + 25 + 50}{4} = 37,
\]
where 148 is the total number of students. (Note that E[X] is indeed larger than E[Y], as expected.)

Question 26 - a) Let X be the random variable representing the number of questions required to guess the chosen value. Then as the number is chosen uniformly at random from 1 to 10, and the i-th question we ask is "Is it i?", it follows that P(X = i) = 1/10. Therefore
\[
E[X] = \sum_{i=1}^{10} i\,P(X = i) = \frac{1}{10}\sum_{i=1}^{10} i = \frac{11}{2}.
\]

b) The easiest way to consider this problem is to draw a tree diagram. As drawing this in TeX would be a pain, I'll describe what the process should look like:

• For the first question, we can ask whether the chosen number is in a list of five numbers or not; after this question, we will always be left with five numbers remaining.
• For the second question, we can ask whether the chosen number is in a set of three numbers or a set of two numbers; the chosen number is in the set of three with probability 3/5 and in the set of two with probability 2/5.
• For the third question, we then have two possibilities:
  – If we narrowed down to three numbers with the last question, we then ask whether the number belongs to a set of two (with probability 2/3) or a set of one (with probability 1/3). If the number belongs to the set of one, then we know what the chosen number is, and so we're done.
  – If we narrowed down to two numbers with the last question, then we simply pick one of them and ask whether it is the chosen number; in either case, after asking this question we know what the chosen number is, and so we're done.


If a fourth question is necessary, then we must have been left with three numbers after the second question and two after the third; by the same reasoning as above, this final question tells us what the chosen value was.

Combining the above scenarios, we see that the expected number of questions asked is
\[
3\left(\frac{2}{5} + \frac{3}{5} \cdot \frac{1}{3}\right) + 4 \cdot \frac{3}{5} \cdot \frac{2}{3} = \frac{17}{5}.
\]

(A fun thing to think about for five minutes, and nothing more - is this the optimal strategy?)

Question 28 - Let X be the number of defective items in the sample. As 4 of the items are defective and 16 are not, and all are equally likely to be sampled, we know that
\[
P(X = i) = \frac{\binom{4}{i}\binom{16}{3-i}}{\binom{20}{3}},
\]
where we use the convention that \binom{n}{0} = 1. Therefore it follows that
\[
E[X] = \sum_{i=0}^{3} i\,\frac{\binom{4}{i}\binom{16}{3-i}}{\binom{20}{3}} = \frac{3}{5}.
\]
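This hypergeometric expectation is easy to verify directly; the following sketch sums the probability mass function term by term.

```python
from math import comb

# P(X = i) for i defective items in a sample of 3, drawn from 4 defective
# and 16 working items.
pmf = [comb(4, i) * comb(16, 3 - i) / comb(20, 3) for i in range(4)]

print(sum(pmf))                               # 1.0 (sanity check)
print(sum(i * p for i, p in enumerate(pmf)))  # 0.6 = 3/5
```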

Question 33 - Let N be the number of newspapers which the newsboy buys, and X be his daily demand, so X ~ B(10, 1/3). The newsboy's profit is given by
\[
f(X, N) = 0.15 \min\{X, N\} - \frac{N}{10}
\]
(as the newsboy can only sell up to N papers when he buys N). We then want to find the value of N which maximizes g(N) = E[f(X, N)]. One can calculate that
\[
g(N) = E[f(X, N)] = \sum_{i=0}^{10} P(X = i)\left(0.15 \min\{i, N\} - \frac{N}{10}\right) = 0.15\left(\sum_{i=0}^{N} i\,P(X = i) + N\,P(X > N)\right) - \frac{N}{10}.
\]
At this point, the easiest approach is to simply get a computer to evaluate this for the various values of N, at which point we see that the maximum is attained by N = 3.
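Here is a minimal sketch of that computer calculation, evaluating g(N) for every N from 0 to 10 and reporting the maximizer.

```python
from math import comb

def pmf(i, n=10, p=1/3):
    """P(X = i) for the binomial demand X ~ B(n, p)."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

def g(N):
    """Expected profit 0.15 * E[min(X, N)] - N/10 when buying N papers."""
    e_sold = sum(min(i, N) * pmf(i) for i in range(11))
    return 0.15 * e_sold - N / 10

best = max(range(11), key=g)
print(best, g(best))  # best = 3
```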

Question 35 - Note that if you draw two marbles randomly from the box, then they are of the same color with probability p = 4/9. Letting X be the random variable representing the amount you win, we see that P(X = 1.1) = p = 1 − P(X = −1), and so
\[
E[X] = 1.1p - (1 - p) = 2.1p - 1 = -\frac{6}{90},
\]
\[
E[X^2] = (1.1)^2 p + (-1)^2 (1 - p) = 1 + 0.21p = \frac{82}{75},
\]
\[
\operatorname{Var}(X) = E[X^2] - (E[X])^2 = \frac{82}{75} - \left(\frac{6}{90}\right)^2 \approx 1.089.
\]


Question 38 - Using the given values E[X] = 1 and Var(X) = 5, note that we can write
\[
E[(2 + X)^2] = E[4 + 4X + X^2] = 4 + 4E[X] + \operatorname{Var}(X) + (E[X])^2, \qquad \operatorname{Var}(4 + 3X) = 9\operatorname{Var}(X)
\]
by properties of the expectation and variance, and so it follows that a) E[(2 + X)²] = 14 and b) Var(4 + 3X) = 45. (For the first part one can also apply the variance formula directly to the random variable Y = 2 + X to obtain the result.)

Question 40 - Recognize that the random variable X corresponding to the number of correct answers obtained by guessing is distributed as a B(5, 1/3) random variable, and therefore the desired probability is
\[
P(X \ge 4) = \binom{5}{4}\left(\frac{1}{3}\right)^4 \frac{2}{3} + \left(\frac{1}{3}\right)^5 = \frac{11}{243}.
\]

Question 42 - Let X be the number of questions correctly answered by both A and B, and Y be the number of questions correctly answered by at least one of A and B. For an individual question, as A and B answer correctly independently of each other, the probability that they both get the correct answer is (4/10) · (7/10) = 7/25. By the inclusion-exclusion formula, the probability that at least one of A and B answers a question correctly is 4/10 + 7/10 − 7/25 = 41/50. As A and B answer each question correctly independently of their performances on the other questions, X is distributed as a B(10, 7/25) random variable and Y as a B(10, 41/50) random variable. Therefore

a) E[X] = 10 · 7/25 = 2.8,

b) Var(Y) = 10 · (41/50) · (9/50) = 1.476.


GU4203 - Introduction to Probability

Homework 6 Solutions

1 Problems

Question 49 - a) We know that for coin 1 the number of heads is B(10, 0.4) distributed and for coin 2 it is B(10, 0.7) distributed, and so
\[
P(\text{exactly 7 heads}) = P(\text{7 heads} \mid \text{coin 1})P(\text{coin 1}) + P(\text{7 heads} \mid \text{coin 2})P(\text{coin 2}) = \frac{1}{2}\binom{10}{7}\left(0.4^7 0.6^3 + 0.7^7 0.3^3\right).
\]
b) The first flip is heads with probability (0.4 + 0.7)/2 = 0.55. For the numerator we need the probability that the first flip is heads and exactly six of the last nine flips are heads; conditioning on the coin (under the mixture the flips are not unconditionally independent, since they share the same coin), this is
\[
\frac{1}{2}\binom{9}{6}\left(0.4 \cdot 0.4^6 0.6^3 + 0.7 \cdot 0.7^6 0.3^3\right) = \frac{1}{2}\binom{9}{6}\left(0.4^7 0.6^3 + 0.7^7 0.3^3\right).
\]
Therefore the desired probability is
\[
P(\text{exactly 7 heads} \mid \text{first flip heads}) = \frac{\frac{1}{2}\binom{9}{6}\left(0.4^7 0.6^3 + 0.7^7 0.3^3\right)}{0.55}.
\]
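Because the exponents in the numerator are easy to get wrong here, the following minimal sketch evaluates both parts numerically.

```python
from math import comb

def binom_pmf(k, n, q):
    return comb(n, k) * q**k * (1 - q)**(n - k)

# a) Equal mixture of B(10, 0.4) and B(10, 0.7).
a = 0.5 * (binom_pmf(7, 10, 0.4) + binom_pmf(7, 10, 0.7))

# b) P(7 heads | first flip heads): the numerator conditions on the coin,
# since under the mixture the flips are not unconditionally independent.
num = 0.5 * (0.4 * binom_pmf(6, 9, 0.4) + 0.7 * binom_pmf(6, 9, 0.7))
b = num / 0.55

print(a, b)
```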

Question 52 - As the number of plane crashes is small relative to the total number of flights, an appropriate approximate distribution for the number of plane crashes per month, say X, is a Poisson(3.5) distribution. Therefore the answer to a) is
\[
P(X \ge 2) = 1 - P(X \le 1) = 1 - 4.5e^{-3.5},
\]
and the answer to b) is P(X ≤ 1) = 4.5e^{−3.5}.

Question 54 - We may assume that the number of cars abandoned weekly on the highway, say X, is approximately Poisson(2.2) distributed. Therefore

a) P(X = 0) = e^{−2.2},

b) P(X ≥ 2) = 1 − P(X ≤ 1) = 1 − 3.2e^{−2.2}.


Question 56 - This is the birthday problem from earlier in the course, although now we use a Poisson approximation to estimate the number of people required. Let X be the number of people, out of n randomly selected, who have the same birthday as you; although X is actually distributed as a B(n, 1/365), we approximate it by a Poisson(n/365). Therefore, the probability that at least one person has the same birthday as you is approximately 1 − e^{−n/365}; this is greater than one half whenever n ≥ 365 log(2) ≈ 253, i.e. whenever n ≥ 253. (The exact binomial calculation gives the same threshold of 253; the famous answer of 23 belongs to the different problem of some pair of people in the room sharing a birthday.)
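The following minimal sketch compares the Poisson threshold with the exact binomial one.

```python
from math import log, ceil, exp

# Smallest n with P(someone shares your birthday) > 1/2, under the
# Poisson(n/365) approximation and under the exact binomial model.
n_poisson = ceil(365 * log(2))            # 253
n_exact = ceil(log(2) / -log(364 / 365))  # also 253

print(n_poisson, n_exact)
print(1 - exp(-n_poisson / 365))          # just above 0.5
print(1 - (364 / 365) ** n_exact)         # just above 0.5
```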

Question 63 - As people enter the casino at a rate of 2.5 per 5 minutes, the number of people who enter the casino in the given 5-minute period (say X) is distributed as a Poisson(2.5). Therefore the quantities of interest are

a) P(X = 0) = e^{−2.5},

b) P(X ≥ 4) = 1 − P(X ≤ 3) = 1 − e^{−2.5}\left(1 + 2.5 + \frac{2.5^2}{2} + \frac{2.5^3}{3!}\right).

Question 71 - Smith's probability of winning on a given roulette spin is 12/38. Therefore, as the roulette spins are independent, the probability that he loses five in a row is (26/38)^5, and the probability that he wins for the first time on his fourth bet (i.e. he has the (W)in/(L)oss pattern LLLW) is \frac{12}{38}\left(\frac{26}{38}\right)^3.

Question 75 - Let Y be the number of coin flips required until 10 heads have been obtained, so that X = Y − 10, and we can first find the probability mass function of Y. For y ≥ 10,
\[
P(Y = y) = P(\text{10 heads, } y - 10 \text{ tails, last flip heads}) = \binom{y-1}{9}\left(\frac{1}{2}\right)^{10}\left(\frac{1}{2}\right)^{y-10} = \binom{y-1}{9}\left(\frac{1}{2}\right)^{y},
\]
and therefore, for x ≥ 0,
\[
P(X = x) = \binom{x+9}{9}\left(\frac{1}{2}\right)^{x+10}.
\]

Question 78 - The probability of getting exactly two black and two white balls is \binom{4}{2}\binom{4}{2}/\binom{8}{4} = 18/35 (as all selections of four balls are equally likely). As we repeat the process until one success, and subsequent re-draws from the urn are independent of each other, the total number of selections X is distributed as a Geometric(18/35), and so the probability of interest is
\[
P(X = n) = \left(\frac{17}{35}\right)^{n-1} \frac{18}{35}.
\]

Question 79 - a) If we pick 0 defective items, then we must have chosen all 10 items from the 94 non-defective ones, and so the probability that X = 0 is \binom{94}{10}/\binom{100}{10}.

b) We know that P(X > 2) = 1 − P(X = 0) − P(X = 1) − P(X = 2). Therefore, using the counting arguments which you should be used to employing by now, we know that this


probability is given by
\[
1 - \frac{\binom{94}{10}}{\binom{100}{10}} - \frac{\binom{94}{9}\binom{6}{1}}{\binom{100}{10}} - \frac{\binom{94}{8}\binom{6}{2}}{\binom{100}{10}}.
\]

Question 82 - As the lots are sampled randomly, the number of defective transistors in the sample, X, is distributed as a B(4, 0.1) random variable. Therefore
\[
P(\text{rejected}) = 1 - P(\text{none of the 4 faulty}) = 1 - (0.9)^4.
\]

Question 84 - a) Let X_i be the indicator of whether the i-th box contains no balls (X_i = 1) or not (X_i = 0). Then the expected number of boxes which contain no balls is given by
\[
E\left[\sum_{i=1}^{5} X_i\right] = \sum_{i=1}^{5} E[X_i] = \sum_{i=1}^{5} P(X_i = 1) = \sum_{i=1}^{5} (1 - p_i)^{10}.
\]
b) Now let Y_i be the indicator of whether the i-th box contains exactly one ball. Then the expected number of boxes which contain exactly one ball is given by
\[
E\left[\sum_{i=1}^{5} Y_i\right] = \sum_{i=1}^{5} E[Y_i] = \sum_{i=1}^{5} P(Y_i = 1) = \sum_{i=1}^{5} \binom{10}{1} p_i (1 - p_i)^{9}.
\]

Question 85 - Let X_i be the indicator of whether the i-th type of coupon appears at least once in the set of n coupons. Then, using similar ideas and reasoning to the above, the expected number of distinct types of coupons which appear in the set is given by
\[
E\left[\sum_{i=1}^{k} X_i\right] = \sum_{i=1}^{k} E[X_i] = \sum_{i=1}^{k} P(X_i = 1) = \sum_{i=1}^{k} \left(1 - (1 - p_i)^n\right).
\]
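This indicator-variable formula is straightforward to evaluate numerically; the coupon probabilities below are hypothetical values chosen only for illustration.

```python
def expected_distinct_types(n, ps):
    """E[number of distinct coupon types] among n coupons, where a coupon
    is of type i with probability ps[i]."""
    return sum(1 - (1 - p) ** n for p in ps)

# Hypothetical type probabilities (not from the textbook):
print(expected_distinct_types(10, [0.4, 0.3, 0.2, 0.1]))
```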


GU4203 - Introduction to Probability

Homework 7 Solutions

1 Problems

Question 4 - a) This is simply computing the integral
\[
P(X > 20) = \int_{20}^{\infty} \frac{10}{x^2}\,dx = \frac{1}{2}.
\]
b) As f(x) = 0 for x ≤ 10, F(y) = 0 for y ≤ 10. For y ≥ 10, we then have that
\[
F(y) = \int_{10}^{y} \frac{10}{x^2}\,dx = 1 - \frac{10}{y}.
\]
c) The probability that a device functions for at least 15 hours is 1 − F(15) = 2/3. Assuming that the lifetimes of the different devices are independent of each other, the probability being asked for is P(Y ≥ 3), where Y ~ B(6, 2/3), and so the probability is
\[
\sum_{i=3}^{6} \binom{6}{i}\left(\frac{2}{3}\right)^i \left(\frac{1}{3}\right)^{6-i}.
\]

Question 7 - As f is a density function (with f(x) = a + bx² on [0, 1]), we know that
\[
\int_0^1 f(x)\,dx = 1 \iff a + \frac{b}{3} = 1,
\]
and as E[X] = 3/5, we know that
\[
\int_0^1 x f(x)\,dx = \frac{3}{5} \iff \frac{a}{2} + \frac{b}{4} = \frac{3}{5}.
\]
Solving for a and b gives a = 3/5 and b = 6/5.

Question 11 - Firstly, note that as we are only concerned with the ratio of the shorter to the longer segment, without loss of generality we may assume that L = 1. Let X ~ U[0, 1]. As this ratio is given by min{X/(1−X), (1−X)/X} (consider the cases X > 1/2 and X ≤ 1/2), we are interested in
\[
P\left(\min\left\{\tfrac{X}{1-X}, \tfrac{1-X}{X}\right\} < \tfrac{1}{4}\right) = 1 - P\left(\tfrac{X}{1-X} > \tfrac{1}{4},\ \tfrac{1-X}{X} > \tfrac{1}{4}\right) = 1 - P\left(\tfrac{1}{5} < X < \tfrac{4}{5}\right) = \frac{2}{5}.
\]


Question 13 - a) Let X ~ U[0, 30]. Then the desired probability is simply P(X > 10) = 2/3.

b) The probability we are interested in is
\[
P(X > 25 \mid X > 15) = \frac{P(X > 25)}{P(X > 15)} = \frac{1/6}{1/2} = \frac{1}{3}.
\]

Question 18 - Suppose that X = 5 + σZ where Z ~ N(0, 1); we want to find the approximate value of σ². Now, as
\[
0.2 = P(X > 9) = P\left(Z > \frac{4}{\sigma}\right)
\]
and we know from the normal tables that P(Z < 0.84) ≈ 0.8, it follows that 4/σ ≈ 0.84, so σ ≈ 4.76 and σ² ≈ 22.66.

Question 20 - Let X be the number of people in the sample who are in favor of the proposed tax rise, so X ~ B(100, 65/100); thus X has mean 65 and standard deviation approximately 4.77. Letting Z ~ N(0, 1), we know that X ≈ 65 + 4.77Z in distribution. Therefore we find (applying the continuity correction each time) that

a) P(X ≥ 50) = P(X > 49.5) ≈ P(Z > −3.25) ≈ 0.9994,

b) P(60 ≤ X ≤ 70) = P(59.5 ≤ X ≤ 70.5) ≈ P(−1.15 ≤ Z ≤ 1.15) = 2P(Z ≤ 1.15) − 1 ≈ 0.75,

c) P(X < 75) = P(X ≤ 74.5) ≈ P(Z ≤ 1.99) ≈ 0.977.
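If you prefer to avoid the normal tables, the same three approximations can be computed with scipy; this is a sketch of the normal approximation, not of the exact binomial.

```python
from scipy.stats import norm

mu = 65
sigma = (100 * 0.65 * 0.35) ** 0.5  # approximately 4.77

def z(x):
    return (x - mu) / sigma

print(1 - norm.cdf(z(49.5)))                  # a) ~0.9994
print(norm.cdf(z(70.5)) - norm.cdf(z(59.5)))  # b) ~0.75
print(norm.cdf(z(74.5)))                      # c) ~0.977
```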

Question 23 - For the first part of the question, we approximate X ~ B(1000, 1/6) by a \frac{1000}{6} + \sqrt{\frac{5000}{36}}Z random variable, where Z ~ N(0, 1). The probability of interest is then (after applying the continuity correction)
\[
P(149.5 \le X \le 200.5) \approx P(-1.46 \le Z \le 2.87) = \Phi(2.87) + \Phi(1.46) - 1 \approx 0.9258.
\]
For the second part of the question, we approximate X ~ B(800, 1/5) by a 160 + \sqrt{128}Z random variable, where Z ~ N(0, 1). This time the probability of interest is (again after applying the continuity correction)
\[
P(X \le 149.5) \approx P(Z \le -0.93) = 1 - \Phi(0.93) \approx 0.1762.
\]

Question 25 - Let X denote the number of unacceptable items among the next 150 produced, so X ~ B(150, 0.05). We can approximate this by a 7.5 + \sqrt{7.125}Z random variable, where Z ~ N(0, 1), and so the desired probability is
\[
P(X \le 10) = P(X \le 10.5) \approx P(Z < 1.1239) \approx 0.8695.
\]

Question 27 - If the coin were fair, the number of heads obtained from 10,000 flips would be X ~ B(10000, 1/2), which we can approximate as a 5000 + 50Z random variable, where Z ~ N(0, 1). Now, note that as
\[
P(X \ge 5800) = P(X \ge 5799.5) \approx P(Z > 15.99) \le 10^{-50},
\]


it is incredibly unlikely that we would observe this many heads if the coin were fair, and therefore fairness seems like an unreasonable assumption.

Question 29 - Let X be the number of the 1000 time periods for which the stock price increases. Then, if s_0 is the starting stock price, the stock price after the 1000 time periods is
\[
s_0 u^X d^{1000 - X} = s_0 d^{1000}\left(\frac{u}{d}\right)^X.
\]
This is greater than 1.3s_0 if and only if (after some rearranging and substituting in values) X ≥ 469.2. As X is an integer, we really need the probability that X ≥ 470. Now, as X ~ B(1000, 0.52) can be approximated by a 520 + \sqrt{249.6}Z random variable, where Z ~ N(0, 1), we find (with the continuity correction) that
\[
P(X \ge 469.5) \approx P(Z \ge -3.196) \approx 0.9993.
\]

Question 32 - Let X ~ Exp(1/2), so that

a) P(X > 2) = e^{−1},

b) P(X > 10 | X > 9) = P(X > 1) = e^{−1/2}, by the memoryless property.

Question 34 - If X ~ Exp(1/20), then the probability of interest is
\[
P(X > 30 \mid X > 10) = P(X > 20) = e^{-1}
\]
by the memoryless property; if instead X ~ U[0, 40], then we have that
\[
P(X > 30 \mid X > 10) = \frac{P(X > 30)}{P(X > 10)} = \frac{1/4}{3/4} = \frac{1}{3}.
\]

Question 38 - Firstly, recall that the roots of the equation are both real if and only if the discriminant is non-negative, that is (after some simplification), Y² ≥ Y + 2. Now, as Y ∈ (0, 5), this condition is equivalent to Y ≥ 2, and so the desired probability is P(Y ≥ 2) = 3/5.

Question 40 - As e^x is a strictly increasing function, we know that for 1 ≤ y ≤ e,
\[
F_Y(y) = P(Y \le y) = P(X \le \log y) = \log y \implies f_Y(y) = \frac{d}{dy} \log y = \frac{1}{y}.
\]


GU4203 - Introduction to Probability

Homework 8 Solutions

1 Problems

Question 1 - Let p(i, j) = P(X = i, Y = j). I trust that at this point you can simply count the number of possibilities, so I'm just going to give the numerical answers as follows:

a) p(1, 2) = p(2, 4) = p(3, 6) = p(4, 8) = p(5, 10) = p(6, 12) = 1/36; p(2, 3) = p(3, 4) = p(4, 5) = p(4, 6) = p(5, 6) = p(4, 7) = p(5, 7) = p(6, 7) = p(5, 8) = p(6, 8) = p(5, 9) = p(6, 9) = p(6, 10) = p(6, 11) = 2/36; and p(i, j) = 0 otherwise.

b) p(i, j) = 1/36 if i < j, and p(i, i) = i/36 for 1 ≤ i ≤ 6.

c) p(i, j) = 2/36 if i < j, and p(i, i) = 1/36 for 1 ≤ i ≤ 6.

Question 2 - a) Let p(i, j) = P(X_1 = i, X_2 = j). Then we know that
\[
p(0, 0) = \frac{8}{13} \cdot \frac{7}{12} = \frac{56}{156}, \qquad p(0, 1) = p(1, 0) = \frac{8 \cdot 5}{13 \cdot 12} = \frac{40}{156}, \qquad p(1, 1) = \frac{5}{13} \cdot \frac{4}{12} = \frac{20}{156}.
\]
b) Now let p(i, j, k) = P(X_1 = i, X_2 = j, X_3 = k). Then we know that
\[
p(0, 0, 0) = \frac{8 \cdot 7 \cdot 6}{13 \cdot 12 \cdot 11} = \frac{28}{143}, \qquad p(1, 1, 1) = \frac{5 \cdot 4 \cdot 3}{13 \cdot 12 \cdot 11} = \frac{5}{143},
\]
\[
p(1, 0, 0) = p(0, 1, 0) = p(0, 0, 1) = \frac{8 \cdot 7 \cdot 5}{13 \cdot 12 \cdot 11} = \frac{70}{429},
\]
\[
p(1, 1, 0) = p(1, 0, 1) = p(0, 1, 1) = \frac{8 \cdot 5 \cdot 4}{13 \cdot 12 \cdot 11} = \frac{40}{429}.
\]

Question 7 - From the setup of the question, we know that X_1 and X_2 are independent and identically distributed 'Geometric(p) − 1' random variables, and so their joint mass function is simply the product of the individual mass functions:
\[
f_{X_1, X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2) = p^2 (1 - p)^{x_1 + x_2} \quad \text{for } x_1, x_2 \ge 0.
\]

Question 11 - This is simply a multinomial probability, and so the answer is given by
\[
\binom{5}{2, 1, 2} \cdot 0.45^2 \cdot 0.15 \cdot 0.4^2.
\]


Question 14 - Let X, Y ~ U(0, L) be independent. The probability we are interested in is, for 0 < a < L,
\[
P(|X - Y| < a) = P(Y < X < Y + a) + P(X < Y < X + a) = \frac{2}{L^2} \int_0^L \int_y^{\min\{y+a,\,L\}} 1\,dx\,dy
\]
\[
= \frac{2}{L^2}\left(\int_0^{L-a} \int_y^{y+a} 1\,dx\,dy + \int_{L-a}^{L} \int_y^{L} 1\,dx\,dy\right) = \frac{2}{L^2}\left(a(L-a) + \frac{a^2}{2}\right) = \frac{a}{L}\left(2 - \frac{a}{L}\right).
\]

Question 15 - a) This arises simply as a consequence of
\[
1 = \int\!\!\int f(x, y)\,dx\,dy = \int\!\!\int_{(x, y) \in R} c\,dx\,dy = c \cdot (\text{area of region } R).
\]
b) This is because we can factor f_{X,Y}(x, y) as follows:
\[
f_{X,Y}(x, y) = \frac{1}{4}\mathbb{1}[-1 \le x, y \le 1] = \left(\frac{1}{2}\mathbb{1}[-1 \le x \le 1]\right)\left(\frac{1}{2}\mathbb{1}[-1 \le y \le 1]\right) = f_X(x) f_Y(y),
\]
so X and Y are independent U[−1, 1] random variables.

c) This is simply the area of a circle of radius 1 times 1/4, so the probability is π/4.

Question 16 - a, b) As all of the points lie in the same semicircle if and only if they lie in a semicircle starting at some P_i, A is the union of the A_i. Furthermore, if we order the P_i by their angle (with respect to the positive real axis, say, after centering the circle at the origin), we see that all of the points cannot lie in two such semicircles simultaneously. Therefore the A_i are mutually exclusive.

c) Given parts a) and b), we therefore know that
\[
P(A) = \sum_{i=1}^{n} P(A_i) = \sum_{i=1}^{n} \left(\frac{1}{2}\right)^{n-1} = n\left(\frac{1}{2}\right)^{n-1},
\]
where to calculate P(A_i) we note that each of the other points lies in the given semicircle or not, independently of the others, each with probability 1/2.

Question 17 - As all of the points are equally likely to be the middle point, the probability is 1/3.

Question 21 - a) Firstly, we note that f is non-negative everywhere, so it remains to check that f integrates to 1 over R². Indeed,
\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = \int_0^1 \int_0^{1-y} 24xy\,dx\,dy = \int_0^1 12y(1-y)^2\,dy = 12\int_0^1 \left(y - 2y^2 + y^3\right)dy = 12\left(\frac{1}{2} - \frac{2}{3} + \frac{1}{4}\right) = 1.
\]


b) To compute the expectation, we note that
\[
E[X] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f(x, y)\,dx\,dy = \int_0^1 \int_0^{1-x} 24x^2 y\,dy\,dx = 12\int_0^1 x^2(1-x)^2\,dx = 12\int_0^1 \left(x^2 - 2x^3 + x^4\right)dx = \frac{2}{5}.
\]
c) As f(x, y) is symmetric in x and y, E[Y] = E[X] = 2/5.

Question 22 - a) No, as we cannot write f(x, y) as a product of a function of x and a function of y.

b) The density function of X is given by
\[
f_X(x) = \int_0^1 f(x, y)\,dy = \int_0^1 (x + y)\,dy = x + \frac{1}{2}, \qquad 0 < x < 1.
\]
c) The probability is given by
\[
P(X + Y < 1) = \int_0^1 \int_0^{1-x} (x + y)\,dy\,dx = \int_0^1 \left(x(1-x) + \frac{(1-x)^2}{2}\right)dx = \frac{1}{3}.
\]

Question 27 - We compute the c.d.f. of X_1/X_2:
\[
P\left(\frac{X_1}{X_2} < a\right) = \int_0^{\infty} \int_0^{ay} \lambda_1 e^{-\lambda_1 x}\,\lambda_2 e^{-\lambda_2 y}\,dx\,dy = \int_0^{\infty} \left(1 - e^{-\lambda_1 a y}\right)\lambda_2 e^{-\lambda_2 y}\,dy = 1 - \frac{\lambda_2}{\lambda_2 + a\lambda_1} = \frac{\lambda_1 a}{\lambda_2 + \lambda_1 a}.
\]
We then see that
\[
P(X_1 < X_2) = P\left(\frac{X_1}{X_2} < 1\right) = \frac{\lambda_1}{\lambda_2 + \lambda_1}.
\]
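A quick Monte Carlo check of the closed form, with illustrative rates λ1 = 2 and λ2 = 3 (these values are assumptions for the demonstration only):

```python
import random

def p_ratio_less(a, lam1, lam2, trials=200_000):
    """Monte Carlo estimate of P(X1/X2 < a) for independent exponentials."""
    hits = sum(
        random.expovariate(lam1) / random.expovariate(lam2) < a
        for _ in range(trials)
    )
    return hits / trials

lam1, lam2, a = 2.0, 3.0, 1.0
print(p_ratio_less(a, lam1, lam2))   # estimate, close to the line below
print(lam1 * a / (lam2 + lam1 * a))  # closed form: 0.4
```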

Question 28 - a) Let X be the time at which MJ's car is ready, and Y that of AJ's; we want to calculate P(X < Y). In order for this to occur, we must have Y > t (which occurs with probability e^{−t}). Then by the memoryless property, we know that Y | Y > t is also Exp(1) distributed, and so by symmetry P(X < Y | Y > t) = 1/2. Therefore
\[
P(X < Y) = P(X < Y \mid Y < t)P(Y < t) + P(X < Y \mid Y > t)P(Y > t) = 0 + \frac{1}{2}e^{-t} = \frac{1}{2}e^{-t}.
\]
b) As the time X at which MJ's car is ready is Γ(2, 1) distributed (being the sum of two independent Exp(1) random variables), the probability of interest is P(X ≤ 2) = 1 − 3e^{−2}.


GU4203 - Introduction to Probability

Homework 9 Solutions

1 Problems - Chapter 6

Question 24 - a) As N is simply a geometric random variable with parameter 1 − p_0, we know that P(N = n) = p_0^{n-1}(1 − p_0).

b, c) We first note that P(N = n, X = j) = p_j p_0^{n-1}, and so
\[
P(X = j) = \sum_{n=1}^{\infty} P(X = j, N = n) = \sum_{n=1}^{\infty} p_j p_0^{n-1} = \frac{p_j}{1 - p_0}.
\]
It is then simple to verify from the above values that P(N = n, X = j) = P(N = n)P(X = j), so N and X are independent.

d, e) Really, these are questions you should ask yourselves. I think both results are intuitive, but that's me.

Question 35 - a) We want to calculate P(X_1 = 1 | X_2 = 1), as we know that P(X_1 = 0 | X_2 = 1) = 1 − P(X_1 = 1 | X_2 = 1). Now, note that X_1 is independent of X_2, and so
\[
P(X_1 = 1 \mid X_2 = 1) = P(X_1 = 1) = \frac{5}{5 + 8} = \frac{5}{13}.
\]
b) Using the same reasoning as above (that X_1 and X_2 are independent), we find that P(X_1 = 1 | X_2 = 1) = 5/13 = 1 − P(X_1 = 0 | X_2 = 1).

2 Problems - Chapter 7

Question 7 - Let X_i, Y_i and Z_i be indicator random variables which equal 1 if a) both choose, b) neither chooses, and c) exactly one chooses item i (respectively), and which equal 0 otherwise. Further denote X = \sum_{i=1}^{10} X_i, and similarly for Y and Z; as X_i + Y_i + Z_i = 1, we know that X + Y + Z = 10. Furthermore, we know that E[X_i] = P(X_i = 1) = (3/10)² and E[Y_i] = P(Y_i = 1) = (7/10)². Therefore

a) E[X] = \sum_{i=1}^{10} E[X_i] = 10 \cdot \frac{9}{100} = \frac{9}{10},

b) E[Y] = \sum_{i=1}^{10} E[Y_i] = 10 \cdot \frac{49}{100} = \frac{49}{10},

c) E[Z] = 10 - E[X] - E[Y] = \frac{42}{10}.


Question 9 - Let X_j be the indicator variable for whether the j-th urn is empty or not. Then
\[
E[X_j] = P(\text{ball } i \text{ is not in urn } j \text{ for all } i \ge j) = \prod_{i=j}^{n} \left(1 - \frac{1}{i}\right).
\]
Therefore, if X is the number of empty urns,

a) E[X] = \sum_{j=1}^{10} E[X_j] = \sum_{j=1}^{10} \prod_{i=j}^{10} \left(1 - \frac{1}{i}\right),

b) P(X = 0) = P(\text{ball } j \text{ is in urn } j \text{ for all } j) = \prod_{i=1}^{10} \frac{1}{i}.

Question 21 - a) Let X_j be the indicator variable for whether the j-th day is exactly three people's birthday, for 1 ≤ j ≤ 365. Then as
\[
E[X_j] = P(X_j = 1) = \binom{100}{3}\left(\frac{1}{365}\right)^3 \left(\frac{364}{365}\right)^{97},
\]
the expected number of days which are birthdays of exactly three people is
\[
E\left[\sum_{j=1}^{365} X_j\right] = \sum_{j=1}^{365} E[X_j] = 365\binom{100}{3}\left(\frac{1}{365}\right)^3\left(\frac{364}{365}\right)^{97}.
\]
b) Let X_j be the indicator variable for whether the j-th day is someone's birthday (X_j = 1) or not, for 1 ≤ j ≤ 365. Then as E[X_j] = P(X_j = 1) = 1 − P(X_j = 0) = 1 − (364/365)^{100}, the expected number of distinct birthdays is given by
\[
E\left[\sum_{j=1}^{365} X_j\right] = \sum_{j=1}^{365} E[X_j] = 365\left(1 - \left(\frac{364}{365}\right)^{100}\right).
\]

Question 33 - a) E[(2 + X)²] = Var(2 + X) + (E[2 + X])² = Var(X) + (2 + E[X])² = 5 + 3² = 14,

b) Var(4 + 3X) = 9 Var(X) = 45.

Question 37 - Let W_i denote the outcome of the i-th roll of the die, so X = W_1 + W_2 and Y = W_1 − W_2. Then
\[
\operatorname{Cov}(X, Y) = \operatorname{Cov}(W_1 + W_2, W_1 - W_2) = \operatorname{Cov}(W_1, W_1) - \operatorname{Cov}(W_2, W_2) = 0,
\]
where the cross terms vanish by independence and Cov(W_1, W_1) = Cov(W_2, W_2) as W_1 and W_2 are identically distributed.

Question 57 - Let N be the number of accidents in a given week, and X_j be the number of workers injured in the j-th accident, for 1 ≤ j ≤ N. Then as N and the X_j are independent, and the X_j have a common mean, we have that
\[
E\left[\sum_{i=1}^{N} X_i\right] = E[N]\,E[X_1] = 5 \cdot 2.5 = 12.5.
\]

Question 58 - a) Let X denote the number of coin flips required until both heads and tails have appeared. To approach this, we condition on the face of the first flipped coin, so
\[
E[X] = E[X \mid \text{first coin heads}]P(\text{first coin heads}) + E[X \mid \text{first coin tails}]P(\text{first coin tails}) = p\,E[X \mid \text{first coin heads}] + (1-p)\,E[X \mid \text{first coin tails}].
\]


Now, X given that the first coin is heads is distributed as 1 + Geometric(1 − p) (once the first coin is heads, we count the flips until the first tails arises), and similarly X given that the first coin is tails is distributed as 1 + Geometric(p). It therefore follows that
\[
E[X] = p\left(1 + \frac{1}{1-p}\right) + (1-p)\left(1 + \frac{1}{p}\right) = 1 + \frac{p}{1-p} + \frac{1-p}{p}.
\]
b) As the last flip is independent of all the previous flips, and the probability of coming up heads on each flip is p, the desired probability is p.
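A short simulation corroborates the conditioning argument in part a); p = 0.3 is an arbitrary illustrative value.

```python
import random

def avg_flips_until_both(p, trials=100_000):
    """Average number of flips until both heads and tails have appeared."""
    total = 0
    for _ in range(trials):
        first = random.random() < p  # True means the first flip is heads
        n = 1
        while (random.random() < p) == first:
            n += 1  # still matching the first face
        total += n + 1  # count the flip that finally differs
    return total / trials

p = 0.3  # illustrative value
print(avg_flips_until_both(p))
print(1 + p / (1 - p) + (1 - p) / p)  # exact answer, about 3.762
```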


10 Exam Review

Go back to Table of Contents. Please click TOC.

A few words about exam-taking philosophy and TA responsibility.

Exam-taking Philosophy. Coming from an environment that promotes a test-bank strategy (a strategy based on the quantity of tests students take), and currently trained in an environment that promotes a test-type strategy (the opposite kind of strategy, based on the types of questions students practice), I think all strategies can be laid out on the following spectrum:

Left ⇐ · · · ⇒ Right
Test-bank Strategy ⇐ · · · ⇒ Test-type Strategy
a.k.a. Focus on Quantity ⇐ · · · ⇒ Focus on Quality

That being said, a student can be on the left, the right, or the middle of the spectrum. To cover this spectrum (i.e. to fulfill everyone's needs as much as possible), I will point out the guidelines first (covering what types of problems are fair game on the exam). Then I will lead a discussion of a few problems of each type. There are more problems students can do; however, I will have to leave those to students, as the class has a limited amount of time.

10.1 1st Midterm

The following notions are important for this midterm.

1. Chapter 1. Counting Principle, Permutation, Combinations, Binomial Theorem (and its Propositions).
2. Chapter 2. Sample Space, Union, Intersection, Complement, Mutually Exclusive, Inclusion-Exclusion Identity.
3. Chapter 3. Conditional Probability, Multiplication Rule of Probability, Bayes's Formula, Denominator of Bayes's Formula by Law of Total Probability.

Permutation
There are n people sitting in a row, and this gives us n! different arrangements. On top of this, the problem can build up premises.

A classical example (Homework #1) is the following. Consider delegates from 10 countries, with R, F, E, U and the remaining 6 countries sitting in a row. We want to satisfy two premises: (1) F and E are sitting together, i.e. FE or EF; (2) R and U are not sitting together.

Answer. The key is to work out (1) first and then subtract the complement of (2) to obtain the answer.

(1) We want FE- - - - - - - -, - FE - - - - - - -, ..., - - - - - - - - FE, so there are 9 possible placements, with FE and EF being different arrangements. The remaining 8 countries may sit in any order. That is, (2)(9)(8!) possibilities.

(2) We want the complement of (2). Premise (2) says R and U are not sitting together; the complement of this event is the situation where R and U are indeed sitting together. Satisfying (1) at the same time, we treat FE (or EF) and RU (or UR) each as a block, which gives 2 · 2 internal orderings, while the two blocks together with the remaining 6 countries may sit in 8! different arrangements.


Thus, the final answer is (1) minus the complement of (2). This gives us
\[
(2)(9)(8!) - (2)(2)(8!) = 725760 - 161280 = 564480.
\]

Another classical example is the grid problem from the homework and also the textbook. Consider a grid of size 3 by 4. We want to move from A to B, which requires 3 ups and 4 rights; it does not matter which move comes first.

Answer. We have
\[
\binom{3 + 4}{3} = \binom{7}{3} = \binom{7}{4} = 35.
\]

Positive Solution
Given an equation in the form of
\[
\sum_{i=1}^{r} x_i = n,
\]
you need to be familiar with how to solve this type of problem. Be aware of both premises: (1) assume positive solutions, and (2) assume non-negative solutions for the x_i's.

You should recall the formula
\[
\binom{n - 1}{r - 1}
\]
for positive integer-valued vectors (x_1, ..., x_r), and
\[
\binom{n + r - 1}{r - 1}
\]
for non-negative integer-valued vectors (x_1, ..., x_r).

A classical example is from Homework 1. Given 8 identical blackboards to be distributed among 4 schools, we want to find out:

(a) How many distributions are possible? This is
\[
X_1 + X_2 + X_3 + X_4 = 8
\]
while allowing "0", so we have

Answer.
\[
\binom{8 + 4 - 1}{4 - 1} = \binom{11}{3} = 165.
\]
(b) How many if at least 1 is distributed to each school? This does not allow "0". Hence, we have

Answer.
\[
\binom{8 - 1}{4 - 1} = \binom{7}{3} = 35.
\]
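Both counts can be confirmed in one line each with math.comb; this is a minimal sketch of the two formulas above.

```python
from math import comb

n, r = 8, 4  # 8 blackboards distributed among 4 schools

print(comb(n + r - 1, r - 1))  # non-negative solutions: C(11, 3) = 165
print(comb(n - 1, r - 1))      # positive solutions:     C(7, 3)  = 35
```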


Bayes' Formula
Bayes' Formula is definitely within the scope of this exam. You should be familiar with all sorts of formulas in this arena. Let us recall the definition of conditional probability,
\[
P(E \mid F) = \frac{P(EF)}{P(F)}.
\]
Please also be aware of the identity (the law of total probability)
\[
P(E) = P(E \mid F)P(F) + P(E \mid F^c)P(F^c).
\]
You should also be aware of the related formulas, as one can always derive Bayes' formula in fancy ways depending on the problem.

A classical example is from Homework 3. An ectopic pregnancy is twice as likely to develop when the pregnant woman is a smoker as when she is a non-smoker. If 32 percent of women of childbearing age are smokers, what percentage of women having ectopic pregnancies are smokers?

Answer. Let E be the event that a pregnant woman has an ectopic pregnancy, and S be the event that she is a smoker. We know that P(E|S) = 2P(E|S^c) and P(S) = 0.32. Then by Bayes' theorem, we find
\[
P(S \mid E) = \frac{P(E \mid S)P(S)}{P(E \mid S)P(S) + P(E \mid S^c)P(S^c)} = \frac{2P(E \mid S^c)(0.32)}{2P(E \mid S^c)(0.32) + P(E \mid S^c)(1 - 0.32)} = \frac{0.64}{0.64 + 0.68} = \frac{0.64}{1.32} = \frac{32}{66}.
\]

One can also refer to another problem in Homework 3. A total of 48 percent of the women and 37 percent of the men who took a certain "quit smoking" class remained nonsmokers for at least one year after completing the class. These people then attended a success party at the end of the year. If 62 percent of the original class was male,

1. what percentage of those attending the party were women?
2. what percentage of the original class attended the party?

Answer. We answer the questions in turn.

1. Let A be the event that a person attends the party, W be the event that the person is a woman, and M = W^c be the event that this person is a man. Then by Bayes' theorem,
\[
P(W \mid A) = \frac{P(A \mid W)P(W)}{P(A \mid W)P(W) + P(A \mid M)P(M)} = \frac{0.48 \times 0.38}{0.48 \times 0.38 + 0.37 \times 0.62} \approx 0.44.
\]


2. By the law of total probability, we have that
\[
P(A) = P(A \mid W)P(W) + P(A \mid M)P(M) = 0.48 \times 0.38 + 0.37 \times 0.62 \approx 0.41.
\]

A female chimp gave birth, and it is not certain which of two male chimps is the father. Before genetic analysis, it is believed that the probability that male number 1 is the father is p, and the probability that male number 2 is the father is 1 − p. DNA obtained from the mother, male number 1, and male number 2 indicates that, at one specific location of the genome, the mother has the gene pair (A,A), male number 1 has the gene pair (a,a), and male number 2 has the gene pair (A,a). If a DNA test shows that the baby chimp has the gene pair (A,a), what is the probability that male number 1 is the father?

Answer. Let M_i be the event that male number i is the father, and let B_{A,a} be the event that the baby chimp has the gene pair (A,a). Then P(M_1 | B_{A,a}) is obtained:
\[
P(M_1 \mid B_{A,a}) = \frac{P(M_1 B_{A,a})}{P(B_{A,a})} = \frac{P(B_{A,a} \mid M_1)P(M_1)}{P(B_{A,a} \mid M_1)P(M_1) + P(B_{A,a} \mid M_2)P(M_2)} = \frac{1 \cdot p}{1 \cdot p + (1/2)(1 - p)} = \frac{2p}{1 + p}.
\]

Now let us compare this result with p.

[Figure 3: the graph of 2p/(1 + p) − p, which is positive for 0 < p < 1.]

Hence, we arrive at the inequality
\[
\frac{2p}{1 + p} > p \qquad \text{for } 0 < p < 1.
\]
We conclude that the information that the baby's gene pair is (A,a) increases the probability that male number 1 is the father.


10.2 2nd Midterm

The following notions are important for this midterm.

1. Chapter 1. Counting Principle, Permutation, Combinations, Binomial Theorem (and its Propositions).
2. Chapter 2. Sample Space, Union, Intersection, Complement, Mutually Exclusive, Inclusion-Exclusion Identity.
3. Chapter 3. Conditional Probability, Multiplication Rule of Probability, Bayes's Formula, Denominator of Bayes's Formula by Law of Total Probability.
4. Chapter 4. Density function (PDF), Distribution function (CDF), Expectation (Mean), Variance, Famous distributions (binomial, Poisson, geometric, negative binomial).
5. Chapter 5. Uniform. Normal. Exponential. Memoryless Property. Application of the Memoryless Property.

Recall that for a discrete random variable,
\[
E[X] = \sum_{x : p(x) > 0} x\,p(x).
\]
The expected value of X is a weighted average of the possible values that X can take on, each value being weighted by the probability that X assumes it.

Example 10.2.1. Find E[X], where X is the outcome when we roll a fair die.

Answer. Since p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6, we obtain
\[
E[X] = 1\left(\frac{1}{6}\right) + 2\left(\frac{1}{6}\right) + 3\left(\frac{1}{6}\right) + 4\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 6\left(\frac{1}{6}\right) = \frac{7}{2}.
\]

Example 10.2.2. Calculate Var(X) if X represents the outcome when a fair die is rolled.

Answer. You can easily find E[X] = 7/2. Now, we find
\[
E[X^2] = 1^2\left(\frac{1}{6}\right) + 2^2\left(\frac{1}{6}\right) + 3^2\left(\frac{1}{6}\right) + 4^2\left(\frac{1}{6}\right) + 5^2\left(\frac{1}{6}\right) + 6^2\left(\frac{1}{6}\right) = \frac{91}{6},
\]
and thus we have variance
\[
\operatorname{Var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{35}{12}.
\]

Example 10.2.3. Suppose X is a continuous random variable whose probability density function is
\[
f(x) = \begin{cases} C(4x - 2x^2) & 0 < x < 2 \\ 0 & \text{else.} \end{cases}
\]
1. What is the value of C?
2. Find P(X > 1).

Answer. We have the following.


1. Since f is a probability density function, we must have
\[
\int_{-\infty}^{\infty} f(x)\,dx = 1,
\]
and we can solve C\int_0^2 (4x - 2x^2)\,dx = 1. After integration, we have C\left(2x^2 - \frac{2x^3}{3}\right)\Big|_{x=0}^{2} = 1, which gives the result C = 3/8.

2. P(X > 1) = \int_1^{\infty} f(x)\,dx = \frac{3}{8}\int_1^2 (4x - 2x^2)\,dx = \frac{1}{2}.
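For readers who like to check such integrals symbolically, here is a minimal sympy sketch of both parts.

```python
import sympy as sp

x, C = sp.symbols('x C', positive=True)
f = C * (4 * x - 2 * x**2)

# Solve the normalization condition for C, then compute P(X > 1).
C_val = sp.solve(sp.integrate(f, (x, 0, 2)) - 1, C)[0]
print(C_val)                                      # 3/8
print(sp.integrate(f.subs(C, C_val), (x, 1, 2)))  # 1/2
```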

Example 10.2.4. The amount of time in hours that a computer functions before breaking down is a continuous random variable with probability density function given by
\[
f(x) = \begin{cases} \lambda e^{-x/100} & x \ge 0 \\ 0 & x < 0. \end{cases}
\]
What is the probability that

1. a computer will function between 50 and 150 hours before breaking down?
2. it will function for fewer than 100 hours?

Answer. We solve the parts accordingly.

1. Since 1 = \int_{-\infty}^{\infty} f(x)\,dx = \lambda\int_0^{\infty} e^{-x/100}\,dx, we can integrate and obtain 1 = -\lambda(100)e^{-x/100}\big|_0^{\infty} = 100\lambda. We can solve for λ = 1/100. Then we proceed to find the probability
\[
P(50 < X < 150) = \int_{50}^{150} \frac{1}{100}e^{-x/100}\,dx = -e^{-x/100}\Big|_{50}^{150} = e^{-1/2} - e^{-3/2} \approx 0.383.
\]
2. I will leave this to you as an exercise.

In general, we say that X is a uniform random variable on the interval (α, β) if the probability density function of X is given by
\[
f(x) = \begin{cases} \frac{1}{\beta - \alpha} & \text{if } \alpha < x < \beta \\ 0 & \text{else.} \end{cases}
\]
Since F(a) = \int_{-\infty}^{a} f(x)\,dx, it follows that
\[
F(a) = \begin{cases} 0 & a \le \alpha \\ \frac{a - \alpha}{\beta - \alpha} & \text{if } \alpha < a < \beta \\ 1 & a \ge \beta. \end{cases}
\]

Example 10.2.5. Let X be uniformly distributed over (α, β). Find (a) E[X] and (b) Var(X).

Answer. We proceed accordingly.


1. Compute
\[
E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{\alpha}^{\beta} \frac{x}{\beta - \alpha}\,dx = \frac{\beta^2 - \alpha^2}{2(\beta - \alpha)} = \frac{\beta + \alpha}{2}.
\]
2. To find Var(X), first calculate E[X²]:
\[
E[X^2] = \int_{\alpha}^{\beta} \frac{1}{\beta - \alpha}\,x^2\,dx = \frac{\beta^3 - \alpha^3}{3(\beta - \alpha)} = \frac{\beta^2 + \alpha\beta + \alpha^2}{3}.
\]
Hence,
\[
\operatorname{Var}(X) = \frac{\beta^2 + \alpha\beta + \alpha^2}{3} - \frac{(\alpha + \beta)^2}{4} = \frac{(\beta - \alpha)^2}{12}.
\]

Example 10.2.6. If X is uniformly distributed over (0, 10), calculate the probability that X < 3.

Answer. Compute P(X < 3) = \int_0^3 \frac{1}{10}\,dx = \frac{3}{10}.

We say that X is a normal random variable, or simply that X is normally distributed, with parameters µ and σ² if the density of X is given by
\[
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/2\sigma^2}
\]
for −∞ < x < ∞. The density function is a bell-shaped curve that is symmetric about µ.

Example 10.2.7. Find E[X] and Var(X) when X is a normal random variable with parameters µ and σ².

Answer. Let us start by finding the mean and variance of the standard normal random variable Z = (X − µ)/σ. We have
\[
E[Z] = \int_{-\infty}^{\infty} x f_Z(x)\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x e^{-x^2/2}\,dx = -\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\Big|_{-\infty}^{\infty} = 0.
\]


Thus, integrating by parts with u = x and dv = x e^{-x^2/2}\,dx,
\[
\operatorname{Var}(Z) = E[Z^2] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}}\left(-x e^{-x^2/2}\Big|_{-\infty}^{\infty} + \underbrace{\int_{-\infty}^{\infty} e^{-x^2/2}\,dx}_{=\sqrt{2\pi}}\right) = 1.
\]
Because X = µ + σZ, the preceding yields the results
\[
E[X] = \mu + \sigma E[Z] = \mu \qquad \text{and} \qquad \operatorname{Var}(X) = \sigma^2 \operatorname{Var}(Z) = \sigma^2.
\]

Example 10.2.8. If X, the gain from an investment, is a normal random variable with mean µ and variance σ², then because the loss is equal to the negative of the gain, the VAR of such an investment is the value ν such that
\[
0.01 = P(-X > \nu).
\]
We compute the following:
\[
0.01 = P\left(\frac{-X + \mu}{\sigma} > \frac{\nu + \mu}{\sigma}\right) = 1 - \Phi\left(\frac{\nu + \mu}{\sigma}\right),
\]
and from the table we know Φ(2.33) = 0.99, so (ν + µ)/σ = 2.33. That is, ν = VAR = 2.33σ − µ. Consequently, among a set of investments all of whose gains are normally distributed, the investment having the smallest VAR is the one having the largest value of µ − 2.33σ.

Remark 10.2.9. Please repeat the above analysis for

1. Discrete: Binomial, Poisson, Geometric.
2. Continuous: Uniform, Normal, Exponential.

Remark 10.2.10. Resources:

1. StatLect Website: https://www.statlect.com/probability-distributions/
2. Univariate Distribution Relationships: http://www.math.wm.edu/~leemis/chart/UDR/UDR.html


10.3 Final Exam

The following notions are important for this final exam.

1. Chapter 1. Counting Principle, Permutation, Combinations, Binomial Theorem (and its Propositions).
2. Chapter 2. Sample Space, Union, Intersection, Complement, Mutually Exclusive, Inclusion-Exclusion Identity.
3. Chapter 3. Conditional Probability, Multiplication Rule of Probability, Bayes's Formula, Denominator of Bayes's Formula by Law of Total Probability.
4. Chapter 4. Density function (PDF), Distribution function (CDF), Expectation (Mean), Variance, Famous distributions (binomial, Poisson, geometric, negative binomial).
5. Chapter 5. Uniform. Normal. Exponential. Memoryless Property. Application of the Memoryless Property.
6. Chapter 6. Joint Cumulative Probability Distribution Function, Joint Probability Mass Function.
7. Chapter 7. MGF, Expectation, Variance.
8. Chapter 8. Markov, Chebyshev.

Please be aware of the following:

• The exam is cumulative, from Chapter 1 to Chapter 8.
• Midterms I & II are very good references for the final. For problems on topics discussed in Chapter 5 and before, Midterms I & II provide very good insight.
• For problems on topics discussed in Chapter 6 and after, please refer to the sample exam.
• The yellow "This is important." highlights that appear in the text can be a valuable reference.


Index

Bayes' formula, 12
Bayes' Formula, Bayes' Theorem, Bayes' Rule, 13
Bernoulli random variable, 19
binomial formula, 5
binomial random variable, 19
Binomial Theorem, 6
Central Limit Theorem, 52
Central Limit Theorem for independent random variables, 54
Chebyshev's Inequality, 50
Chernoff Bounds, 55
combination, 5
conditional probabilities, 12
convolution, 37
covariance, 42
cumulative distribution function, 16
DeMorgan's Law, 7
discrete random variable, 16
distinct nonnegative integer-valued solutions, 6
distinct positive integer-valued vectors, 6
expected value, 17
frequentist, 4
Gambler's Ruin, 13
inclusion-exclusion identity, 8
independent, 35
Jensen's Inequality, 55
joint probability mass function, 33
Markov's Inequality, 48
multiplication rule, 11
mutually exclusive, 7, 11
normal random variable, 27, 94
One-sided Chebyshev Inequality, 54
permutation, 4
Poisson probability: application, 21
Poisson random variable, 21
posterior, 4
prior, 4
probability mass function, 16
probability mass functions, 19
random variable, 16
random variables, 15
sample space, 7
standard normal random variable, properties, 28, 94
The Central Limit Theorem, 52
Uniform random variable, 26
variance of a continuous random variable, 25
Weak Law of Large Numbers, 52


References

[1] Ross, S. A First Course in Probability, 9th Edition.
