Continuous Probability Distributions
Ka-fu WONG
23 August 2007
Abstract
In the previous chapter, we noted that many continuous random variables can be approximated by discrete random variables. At the same time, many discrete random variables may be approximated by continuous random variables. Thus, we study continuous probability distributions not just for the sake of understanding continuous random variables, but also for the sake of understanding their approximation to discrete random variables. In many important cases, it turns out that the approximation is good enough, and continuous probability distributions are easier to work with, provided we know enough about them. Among continuous probability distributions, the normal distribution is the most important. The normal distribution will be used over and over again in later chapters.
The most difficult part about continuous probability distributions is understanding their connection with the discrete ones. Once this is done, many results about discrete probability distributions can be easily extended to the continuous case.
A continuous random variable can assume an infinite, uncountable number of values within a given range
or interval(s). Recall that a discrete random variable can also take an infinite number of values. A continuous
random variable differs in that the number of values it can take is uncountable, i.e., it is impossible to list all the
values it can take. For instance, it is not possible to list all the real numbers within the interval [0, 1]. The
probability distributions associated with continuous random variables are logically called continuous probability
distributions.
Example 1 (Continuous random variables): A continuous random variable is a variable that can
assume any value in an interval. Some variables are continuous in nature. For example:
1. The thickness of our Microeconomics textbook.
2. The time required to complete a homework assignment.
3. The temperature of a patient.
4. The distance travelled from my home to school.
Other variables are continuous because they are averages of discrete random variables.
1. Gini coefficient, a measure of inequality in an economy.
2. Unemployment rate.
3. Inflation rate.
4. A stock market index, such as Hang Seng Index, Dow Jones Averages, Nasdaq, and S&P
500.
5. A student’s grade point average.
6. Average age of students in this class.
7. Average weekly working hours of employees.
8. Average hourly salary of students working part-time.
These variables can potentially take on any value, depending only on our ability to measure
and report them accurately.
The above examples suggest that averages are better characterized as continuous rather than discrete
variables even if the underlying variables used for the mean calculations were discrete.
Conceptually, a continuous probability distribution should be very similar to a discrete probability
distribution. After all, the two types of probability distributions may be viewed as approximations of each other.
The main difference is that the probability that a continuous random variable takes a specific value is
zero. However, the probability that a continuous random variable takes a value in an interval [a, b]
can be positive. This difference leads to changes in the definitions and calculations.
1 Features of a Continuous Probability Distribution
Probability distributions may be classified according to the number of random variables they describe.

Number of random variables   Name of the distribution
1                            Univariate probability distribution
2                            Bivariate probability distribution
3                            Trivariate probability distribution
...                          ...
n                            Multivariate probability distribution
These distributions have similar characteristics. We will discuss these characteristics for the univariate
and the bivariate distributions. The extension to the multivariate case is straightforward.
Theorem 1 (Characteristics of a Univariate Continuous Distribution): Suppose the random
variable X is defined on the interval between a and b, i.e., X ∈ [a, b]. That is, X can take any
value in [a, b].

1. The probability that X takes a value in an interval [c, d] is

   P(X ∈ [c, d]) = ∫_c^d f(x) dx

   where f(x) denotes the probability density function of X. Note that the expression has an
   interpretation parallel to the discrete case: f(x)dx may be interpreted as the probability, or
   the area under the density curve, defined in the neighborhood of x, say [x, x + dx].

2. The probability density function f(x) is non-negative and may be larger than 1.

3. P(X ∈ [c, d]) = ∫_c^d f(x) dx is between 0 and 1.00, i.e., in [0, 1].

4. The sum of the probabilities of the various outcomes is 1.00. That is,

   P(X ∈ (−∞, ∞)) = P(X ∈ [a, b]) = ∫_a^b f(x) dx = 1

5. Let the events defined on two non-overlapping intervals, [c1, d1] and [c2, d2], be X ∈
   [c1, d1] and X ∈ [c2, d2]. These two events are mutually exclusive. That is,

   P(X ∈ [c1, d1] and X ∈ [c2, d2]) = 0,
   P(X ∈ [c1, d1] or X ∈ [c2, d2]) = P(X ∈ [c1, d1]) + P(X ∈ [c2, d2]).
The use of integration (∫) in the definition could be terrifying, especially for students who have never
studied it before. We are terrified only because we do not know there is a simple connection between the
discrete probability distribution and the continuous probability distribution. Let us make the connection
below and leave the introduction of integration to a mathematical appendix.
2 Making a connection between discrete and continuous distributions
To understand the difference between the two types of distributions, let’s start with a series of questions
(from simple to complicated), of which we can easily derive good answers.
2.1 Imagine throwing a dart at [0, 1]
Consider a continuous random variable that is defined over a segment of line [0, 1]. We can imagine throwing
a dart at the segment and each point in the segment has equal chance of being hit by the dart.
1. What is the probability that a dart randomly thrown will end up in the segment [0, 1]? The answer is
simple. Since the dart has to land somewhere on [0, 1], the probability of the dart landing on the
segment [0, 1] is 1.
2. What is the probability that a dart randomly thrown will end up in the segment [0, 1/2]? The answer is
slightly more difficult. Since the dart has equal chance to land on any point on [0, 1], the probability
of having the dart landing on half of the line [0, 1], i.e., the segment [0, 1/2], has to be 1/2.
3. What is the probability that a dart randomly thrown will end up in the segment [0, 1/4]? The answer is
slightly more difficult. Since the dart has equal chance to land on any point on [0, 1], the probability
of having the dart landing on a quarter of the line [0, 1], i.e., the segment [0, 1/4], has to be 1/4.
4. What is the probability that a dart randomly thrown will end up in the segment [0, 1/8]? The answer is
slightly more difficult. Since the dart has equal chance to land on any point on [0, 1], the probability
of having the dart landing on an eighth of the line [0, 1], i.e., the segment [0, 1/8], has to be 1/8.
5. What is the probability that a dart randomly thrown will end up in the segment [2/8, 3/8]? The answer is
slightly more difficult. The length of the segment is 1/8, the same as in the last question. Since the dart
has equal chance to land on any point on [0, 1], the probability of having the dart landing on an eighth
of the line [0, 1], i.e., the segment [2/8, 3/8], has to be 1/8.
6. What is the probability that a dart randomly thrown will end up in the segment [2/8, 3/8] AND [5/8, 6/8]?
The answer is not difficult. We have two non-overlapping segments of equal length. It is impossible for
any throw to land on both of these two non-overlapping segments. Thus, the probability should be 0.
7. What is the probability that a dart randomly thrown will end up in the segment [2/8, 3/8] OR [5/8, 6/8]?
The answer is slightly more difficult. We have two non-overlapping segments of equal length. The
probability should equal the sum of the probability of landing on the segment [2/8, 3/8] and the
probability of landing on the segment [5/8, 6/8]. That is, it should be 1/8 + 1/8 = 2/8.
8. What is the probability that a dart randomly thrown will end up in the segment [0, k], where 0 < k < 1? The
answer is slightly more difficult. As we learned from the previous discussions, there is a 1/2 probability
of having the dart landing on the segment [0, 1/2], 1/4 on the segment [0, 1/4], and 1/8 on the segment [0, 1/8].
It is not too difficult to induce that the probability of having the dart landing on the segment [0, k] is
simply k.
9. What is the probability that a dart randomly thrown will end up in the segment [k1, k2], where k1 < k2?
The answer is slightly more difficult. It is not too difficult to induce that the probability of having the
dart landing on the segment [k1, k2] is simply k2 − k1.
10. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] AND [k3, k4],
where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. It is
impossible for any throw to land on both of these two non-overlapping segments. Thus, the probability
should be 0.
11. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] OR [k3, k4],
where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. The
probability should equal the sum of the probability of landing on the segment [k1, k2] and the
probability of landing on the segment [k3, k4]. That is, it should be (k2 − k1) + (k4 − k3).
12. What is the probability that a dart randomly thrown will end up exactly at a single point 2/3? The
answer is slightly more difficult. There are infinite uncountable points on the entire segment [0, 1]. The
probability of the dart ending up at the point 2/3 is like the probability of the dart ending up in an
interval with zero length. That is, it should be zero.
13. What is the probability that a dart randomly thrown will end up exactly at one of the two points, 1/3
or 2/3? The answer is slightly more difficult. Since the two points are non-overlapping, the probability
should be the sum of the individual ones, i.e., 0 = 0 + 0.
14. What is the probability that a dart randomly thrown will end up exactly at one of the 99 points, 1/100,
2/100, ..., or 99/100? The answer is simple. Since the 99 points are non-overlapping, the probability
should be the sum of the individual ones, i.e., 0.
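These answers can be checked by simulation. The sketch below (an illustration added here, not part of the original notes) throws a large number of darts at [0, 1] using Python's `random.random` and counts where they land:

```python
import random

random.seed(0)
n = 1_000_000
darts = [random.random() for _ in range(n)]  # each throw lands uniformly on [0, 1]

# P(dart lands in [2/8, 3/8]) should be close to 1/8 = 0.125
p_segment = sum(2/8 <= x <= 3/8 for x in darts) / n

# P(dart lands exactly on the single point 2/3) should be 0
p_point = sum(x == 2/3 for x in darts) / n
```

With a million throws the segment frequency comes out very close to 1/8, while no throw lands exactly on the single point 2/3.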
2.2 Imagine throwing a dart at [a, b]
Let’s repeat the above questions and answers with a slight change. Consider a continuous random variable
that is defined over a segment of line [a, b], where a < b. We can imagine throwing a dart at the segment
and each point in the segment has equal chance of being hit by the dart.
1. What is the probability that a dart randomly thrown will end up in the segment [a, b]? The answer is
simple. Since the dart has to land somewhere on [a, b], the probability of the dart landing on the
segment [a,b] is 1.
2. What is the probability that a dart randomly thrown will end up in the segment [a, a + (1/2)(b − a)]? The
answer is slightly more difficult. Since the dart has equal chance to land on any point on [a, b], the
probability of having the dart landing on half of the line [a, b], i.e., the segment [a, a + (1/2)(b − a)], has
to be 1/2.
3. What is the probability that a dart randomly thrown will end up in the segment [a, a + (1/4)(b − a)]? The
answer is slightly more difficult. Since the dart has equal chance to land on any point on [a, b], the
probability of having the dart landing on a quarter of the line [a, b], i.e., the segment [a, a + (1/4)(b − a)],
has to be 1/4.
4. What is the probability that a dart randomly thrown will end up in the segment [a, a + (1/8)(b − a)]? The
answer is slightly more difficult. Since the dart has equal chance to land on any point on [a, b], the
probability of having the dart landing on an eighth of the line [a, b], i.e., the segment [a, a + (1/8)(b − a)],
has to be 1/8.
5. What is the probability that a dart randomly thrown will end up in the segment [a + (2/8)(b − a), a + (3/8)(b − a)]?
The answer is slightly more difficult. The length of the segment is one eighth of [a, b], the same as in the last question.
Since the dart has equal chance to land on any point on [a, b], the probability of having the dart landing
on an eighth of the line [a, b], i.e., the segment [a + (2/8)(b − a), a + (3/8)(b − a)], has to be 1/8.
6. What is the probability that a dart randomly thrown will end up in the segment [a + (2/8)(b − a), a + (3/8)(b − a)]
AND [a + (5/8)(b − a), a + (6/8)(b − a)]? The answer is not difficult. We have two non-overlapping segments
of equal length. It is impossible for any throw to land on both of these two non-overlapping segments.
Thus, the probability should be 0.
7. What is the probability that a dart randomly thrown will end up in the segment [a + (2/8)(b − a), a + (3/8)(b − a)]
OR [a + (5/8)(b − a), a + (6/8)(b − a)]? The answer is slightly more difficult. We have two non-overlapping
segments of equal length. The probability should equal the sum of the probability of landing on the
segment [a + (2/8)(b − a), a + (3/8)(b − a)] and the probability of landing on the segment [a + (5/8)(b − a), a + (6/8)(b − a)].
That is, it should be 1/8 + 1/8 = 2/8.
8. What is the probability that a dart randomly thrown will end up in the segment [a, k], where a < k < b?
The answer is slightly more difficult. As we learned from the previous discussions, it is not too difficult
to induce that the probability of having the dart landing on the segment [a, k] is simply (k − a)/(b − a).
9. What is the probability that a dart randomly thrown will end up in the segment [k1, k2], where k1 < k2?
The answer is slightly more difficult. It is not too difficult to induce that the probability of having the
dart landing on the segment [k1, k2] is simply (k2 − k1)/(b− a).
10. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] AND [k3, k4],
where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. It is
impossible for any throw to land on both of these two non-overlapping segments. Thus, the probability
should be 0.
11. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] OR [k3, k4], where
k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. The
probability should equal the sum of the probability of landing on the segment [k1, k2] and the
probability of landing on the segment [k3, k4]. That is, it should be [(k2 − k1) + (k4 − k3)]/(b − a).
12. What is the probability that a dart randomly thrown will end up exactly at a single point k1? The
answer is slightly more difficult. There are infinite uncountable points on the entire segment [a, b]. The
probability of the dart ending up at the point k1 is like the probability of the dart ending up in an
interval with zero length. That is, it should be zero.
13. What is the probability that a dart randomly thrown will end up exactly at one of the two points, k1
or k2? The answer is slightly more difficult. Since the two points are non-overlapping, the probability
should be the sum of the individual ones, i.e., 0 = 0 + 0.
14. What is the probability that a dart randomly thrown will end up exactly at one of the 99 points, k1,
k2, ..., or k99? The answer is simple. Since the 99 points are non-overlapping, the probability should
be the sum of the individual ones, i.e., 0.
2.3 Deriving the probability density function (pdf)
Based on what we know about discrete probability distributions, it might appear that there is an inconsistency
between the notions P(x ∈ [k1, k2]) = (k2 − k1)/(b − a) and P(x = k) = 0. Can we write P(x ∈ [k1, k2])
as a sum of probabilities of individual events? Yes, with the introduction of the density concept. That is,
we would like to define a density c such that P(x ∈ [a, k]) = (k − a) × c. What would c be? We know
that c must also satisfy P(x ∈ [a, b]) = (b − a) × c = 1, implying c = 1/(b − a). It is not too difficult
to check that the density so defined will generate results consistent with the discussions above.
In particular, P(x ∈ [a, k]) = (k − a) × c = (k − a)/(b − a), P(x ∈ [k1, k2]) = (k2 − k1)/(b − a), and
P(x = k) = P(x ∈ [k, k]) = (k − k)/(b − a) = 0.
What we have just discussed is the so-called uniform distribution over the interval [a, b].
Definition 1 (Uniform distribution): If a and b are numbers on the real line, the random variable
X ∼ U(a, b), i.e., has a uniform distribution, if the density function is

f(x) = 1/(b − a)   for a ≤ x ≤ b
f(x) = 0           otherwise

and the cumulative distribution function (cdf) is

F(x) = Prob(X < x) = 0                 for x ≤ a
F(x) = Prob(X < x) = (x − a)/(b − a)   for a ≤ x ≤ b
F(x) = Prob(X < x) = 1                 for b ≤ x
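Definition 1 translates directly into code. The sketch below is an illustration (the function names are ours, not the text's); it implements the pdf and cdf of U(a, b) and reproduces the dart-throwing answers:

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): constant 1/(b - a) inside [a, b], zero outside."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """F(x) = P(X < x) for X ~ U(a, b)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def uniform_prob(k1, k2, a, b):
    """P(X in [k1, k2]) = F(k2) - F(k1) = (k2 - k1)/(b - a) inside [a, b]."""
    return uniform_cdf(k2, a, b) - uniform_cdf(k1, a, b)
```

For instance, `uniform_prob(2/8, 3/8, 0, 1)` gives 1/8, and `uniform_prob(k, k, a, b)` gives 0, matching the dart answers.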
Recall that in the discrete case, the expected value of a random variable X with probability mass function
P(X) is

E(X) = Σ_X X P(X)

In the continuous case, P(X = k) = 0, so how do we compute the expected value? We can use
an approximation. Since P(X ∈ [k, k + dx]) > 0, it is possible to imagine the relevant segment divided
into many small intervals of length dx ([a, a + dx], [a + dx, a + 2dx], ..., [a + ((b−a)/dx − 1)dx, a + ((b−a)/dx)dx]),
and to obtain an approximation with the formula above by replacing P(X = k) with P(X ∈ [k, k + dx]) and
replacing X with one of three possibilities:
1. the lower bound of [k, k + dx], i.e., k:

   E(X) ≈ Σ_{i=1}^{(b−a)/dx} (a + (i − 1)dx) P(X ∈ [a + (i − 1)dx, a + i dx]) = Σ_{i=1}^{(b−a)/dx} (a + (i − 1)dx) × c × dx

2. the upper bound of [k, k + dx], i.e., k + dx:

   E(X) ≈ Σ_{i=1}^{(b−a)/dx} (a + i dx) P(X ∈ [a + (i − 1)dx, a + i dx]) = Σ_{i=1}^{(b−a)/dx} (a + i dx) × c × dx

3. the mid-point of [k, k + dx], i.e., k + dx/2:

   E(X) ≈ Σ_{i=1}^{(b−a)/dx} (a + (i − 1/2)dx) P(X ∈ [a + (i − 1)dx, a + i dx]) = Σ_{i=1}^{(b−a)/dx} (a + (i − 1/2)dx) × c × dx
The accuracy of such an approximation depends on dx. Generally, the smaller dx is, the more accurate the
approximation. To obtain a more accurate approximation, we can imagine dx shrinking towards zero. In
this case the number of terms being added expands to infinity:

E(X) = lim_{dx→0} Σ_{i=1}^{(b−a)/dx} (a + (i − 1)dx) × c × dx = lim_{dx→0} Σ_{i=1}^{(b−a)/dx} x_i × c × dx = ∫_a^b x c dx
[Figure: the uniform density y = f(x), with height c = 1/(b − a) over [a, b], subdivided into intervals [x_i, x_i + dx] of width dx; each rectangle corresponds to one term of the sum.]
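The shrinking-dx argument can be watched numerically. The sketch below (an illustration, using the uniform density c = 1/(b − a)) evaluates the lower-bound sum for smaller and smaller dx; the values approach the integral ∫_a^b x c dx = (a + b)/2:

```python
def expected_value_approx(a, b, n):
    """Lower-bound Riemann sum for E(X) of U(a, b), with n intervals of width dx."""
    dx = (b - a) / n
    c = 1 / (b - a)  # uniform density
    return sum((a + i * dx) * c * dx for i in range(n))

# For U(0, 1) the sums approach (a + b)/2 = 0.5 as dx shrinks (n grows)
approx = [expected_value_approx(0.0, 1.0, n) for n in (10, 100, 1000)]
```

The three sums work out to roughly 0.45, 0.495, and 0.4995, creeping up toward 0.5 exactly as the limiting argument predicts.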
In the expression above, we are really using the integral sign to stand for the limit of the sum:

lim_{dx→0} Σ_{i=1}^{(b−a)/dx} = ∫_a^b
In the above discussion, we considered only the case in which the dart lands anywhere on the interval [a, b] with
equal chance, i.e., the uniform distribution. The discussion can be easily extended to other situations, with
the interval [a, b] subdivided into many sub-intervals and the density held constant within each sub-interval.
Let each sub-interval be of length dx as before, so that we have the intervals [a, a + dx], [a + dx, a + 2dx], ...,
[a + ((b−a)/dx − 1)dx, a + ((b−a)/dx)dx], and the corresponding densities in the intervals can be labelled
f(x_1), ..., f(x_n), where n = (b − a)/dx. With this information, we can answer many questions similar to
those discussed earlier. Again, in the limiting case with dx approaching zero, the probability is defined
as

P(x ∈ [k1, k2]) = lim_{dx→0} Σ_{[k1,k2]} f_i dx = ∫_{k1}^{k2} f(x) dx
More generally, we can define the cumulative distribution function (cdf) and use the cdf to compute
probabilities.

Definition 2 (Cumulative distribution function (cdf)): Let f(x) be the pdf of a continuous
random variable X. The cumulative distribution function (cdf) is

F(x) = P(X < x) = ∫_{−∞}^{x} f(t) dt

so that

P(x ∈ [k1, k2]) = ∫_{k1}^{k2} f(x) dx = ∫_{−∞}^{k2} f(x) dx − ∫_{−∞}^{k1} f(x) dx = F(k2) − F(k1)

Given the cdf F(x), we can also derive the pdf f(x):

f(x) = (d/dx) F(x)

where d/dx means “differentiate with respect to x”.
We can check that such definitions are consistent at least with the uniform distribution. In fact, they hold for
any continuous probability distribution.
The above discussions suggest that the density function defines the probability distribution in the con-
tinuous case.
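The same limiting sum works for any density, not just the uniform one. As an illustration (the density f(x) = 2x on [0, 1] is our own example, chosen because it integrates to 1 and has the simple cdf F(x) = x²), a midpoint sum approximates P(x ∈ [k1, k2]) well:

```python
def prob_interval(f, k1, k2, n=10_000):
    """Approximate P(X in [k1, k2]) = integral of f over [k1, k2] by a midpoint sum."""
    dx = (k2 - k1) / n
    return sum(f(k1 + (i + 0.5) * dx) * dx for i in range(n))

# Assumed density f(x) = 2x on [0, 1]; its cdf is F(x) = x**2
p = prob_interval(lambda x: 2 * x, 0.2, 0.6)  # exact value: F(0.6) - F(0.2) = 0.32
```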
Example 2 (Part-time Work on Campus): A student has been offered part-time work in a
laboratory. The professor says that the work will vary from week to week. The number of hours
will be between 10 and 20 with a uniform probability density function:
1. How tall is the rectangle?
2. What is the probability of getting less than 15 hours in a week?
3. Given that the student gets at least 15 hours in a week, what is the probability that more
than 17.5 hours will be available?
[Figure: rectangular pdf y = f(x) with height c = 0.1 over the interval from 10 to 20.]
Because the probability is uniformly distributed, the pdf can be illustrated as a rectangle, as above,
and the height of the rectangle is the uniform density, i.e., 1/(20 − 10) = 0.1. If we
employ the same letters we used previously, we have a = 10, b = 20, and c = 0.1. The pdf is

f(x) = 1/(20 − 10) = 0.1   for 10 ≤ x ≤ 20
f(x) = 0                   otherwise
and the cdf is

F(x) = Prob(X < x) = 0                     for x ≤ 10
F(x) = Prob(X < x) = (x − 10)/(20 − 10)    for 10 ≤ x ≤ 20
F(x) = Prob(X < x) = 1                     for 20 ≤ x

Thus Prob(X < 15) = (15 − 10)/(20 − 10) = 0.5, and

Prob(X > 17.5 | X > 15) = Prob(X > 17.5 & X > 15)/Prob(X > 15) = Prob(X > 17.5)/Prob(X > 15) = (20 − 17.5)/(20 − 15) = 0.5.
Remember that the probability of a continuous random variable being equal to an exact value
is always zero. Therefore we are relieved from worrying about the boundaries of
intervals, i.e., whether to use “>” or “≥”.
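The arithmetic of this example can be reproduced with the cdf of U(10, 20). A minimal sketch (an illustration, not part of the original):

```python
def F(x, a=10, b=20):
    """cdf of the U(10, 20) weekly-hours distribution in Example 2."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

height = 1 / (20 - 10)                # 1. height of the rectangle: 0.1
p_less_15 = F(15)                     # 2. P(X < 15) = 0.5
# 3. P(X > 17.5 | X > 15) = P(X > 17.5) / P(X > 15) = 0.25 / 0.5
p_cond = (1 - F(17.5)) / (1 - F(15))
```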
Example 3 (Customer Complaints): You are the manager of the complaint department for a
large mail order company. Your data and experience indicate that the time it takes to handle a
single call, denoted T, ranges from 0 to 15 minutes and has a right-triangle
probability density function with a height of 2/15.
1. Show that the area under the triangle is 1.
2. Find the probability that a call will take longer than 10 minutes. That is, find P (T > 10).
3. Given that the call takes at least 5 minutes, what is the probability that it will take longer
than 10 minutes? That is, find P (T > 10|T > 5).
4. Find P (T < 10).
We know the area under the pdf curve is the probability, and it must be 1 because the outcomes from 0 to 15
cover all possibilities. The pdf should look as follows if we believe the
company tries to handle all calls as soon as possible:
[Figure: triangular pdf y = f(t), with height 2/15 at t = 0, declining linearly to 0 at t = 15.]
P(T > 10) should be the area under the pdf curve where 10 < t < 15. Basic geometry gives
that area as 1/9, i.e., P(T > 10) = 1/9 = 0.1111.

P(T > 10 | T > 5) = P(T > 10 & T > 5)/P(T > 5) = P(T > 10)/P(T > 5) = (1/9)/(4/9) = 0.25.

P(T < 10) = 1 − P(T > 10) = 1 − 1/9 = 8/9 = 0.8889.
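The triangle areas can be recomputed directly. The sketch below is an illustration; it assumes the density is the descending line f(t) = (2/15)(1 − t/15) implied by the figure:

```python
def f(t):
    """Assumed descending triangular density on [0, 15], height 2/15 at t = 0."""
    return (2 / 15) * (1 - t / 15) if 0 <= t <= 15 else 0.0

def tail_prob(t):
    """P(T > t): area of the similar triangle with base (15 - t) and height f(t)."""
    return 0.5 * (15 - t) * f(t)

total_area = tail_prob(0)              # 1. whole triangle: 0.5 * 15 * (2/15) = 1
p_gt_10 = tail_prob(10)                # 2. P(T > 10) = 1/9
p_cond = tail_prob(10) / tail_prob(5)  # 3. P(T > 10 | T > 5) = (1/9)/(4/9) = 0.25
p_lt_10 = 1 - tail_prob(10)            # 4. P(T < 10) = 8/9
```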
3 Normal distributions
The most popular distribution we will use after this chapter is the normal distribution. It pays to understand it
very well.
Definition 3 (Normal distribution): The random variable X ∼ N(µ, σ²), i.e., has a normal
distribution, if the random variable is defined on the whole real line (−∞, ∞) and has the
density function

f(x) = (1/(σ√(2π))) e^(−z²/2),   where   z = (x − µ)/σ

and µ and σ are the mean and standard deviation of the random variable (hence σ² is the
variance), π = 3.14159..., and e = 2.71828... is the base of natural or Napierian logarithms.
Normal distributions are characterized by their mean and variance. Often, a normal random
variable, X, distributed as normal with mean µ and variance σ² will be denoted as

X ∼ N(µ, σ²)

where “∼” reads “distributed as”.
The normal distribution has several main characteristics:

1. It is bell-shaped and single-peaked (unimodal) at the exact center of the distribution, µ.

2. It is symmetrical about its mean. The arithmetic mean, median, and mode of the distribution
   are equal and located at the peak. Thus half the area under the curve is above the mean
   and half is below it.

3. The normal probability distribution is asymptotic. That is, the curve gets closer and closer
   to the X-axis but never actually touches it.
[Figure: normal density curves for N(0, 0.3), N(0, 0.5), N(0, 1), N(0, 2), N(−1, 0.5), and N(3, 0.5), plotted for x from −5 to 5.]
Given the density function described above, and unlike the uniform distribution, it is not easy to integrate to
obtain the probability of a normal random variable lying within a segment. It is not too difficult to get Excel
to do the calculation, even for different combinations of mean and variance. However, many years ago,
computational power was limited. Statisticians worked very hard to come up with a table from which people can easily
read off the cumulative distribution function of a normal random variable, i.e., P(X < x). Do we
need a table for each combination of µ and σ²? It turns out that all normal random variables
can be transformed to standard normal random variables easily. Thus, for those who know how to do the
transformation, we only need one table: the standard normal table.
Definition 4 (Standard Normal distribution): A standard normal random variable is a normal
random variable with zero mean (µ = 0) and unit standard deviation (σ = 1). Its probability
density is defined on the real line (from −∞ to ∞):

f(x) = (1/√(2π)) e^(−x²/2)
[Figure: pdf and cdf of the standard normal distribution, plotted for x from −2.5 to 2.5.]
The standard normal distribution is sometimes known as z-distribution.
Theorem 2 (Transform to Standard Normal Distribution): A linear transformation of a normal
random variable will remain normal. In particular, any normal random variable with mean µ
and standard deviation σ can be transformed to a standard normal random variable:

Z = (X − µ)/σ
Thus, P(X ∈ [a, b]) = P(Z ∈ [(a − µ)/σ, (b − µ)/σ]). With this property, we do not need separate
probability tables for different µ and σ. Instead, we only need one table: the standard normal table.
Although we can easily calculate the probability of a normal random variable with any combination of
mean and variance lying in an interval with the help of a computer, we need to learn how to use the tables
in exams.
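For readers with a computer rather than a table, the standardization in Theorem 2 reduces every normal probability to the standard normal cdf Φ, which can be built from the Python standard library's `math.erf`. This sketch is an illustration (the helper names are ours), not the method the notes use:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf: P(Z < z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2), via the transformation Z = (X - mu)/sigma."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

# Reproduce a table entry: P(0 < Z < 0.84) = phi(0.84) - phi(0), about 0.2995
p_table = phi(0.84) - phi(0)
```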
Example 4 (How to read the standard normal distribution table): Suppose we have the following
part of a standard normal distribution table, accompanied by a graph:
z    ...  0.03    0.04    0.05    0.06    0.07    ...
...  ...  ...     ...     ...     ...     ...     ...
0.7  ...  0.2673  0.2704  0.2734  0.2764  0.2794  ...
0.8  ...  0.2967  0.2995  0.3023  0.3051  0.3078  ...
0.9  ...  0.3238  0.3264  0.3289  0.3315  0.3340  ...
1.0  ...  0.3485  0.3508  0.3531  0.3554  0.3577  ...
1.1  ...  0.3708  0.3729  0.3749  0.3770  0.3790  ...
...  ...  ...     ...     ...     ...     ...     ...
[Figure: standard normal curve with the area between 0 and z shaded.]
Let’s first make clear what this table means. The leftmost column and the top
row combine to make the number we want, i.e., the z in the graph. The inner part of the table
is the resulting probability, i.e., the shaded area in the graph. For example, suppose we want to
know Prob(0 < X < 0.84); we just find the row of 0.8 and the column of 0.04, and then read
the number in the cell, i.e., 0.2995, so we have Prob(0 < X < 0.84) = 0.2995.
Try the following:
1. Prob(0 < X < 1.163)
This is similar to our previous example. Because the table gives z only at two decimal places,
we round 1.163 to the nearest hundredth, 1.16 (i.e., we take Prob(0 < X <
1.163) ≈ Prob(0 < X < 1.16)). Then we look the value up in the table
at row 1.1 and column 0.06, which is simply 0.3770.
2. Prob(X > 0.77)
First remember that one of the very nice properties of the normal distribution is its symmetry,
i.e., Prob(X > µ) = Prob(X < µ) = 0.5. This tells us that to get Prob(X > 0.77)
we need just one more step. We first find that Prob(0 < X < 0.77) = 0.2794. Then
Prob(X > 0.77) = 0.5 − 0.2794 = 0.2206.
3. Prob(0.77 < X < 1.16)
We should have no trouble getting Prob(0 < X < 0.77) = 0.2794 and Prob(0 < X < 1.16) =
0.3770, as we have done already. Then Prob(0.77 < X < 1.16) is simply their difference:
Prob(0.77 < X < 1.16) = 0.3770 − 0.2794 = 0.0976.
4. Prob(−0.85 < X < 0)
Symmetry helps again here, telling us that Prob(−0.85 < X < 0) = Prob(0 < X < 0.85). We
have no problem finding Prob(0 < X < 0.85), which sits at row 0.8 and column 0.05:
it is 0.3023.
5. Prob(X < −1.16)
We apply symmetry again to get Prob(X < −1.16) = Prob(X > 1.16) = 0.5 − Prob(0 <
X < 1.16) = 0.5 − 0.3770 = 0.123.
6. Prob(−0.85 < X < 0.77)
Since the events of X falling into (−0.85, 0) and into (0, 0.77) are mutually exclusive, we know
Prob(−0.85 < X < 0.77) = Prob(−0.85 < X < 0) + Prob(0 < X < 0.77) = 0.3023 +
0.2794 = 0.5817.
7. Sometimes we want to use the table in reverse, to find a particular z instead of the probability.
   Suppose Prob(−k < X < k) = 0.75, (k > 0). What is k here?
   Considering the symmetry of the normal distribution, 0.75 = Prob(−k < X < k) = 2 ×
   Prob(0 < X < k), i.e., Prob(0 < X < k) = 0.375. Therefore we look for 0.375 (or the
   number closest to 0.375) in the table. We find at row 1.1 and column 0.05 the
   probability 0.3749. Therefore a reasonably precise result would be k = 1.15.
Recall that all normal distributions with different µ and σ, i.e., different X ∼ N(µ, σ²), can be
transformed to standard normal through a simple linear transformation:

Z = (X − µ)/σ

Thus the standard normal distribution table helps us with much more than the examples above. See
the following further examples:
1. Given X ∼ N(1, 0.25), what is Prob(0 < X < 1.5)?
   Transforming X to the standard normal Z, we have Prob(0 < X < 1.5) = Prob((0 − 1)/0.5 < Z <
   (1.5 − 1)/0.5) = Prob(−2 < Z < 1). We are able to calculate this through the above exercises.
   Note that the table given in this example is not enough; a full table covering −2 and 1
   must be used. The answer should be close to Prob(0 < X < 1.5) = 0.8185, as different
   tables report at different accuracy.
2. Given X ∼ N(−2, 4), and Prob(−1 < X < k) = 0.10. What is k?
Again we do the standard normal transformation first. The lower boundary −1 is transformed
to (−1 − (−2))/2 = 0.5; the upper boundary k becomes (k + 2)/2 after the linear transformation. Thus

Prob(−1 < X < k) = Prob(0.5 < Z < (k + 2)/2) = Prob(0 < Z < (k + 2)/2) − Prob(0 < Z < 0.5) = 0.1

or

Prob(0 < Z < (k + 2)/2) = 0.1 + Prob(0 < Z < 0.5)
= 0.1 + 0.1915
= 0.2915

Then we look for the number closest to 0.2915 in a full standard normal table. At ordinary
accuracy requirements, (k + 2)/2 = 0.81 would be good enough, or k = 2 × 0.81 − 2 = −0.38.
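The reverse lookup in item 7 can also be done without a table by inverting Φ numerically. The bisection sketch below is an illustration, using an erf-based standard normal cdf from the Python standard library:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Find z with phi(z) = p by bisection (phi is increasing)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Item 7: Prob(-k < Z < k) = 0.75 means phi(k) = 0.5 + 0.75/2 = 0.875, so k is about 1.15
k = phi_inv(0.875)
```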
Example 5 (Normal distribution): Suppose you work in Quality Control for GE. Light bulb life
has a normal distribution with µ = 2000 hours and σ = 200 hours.
1. What is the probability that a bulb will last between 2000 and 2400 hours?
P (2000 < X < 2400) = P [(2000− µ)/σ < (X − µ)/σ < (2400− µ)/σ]
= P [0 < (X − µ)/σ < (2400− µ)/σ]
= P [0 < Z < 2]
= 0.4772
2. What is the probability that a bulb will last less than 1470 hours?

P(X < 1470) = P[(X − µ)/σ < (1470− µ)/σ]
= P[Z < −2.65]
= P[Z > 2.65]
= 0.5 − P[0 < Z < 2.65]
= 0.5 − 0.4960
= 0.0040
Example 6 (Normal distribution): The daily water usage per person in New Providence, New
Jersey is normally distributed with a mean of 20 gallons and a standard deviation of 5 gallons.
1. About 68 percent of those living in New Providence will use how many gallons of water?
Note that we know that P (µ − σ < X < µ + σ) = 0.6826. Thus, about 68% of the daily
water usage will lie between 15 (µ− σ) and 25 (µ + σ) gallons.
2. What is the probability that a person from New Providence selected at random will use
between 20 and 24 gallons per day?
P (20 < X < 24) = P [(20− 20)/5 < (X − 20)/5 < (24− 20)/5] = P [0 < Z < 0.8]
The area under a normal curve between a z-value of 0 and a z-value of 0.80 is 0.2881. We
conclude that 28.81 percent of the residents use between 20 and 24 gallons of water per day.
3. What percent of the population use between 18 and 26 gallons of water per day?
P (18 < X < 26) = P [(18− 20)/5 < (X − 20)/5 < (26− 20)/5]
= P (−0.4 < Z < 1.2)
= P (−0.4 < Z < 0) + P (0 < Z < 1.2)
= P (0 < Z < 0.4) + P (0 < Z < 1.2)
= 0.1554 + 0.3849
= 0.5403
Example 7 (Normal distribution): Professor Wong has determined that the scores in his statis-
tics course are approximately normally distributed with a mean of 72 and a standard deviation
of 5. He announces to the class that the top 15 percent of the scores will earn an A. What is the
lowest score a student can earn and still receive an A?
To begin, let k be the score that separates an A from a B. If 15 percent of the students score more
than k, then 35 percent must score between the mean of 72 and k.
1. Write down the relation between k and the probability: P (X > k) = 0.15 and P (X < k) =
1− P (X > k) = 0.85
2. Transform X into z:
P [(X − 72)/5 < (k − 72)/5] = P [Z < (k − 72)/5]
3. We look for s = (k − 72)/5 such that
P [0 < Z < s] = 0.85− 0.5 = 0.35
From the standard normal table, we find s = 1.04:
P [0 < Z < 1.04] = 0.35
4. Compute k: (k − 72)/5 = 1.04 implies k = 77.2
Thus, those with a score of 77.2 or more earn an A.
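The same cutoff can be found with the inverse normal CDF, which replaces steps 2–4; a minimal sketch, with the mean and standard deviation taken from the example:

```python
from statistics import NormalDist

# Scores: normal with mean 72 and standard deviation 5
scores = NormalDist(mu=72, sigma=5)

# Top 15% earn an A, so the cutoff k satisfies P(X < k) = 0.85
k = scores.inv_cdf(0.85)
print(round(k, 1))  # about 77.2
```

The small difference from the text's 77.2 vs. the unrounded 77.18 comes from reading z = 1.04 off a table instead of using the exact inverse CDF.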
Example 8 (Stock returns): Let X be daily stock returns (percentage change of daily stock
prices). Suppose X is distributed as normal with mean 0 and variance 4, i.e., X ∼ N(0, 4).
What is the probability that the daily stock returns will lie between -2 and +2?
First make the linear transformation from X ∼ N(0, 4) to standard normal Z:
P (−2 < X < 2) = P [(−2 − 0)/√4 < Z < (2 − 0)/√4] = P (−1 < Z < 1)
and because of the symmetry of normal distribution,
P (−2 < X < 2) = 2P (0 < Z < 1)
Then we can easily find the probability form the standard normal table:
P (−2 < X < 2) = 2× 0.3413 = 0.6826
The probability that the daily stock returns will lie between -2 and +2 is 0.6826.
Example 9 (Personal income): Let X be monthly personal income (in dollars). Suppose
log(X) is distributed as normal with mean 9 and variance 16, i.e., log(X) ∼ N(9, 16). What is
the probability that the monthly personal income of a randomly drawn person will be less than
5000? Is there a reason we assume log(X) instead of X to follow a normal distribution?
For simplicity, let us denote Y = log(X) as a new random variable; we know Y follows a
normal distribution N(9, 16). Because Y = log(X) is a monotonically increasing function
(i.e., Y increases as X increases), the event X < 5000 is the same as the event
Y < log(5000) = 8.5172. As for why we assume log(X) rather than X to be normal: income is
non-negative and typically right-skewed, so the normal distribution (which is symmetric and
takes negative values) fits log(X) much better than X itself.
Next we can transform Y to standard normal, Z:
P (Y < 8.5172) = P [Z < (8.5172 − 9)/√16 = −0.12] = 0.5 − P (0 < Z < 0.12)
Looking up the standard normal table, the probability is 0.5 − 0.0478 = 0.4522.
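Numerically, the same answer follows by applying the normal CDF of Y = log(X) at log(5000); a sketch with the parameters from the example:

```python
from math import log
from statistics import NormalDist

# Y = log(X) ~ N(9, 16), i.e. mean 9 and standard deviation sqrt(16) = 4
y = NormalDist(mu=9, sigma=4)

# P(X < 5000) = P(Y < log(5000)) because log is monotonically increasing
p = y.cdf(log(5000))
print(round(p, 4))  # about 0.452 (the text, using table rounding, gets 0.4522)
```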
3.1 Checking for normality
The normal distribution is one of the most important distributions. But how do we know whether data are,
at least approximately, normally distributed? There are at least two ways to check.
First, we can check the moments. We know that a normally distributed random variable has zero skewness
(due to the symmetry of the distribution) and zero excess kurtosis.¹ If the sample skewness and excess kurtosis
are close to zero, we have evidence that the data are likely from a normally distributed population.
Simulation 1 (Checking the skewness and kurtosis for normality):
1. Generate 50 observations from a standard normal distribution N(0, 1), and another 50 observations
from a uniform distribution U(0, 1). Compute their sample skewness and excess kurtosis:²

skewness = [√n × Σ(xi − x̄)³] / [Σ(xi − x̄)²]^(3/2)

excess kurtosis = [n × Σ(xi − x̄)⁴] / [Σ(xi − x̄)²]² − 3

where the sums run over i = 1, ..., n.
2. Repeat the last step 1000 times. Report the skewness and excess kurtosis calculations, and the
average skewness and excess kurtosis over these 1000 repetitions.

¹ For instance, refer to Section 16.7, Tests for Skewness and Excess Kurtosis, p. 567, of Estimation and Inference in Econometrics by Davidson and MacKinnon [1].
² Note that a slight adjustment is needed to obtain unbiased estimators of these statistics.
Below is one set of possible results we have generated:

Distribution   Observations   Average Skewness   Average Excess Kurtosis
U(0,1)         50             0.0114             -1.1496
U(-2,2)        50             -0.0077            -1.1375
N(0,1)         50             -0.0008            0.0031
N(-2,5)        50             0.0077             -0.0290
The table shows that the normal distribution has much closer-to-zero skewness and excess kurtosis
than the uniform distribution, and that this holds regardless of the means and variances.
[Refer to sim1.xls]
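Simulation 1 can also be sketched in Python rather than Excel. This is our illustration: the seed and the biased (unadjusted) moment formulas from the text are the assumptions, so the averages will differ slightly from the table above:

```python
import random

def skew_kurt(xs):
    """Sample skewness and excess kurtosis as defined in the text
    (no small-sample bias adjustment)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs)
    m3 = sum((x - m) ** 3 for x in xs)
    m4 = sum((x - m) ** 4 for x in xs)
    return (n ** 0.5) * m3 / m2 ** 1.5, n * m4 / m2 ** 2 - 3

random.seed(1)
reps, n = 1000, 50
results = {}
for name, draw in [("N(0,1)", lambda: random.gauss(0, 1)),
                   ("U(0,1)", random.random)]:
    sims = [skew_kurt([draw() for _ in range(n)]) for _ in range(reps)]
    results[name] = (sum(s for s, _ in sims) / reps,
                     sum(k for _, k in sims) / reps)
    print(name, results[name])
```

For N(0,1), both averages should be close to zero; for U(0,1), the average excess kurtosis is clearly negative (the population value is −1.2), echoing the table above.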
Alternatively, we can use a normal probability plot. Suppose we have n observations in the sample.
1. Sort them in ascending order.
2. Compute the empirical z value (i.e., (x− x̄)/σ).
3. Generate a column 0.5, 1.5, ..., [0.5 + (n− 1)]. Call this column U .
4. Generate another column p(z) = U/n.
5. Generate another column theoretical z = NORMSINV (p(z)).
6. Plot empirical z against the theoretical z.
If the data come from a normal distribution, the plot should be approximately a straight line. We illustrate
the steps in the following example.
Example 10 (normal probability plot):
1. Generate 1000 observations from N(0, 1).
2. Sort them in ascending order.
3. Compute the empirical z value (i.e., (x− x̄)/σ).
4. Generate a column 0.5, 1.5, ..., [0.5 + (n− 1)]. Call this column U .
5. Generate another column p(z) = U/n.
6. Generate another column theoretical z = NORMSINV (p(z)).
7. Plot empirical z against the theoretical z.
8. Repeat with 1000 observations drawn from U(0, 1).
The two plots are shown below.
[Figure: Normal probability plot (underlying population = normal); horizontal axis: theoretical z value; vertical axis: z value from data]

[Figure: Normal probability plot (underlying population = uniform); horizontal axis: theoretical z value; vertical axis: z value from data]
As is evident from the two plots, the points lie more or less on a straight line when the underlying
population is normal, but not when the underlying population is uniform.
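The steps of the normal probability plot can be sketched in Python, using `statistics.NormalDist` in place of Excel's NORMSINV. This sketch only computes the coordinate pairs rather than drawing the plot; the seed and sample are illustrative:

```python
import random
from statistics import NormalDist, mean, stdev

def probability_plot_points(data):
    """Return (theoretical z, empirical z) pairs for a normal probability plot."""
    n = len(data)
    xs = sorted(data)                      # step: sort ascending
    m, s = mean(xs), stdev(xs)
    empirical = [(x - m) / s for x in xs]  # step: empirical z
    # steps: plotting positions p = (i + 0.5)/n and theoretical z = inverse CDF at p
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, empirical))

random.seed(0)
pts = probability_plot_points([random.gauss(0, 1) for _ in range(1000)])
print(pts[0], pts[-1])  # for normal data, the pairs lie near the 45-degree line
```

Plotting the pairs (with any charting tool) reproduces the figure above; repeating with `random.random()` draws produces the S-shaped deviation typical of uniform data.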
4 Bivariate distributions
Except for the use of integration, the bivariate distributions of continuous random variables are not very
different from those of discrete random variables.
Theorem 3 (Characteristics of a Bivariate Continuous Distribution): Let X and Y be continuous
random variables defined on the intervals [a, b] and [c, d], respectively, with joint probability
density function f(x, y).
1. The probability density function f(x, y) is non-negative and may be larger than 1.
2. The probability that X takes a value in an interval [m,n] and Y in [k, l] is

P (X ∈ [m,n], Y ∈ [k, l]) = ∫_{x=m}^{x=n} ∫_{y=k}^{y=l} f(x, y) dy dx

3. P (X ∈ [m,n], Y ∈ [k, l]) = ∫_{x=m}^{x=n} ∫_{y=k}^{y=l} f(x, y) dy dx is between 0 and 1.00.
4. The sum of the probabilities of the various outcomes is 1.00. That is,

P (X ∈ [−∞,∞], Y ∈ [−∞,∞]) = P (X ∈ [a, b], Y ∈ [c, d]) = ∫_{x=a}^{x=b} ∫_{y=c}^{y=d} f(x, y) dy dx = 1
5. Let two events be defined on the non-overlapping regions ([m1, n1], [k1, l1]) and ([m2, n2], [k2, l2]):
the event X ∈ [m1, n1] and Y ∈ [k1, l1], and the event X ∈ [m2, n2] and Y ∈ [k2, l2]. These two
events are mutually exclusive. That is,

P ((X ∈ [m1, n1] and Y ∈ [k1, l1]) and (X ∈ [m2, n2] and Y ∈ [k2, l2])) = 0

P ((X ∈ [m1, n1] and Y ∈ [k1, l1]) or (X ∈ [m2, n2] and Y ∈ [k2, l2]))
= P (X ∈ [m1, n1] and Y ∈ [k1, l1]) + P (X ∈ [m2, n2] and Y ∈ [k2, l2])
6. The marginal density function of X is

f(x) = ∫_{y=−∞}^{∞} f(x, y) dy = ∫_y f(x, y) dy.

Note that the marginal density function of X is used when we do not care about the values
Y takes. Similarly, the marginal density function of Y is

f(y) = ∫_{x=−∞}^{∞} f(x, y) dx = ∫_x f(x, y) dx.
7. The conditional density function of X given Y is

f(x|y) = f(x, y)/f(y) if f(y) > 0, and f(x|y) = 0 if f(y) = 0.

Definition 5 (Independent bivariate uniform distribution): If a, b, c and d are numbers on the
real line, the random variable (X1, X2) ∼ U(a, b, c, d), i.e., has an independent bivariate uniform
distribution, if

f(x1, x2) = 1/[(b − a)(d − c)] for a ≤ x1 ≤ b and c ≤ x2 ≤ d
f(x1, x2) = 0 otherwise
Definition 6 (Bivariate normal distribution): The two random variables (X, Y ) ∼ N(µx, µy, σ²x, σ²y, ρ),
i.e., have a bivariate normal distribution with a correlation coefficient of ρ, if the two random
variables are defined jointly on the whole real line (−∞,∞) and have the following density function:

f(x, y) = [1/(2π σx σy √(1 − ρ²))] exp{−[εx² + εy² − 2ρ εx εy] / [2(1 − ρ²)]}

where εx = (x − µx)/σx and εy = (y − µy)/σy.
5 Expectations
The expectation plays a central role in statistics and economics. The expectation (often known as the mean)
reports the central location of the data. It is also known as the long-run average value of the random
variable, i.e., the average of the outcomes over many repetitions of the experiment.
Definition 7 (Expectation (mean)): Let X be a continuous random variable defined over [a, b],
with probability density function f(x). The expectation of X is

E(X) = ∫_{x=−∞}^{∞} x f(x) dx = ∫_x x f(x) dx

The expectation, E(X), is often denoted by the Greek letter µ (pronounced "mu").

Thus, the expectation of a random variable is a weighted average of all the possible values of the random
variable, weighted by its probability density function.
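The integral in Definition 7 can be checked numerically with a midpoint Riemann sum. As an illustration (our choice, not from the text), take X ∼ U(2, 6), whose mean should be (2 + 6)/2 = 4:

```python
# Approximate E(X) = integral of x f(x) dx by a midpoint Riemann sum,
# for X ~ U(2, 6) with density f(x) = 1/(6 - 2) on [2, 6]
a, b = 2.0, 6.0
f = lambda x: 1.0 / (b - a)

n = 100_000
dx = (b - a) / n
ex = sum((a + (i + 0.5) * dx) * f(a + (i + 0.5) * dx) * dx for i in range(n))
print(round(ex, 4))  # about 4.0, matching (a + b)/2
```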
Definition 8 (Conditional Expectation): For a bivariate probability distribution, the conditional
expectation or conditional mean E(X|Y = y) is computed by the formula

E(X|Y = y) = ∫_x x f(x|y) dx

The unconditional expectation or mean of X is related to the conditional mean:

E(X) = ∫_y E(X|Y = y) f(y) dy = E[E(X|Y )]
Theorem 4 (Expectation of a linear transformed random variable): If a and b are constants and
X is a random variable, then
1. E(a) = a
2. E(bX) = bE(X)
3. E(a + bX) = a + bE(X)
Proof: We will only show the most general case, E(a + bX) = a + bE(X).

E(a + bX) = ∫_x (a + bx) f(x) dx
= ∫_x a f(x) dx + ∫_x b x f(x) dx
= a ∫_x f(x) dx + b ∫_x x f(x) dx
= a × 1 + bE(X)
= a + bE(X)
Definition 9 (Variance): Let X be a continuous random variable defined over [a, b], with
probability density function f(x). The variance of X is

V (X) = ∫_x (x − E(X))² f(x) dx

The variance, V (X), is often denoted by σ² (pronounced "sigma squared").

Note that the variance of a random variable is the expectation of the squared deviation of the random
variable from its mean. That is, if we define a transformed variable Z = (X − E(X))², then V (X) = E(Z).
Thus, we expect results about the variance of a transformed variable to resemble those about the
expectation of a transformed variable.
Example 11 (Variance of a random variable): Suppose X and Y are jointly distributed random
variables with probability density function f(x, y). The variance of X is

V (X) = ∫_y ∫_x (x − E(X))² f(x, y) dx dy
Definition 10 (Conditional Variance): For a bivariate probability distribution, the conditional
variance V (X|Y = y) is computed by the formula

V (X|Y = y) = ∫_x (x − E(X|Y = y))² f(x|y) dx
Theorem 5 (Variance of a linear transformed random variable): If a and b are constants and
X is a random variable, then
1. V (a) = 0
2. V (bX) = b²V (X)
3. V (a + bX) = b²V (X)

Proof: We will only show the most general case, V (a + bX) = b²V (X).

V (a + bX) = E[(a + bX) − (a + bE(X))]²
= E[bX − bE(X)]²
= E[b(X − E(X))]²
= b²E[(X − E(X))²]
= b²V (X)
Definition 11 (Covariance): The covariance between two random variables X and Y is

C(X, Y ) = E[(X − E(X))(Y − E(Y ))]
= ∫_x ∫_y (x − E(X))(y − E(Y )) f(x, y) dy dx
Note that the covariance can be written as
C(X, Y ) = E[(X − E(X))(Y − E(Y ))]
= E[XY − E(X)Y −XE(Y ) + E(X)E(Y )]
= E[XY ]− E[E(X)Y ]− E[XE(Y )] + E[E(X)E(Y )]
= E[XY ]− E(X)E(Y )− E(X)E(Y ) + E(X)E(Y )
= E[XY ]− E(X)E(Y )
Theorem 6 (Covariance of a linear transformed random variable): If a and b are constants and
X is a random variable, then
1. C(a, b) = 0
2. C(a, bX) = 0
3. C(a + bX, Y ) = bC(X, Y )
Proof: We will only show the most general case, C(a + bX, Y ) = bC(X, Y ).

C(a + bX, Y ) = E{[(a + bX) − (a + bE(X))][Y − E(Y )]}
= E{[bX − bE(X)][Y − E(Y )]}
= E{[b(X − E(X))][Y − E(Y )]}
= bE{[X − E(X)][Y − E(Y )]}
= bC(X, Y )
Theorem 7 (Variance of a sum of random variables): If a and b are constants, X and Y are
random variables, then
1. V (X + Y ) = V (X) + V (Y ) + 2C(X, Y )
2. V (aX + bY ) = a²V (X) + b²V (Y ) + 2abC(X, Y )

Proof: We will only show the most general case, V (aX + bY ) = a²V (X) + b²V (Y ) + 2abC(X, Y ).

V (aX + bY ) = E[(aX + bY ) − (aE(X) + bE(Y ))]²
= E[aX − aE(X) + bY − bE(Y )]²
= E[a(X − E(X)) + b(Y − E(Y ))]²
= E[a²(X − E(X))² + b²(Y − E(Y ))² + 2ab(X − E(X))(Y − E(Y ))]
= a²E[(X − E(X))²] + b²E[(Y − E(Y ))²] + 2abE[(X − E(X))(Y − E(Y ))]
= a²V (X) + b²V (Y ) + 2abC(X, Y )
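Theorem 7 can be checked by simulation; a sketch, where the distributions, constants, and seed are all illustrative choices (the draws are independent, so C(X, Y) ≈ 0):

```python
import random

random.seed(42)
n = 200_000
a, b = 2.0, -1.0

# Independent draws, so C(X, Y) = 0 and V(aX + bY) should equal a^2 V(X) + b^2 V(Y)
xs = [random.gauss(0, 1) for _ in range(n)]   # V(X) = 1
ys = [random.gauss(0, 3) for _ in range(n)]   # V(Y) = 9

def var(zs):
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

lhs = var([a * x + b * y for x, y in zip(xs, ys)])
rhs = a ** 2 * var(xs) + b ** 2 * var(ys)
print(round(lhs, 2), round(rhs, 2))  # both close to 4*1 + 1*9 = 13
```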
Definition 12 (Independence): Consider two random variables X and Y with joint probability
density f(x, y), marginal probability f(x), f(y), conditional probability f(x|y) and f(y|x).
1. They are said to be independent of each other if and only if
f(x, y) = f(x)× f(y) for all x and y.
X and Y are independent if the joint density, f(x, y), is the product of the corresponding
marginal probability density functions.
2. X is said to be independent of Y if and only if
f(x|y) = f(x) for all x and y.
3. Y is said to be independent of X if and only if
f(y|x) = f(y) for all x and y.
Theorem 8 (Consequence of Independence): If X and Y are independent random variables, we
will have
E(XY ) = E(X)E(Y )
Note, however, that E(XY ) = E(X)E(Y ) does not imply that the random variables X and Y are independent.
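A standard counterexample for this last remark (our illustration, not from the text): take X standard normal and Y = X². Then Y is completely determined by X, yet E(XY) = E(X³) = 0 = E(X)E(Y):

```python
import random

random.seed(7)
n = 200_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x * x for x in xs]  # Y = X^2 is a function of X, so X and Y are dependent

e = lambda zs: sum(zs) / len(zs)
exy = e([x * y for x, y in zip(xs, ys)])   # E(XY) = E(X^3) = 0 by symmetry
ex_ey = e(xs) * e(ys)                      # E(X)E(Y) = 0 * 1 = 0
print(round(exy, 2), round(ex_ey, 2))  # both near 0, despite the dependence
```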
6 The Normal Approximation to the Binomial
The normal distribution (a continuous distribution) yields a good approximation of the binomial distribution
(a discrete distribution) for large values of n.
Recall for the binomial experiment:
1. There are only two mutually exclusive outcomes (success or failure) on each trial.
2. A binomial distribution results from counting the number of successes.
3. Each trial is independent.
4. The probability of success is the same from trial to trial, and the number of trials n is fixed.
The normal probability distribution is generally a good approximation to the binomial probability distribution
when nπ and n(1 − π) are both greater than 5 – because of the Central Limit Theorem (to be discussed
in the next chapter). However, because the normal distribution can take all real numbers (is continuous)
while the binomial distribution can only take integer values (is discrete), we need to correct for continuity.
A normal approximation to the binomial should identify the binomial event “8” with the normal interval
“(7.5, 8.5)” (and similarly for other integer values).
Binomial Event Normal Interval
0 (-0.5,0.5)
1 (0.5,1.5)
2 (1.5,2.5)
3 (2.5,3.5)
... ...
x (x− 0.5, x + 0.5)
... ...
[Figure: Probability histogram of B(p, n) = B(.3, 100) and its normal approximation N(np, np(1 − p)) = N(30, 21)]
Example 12 (Continuity correction in normal approximation of binomial): If n = 20 and
π = .25, what is the probability that X is greater than or equal to 8?
The normal approximation without the continuity correction factor yields

Z = (8 − 20 × .25)/√(20 × .25 × .75) = 1.55

so P (X ≥ 8) ≈ P (Z ≥ 1.55) = .0606 (from the standard normal table).
The continuity correction factor requires us to use 7.5 in order to include 8 since the inequality
is weak and we want the region to the right.
Z = (7.5 − 20 × .25)/√(20 × .25 × .75) = 1.29

P (X ≥ 7.5) = P (Z ≥ 1.29) = .0985. The exact solution from the binomial distribution function is
.1019. Thus, the normal approximation with the continuity correction yields a good approximation to the
binomial.
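The exact binomial tail and both normal approximations can be compared directly; a sketch using only the standard library, with n and π from the example:

```python
from math import comb
from statistics import NormalDist

n, p = 20, 0.25
mu = n * p                        # 5
sigma = (n * p * (1 - p)) ** 0.5  # sqrt(3.75)

# Exact binomial tail P(X >= 8)
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(8, n + 1))

z = NormalDist()
no_correction = 1 - z.cdf((8 - mu) / sigma)       # uses 8 directly
with_correction = 1 - z.cdf((7.5 - mu) / sigma)   # continuity correction: 7.5

print(round(exact, 4), round(no_correction, 4), round(with_correction, 4))
```

The corrected approximation (about 0.098) lands much closer to the exact tail (about 0.102) than the uncorrected one (about 0.061).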
Example 13 (Continuity correction in normal approximation of binomial): A recent study by
a marketing research firm showed that 15% of American households owned a video camera. For
a sample of 200 homes, how many of the homes would you expect to have video cameras, and
what is the probability that fewer than 40 of them do?
1. Compute the mean: µ = nπ = 200× .15 = 30.
2. Compute the variance: σ² = nπ(1 − π) = 200 × .15 × (1 − .15) = 25.5. The standard deviation
is σ = √25.5 = 5.0498.
3. “Less than 40” means “less than or equal to 39”. With the continuity correction factor, we use X = 39.5. Hence,
P (X < 39.5) = P [(X − 30)/5.0498 < (39.5 − 30)/5.0498] = P [Z < 1.88] = P [Z < 0] + P [0 <
Z < 1.88] = .5 + .4699 = .9699.
Thus, the likelihood that fewer than 40 of the 200 homes have a video camera is about 97%.
7 The exponential distribution
The exponential distribution is often used to model the length of time between the occurrences of two events or
between two occurrences of the same event (the time between arrivals).
Example 14 (Exponential distribution):
1. Time taken for your instructor to respond to your email.
2. Time between the birth of two babies.
3. Time taken to find a new job since layoff – the so-called unemployment spell.
4. Time to complete a wage bargaining between a company and a labor union.
5. Time to complete the accession to WTO.
6. Time between two major floods in Sichuan, China.
7. Time taken for the police to solve a crime case.
8. Time taken to obtain a job promotion.
Definition 13 (Exponential distribution):
The exponential random variable T (T > 0) has probability density function

f(t) = λe^(−λt) for t > 0

where λ is the mean number of occurrences per unit time, t is the length of time until the next
occurrence, and e = 2.71828....

The cumulative distribution function (the probability that an arrival time is less than some
specified time t) is

F (t) = P (T < t) = 1 − e^(−λt)

The mean and variance of an exponential random variable are

E(T ) = 1/λ
V ar(T ) = 1/λ²

Note that the exponential distribution requires only one parameter, the rate λ (lambda); its mean is 1/λ.
What are the other distributions that are completely characterized by one parameter?
Example 15 (Exponential distribution): Customers arrive at the service counter at the rate of
15 per hour. What is the probability that the arrival time between consecutive customers is less
than three minutes?
Let T be the arrival time between consecutive customers. The mean number of arrivals per hour
is 15, so λ = 15. Three minutes is .05 hours. Hence, we have
P (T < .05) = 1 − e^(−λt) = 1 − e^(−(15)(.05)) = 0.5276
So there is a 52.76% probability that the arrival time between successive customers is less than
three minutes.
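The exponential CDF is simple enough to evaluate directly; a sketch with λ and t from the example:

```python
from math import exp

lam = 15.0          # arrival rate: 15 customers per hour
t = 3 / 60          # three minutes expressed in hours

# P(T < t) = 1 - e^{-lambda * t}
p = 1 - exp(-lam * t)
print(round(p, 4))  # about 0.5276
```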
Example 16 (Exponential distribution): On average, it takes about seven years to accede to
the GATT/WTO. What is the probability that a country takes more than 15 years to accede to
GATT/WTO?
Let T be the time taken. Since E(T ) = 1/λ = 7, we have λ = 1/7. Hence,

P (T < 15) = 1 − e^(−λt) = 1 − e^(−(1/7)(15)) = 0.883
P (T ≥ 15) = 1 − P (T < 15) = e^(−(1/7)(15)) = 0.117

Thus, there is an 11.7% probability that it takes more than 15 years to accede to the GATT/WTO.³
[Still need to write a lot more numerical examples for different sections!!]
[Plot some figures associated with the numerical examples!!]
³ It took China 15 years to accede to the GATT/WTO.
References
[1] Davidson, Russell, and James G. MacKinnon (1993): Estimation and Inference in Econometrics. Oxford
University Press.
Mathematical appendix (A brief introduction to integration)
Integration is a useful tool for computing the area under a curve and above zero. Suppose we have a curve
defined by the function f(x), and we want to compute the area under f(x) and above 0 between a and b.
Let's consider several cases.

1. Let's start with the simplest case, where f(x) is a constant, i.e., f(x) = k. In this case, the area is
a rectangle. It is easy to compute the area as (b − a) × k.

2. Suppose f(x) = k1 for x ∈ [a, c1] and f(x) = k2 for x ∈ [c1, b]. In this case, the area is a sum
of two rectangles. It is still easy to compute the area as (c1 − a) × k1 + (b − c1) × k2.

3. Suppose f(x) = k1 for x ∈ [a, c1], f(x) = k2 for x ∈ [c1, c2], and f(x) = k3 for x ∈ [c2, b]. In
this case, the area is a sum of three rectangles. It is still easy to compute the area as
(c1 − a) × k1 + (c2 − c1) × k2 + (b − c2) × k3.

The three examples illustrate how one can compute the area when it is a combination of rectangles.
For a general f(x), we can approximate the area by a sum of rectangles. Let's define

g(x) = k1 for x ∈ [a, a + dx]
g(x) = k2 for x ∈ [a + dx, a + 2dx]
g(x) = k3 for x ∈ [a + 2dx, a + 3dx]
...

or, in general,

g(x) = km for x ∈ [a + (m − 1)dx, a + m dx]

If g(x) approximates f(x) well, the approximated area is a sum of rectangles:

k1 × dx + k2 × dx + k3 × dx + ...

What are the values of k1, k2, ...? One possibility is to take km = f(a + (m − 1)dx). That is,

g(x) = f(a + (m − 1)dx) for x ∈ [a + (m − 1)dx, a + m dx]
When the width of each rectangle is small, it does not matter if we take f(a + (m− 1)dx) or f(a + mdx) or
any point in [a + (m− 1)dx, a + mdx].
Thus, the area may be approximated by

f(a) × dx + f(a + dx) × dx + f(a + 2dx) × dx + ...

The approximation improves as dx approaches zero, in which case the area becomes a sum of infinitely
many very small rectangles. We write the area as

∫_{a}^{b} f(x) dx

Mathematicians have developed neat ways to compute such areas exactly. All we need is to borrow those
formulas from them.
Theorem 9 (Some integration formulas):

1. Let f(x) = k, a constant (such as the density of a uniform distribution). The area under
the curve f(x) over the interval [a, b] is

∫_{a}^{b} f(x) dx = k ∫_{a}^{b} dx = k(x|_b − x|_a) = k(b − a)

where g(x)|_a denotes g(a).

2. Let f(x) = kx, a linear function (such as the product of the random variable and the density of
a uniform distribution). The area under the curve f(x) over the interval [a, b] is

∫_{a}^{b} f(x) dx = k ∫_{a}^{b} x dx = k(x²/2 |_b − x²/2 |_a) = k(b²/2 − a²/2)

where g(x)|_a denotes g(a).
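Both formulas can be verified with the rectangle approximation described above; a sketch with illustrative values k = 2, a = 1, b = 4 (our choices):

```python
# Midpoint Riemann sums for the two integrals in Theorem 9
k, a, b = 2.0, 1.0, 4.0
n = 100_000
dx = (b - a) / n
mids = [a + (i + 0.5) * dx for i in range(n)]  # midpoint of each small rectangle

area_const = sum(k * dx for _ in mids)         # integral of k dx over [a, b]
area_linear = sum(k * x * dx for x in mids)    # integral of k x dx over [a, b]

print(round(area_const, 4))   # k(b - a) = 6.0
print(round(area_linear, 4))  # k(b^2/2 - a^2/2) = 15.0
```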
Problem sets
We have tried to include some of the most important examples in the text. To get a good understanding of
the concepts, it is most useful to re-do the examples and simulations in the text. Work on the following
problems only if you need extra practice or if your instructor assigns them. Of course, the
more you work on these problems, the more you learn.
Challenge 1 (mean and variance of a uniform distribution): Show that the mean and
variance of a uniform random variable X on [c, d] are

E(X) = (c + d)/2

and

V (X) = (c − d)²/12
Challenge 2 (memorylessness of the exponential distribution): Show that the exponential
distribution is memoryless, i.e., P (T > s + t|T > s) = P (T > t) for all s, t ≥ 0.
Challenge 3 (mean and variance of a univariate uniform distribution): Let X be
uniformly distributed on the interval [a, b]. Find E(X) and V ar(X).
Challenge 4 (mean and variance of a bivariate uniform distribution): Let (X, Y ) be
jointly uniformly distributed on the rectangle [a, b, c, d]. Find E(X) and V ar(X).
Solutions to problem set
1. It is straightforward to guess that the mean of a uniform distribution is simply (c + d)/2, because
the density is the same at every point of the interval (c, d), i.e., f(x) ≡ 1/(d − c) for x ∈ (c, d).
However, we can still prove it mathematically:

E(X) = ∫_{c}^{d} x f(x) dx
= ∫_{c}^{d} [x/(d − c)] dx
= [1/(d − c)] ∫_{c}^{d} x dx
= [1/(d − c)] × x²/2 |_{c}^{d}
= [1/(d − c)] × (d − c)(d + c)/2
= (c + d)/2
and the variance:

V (X) = ∫_{c}^{d} (x − µ)² f(x) dx
= [1/(d − c)] ∫_{c}^{d} (x² − 2µx + µ²) dx
= [1/(d − c)] × (x³/3 − µx² + µ²x)|_{c}^{d}
= (d² + cd + c²)/3 − µ(c + d) + µ²
= (d² + cd + c²)/3 − (c + d)²/2 + (c + d)²/4
= [(4d² + 4cd + 4c²) − (6c² + 12cd + 6d²) + (3c² + 6cd + 3d²)]/12
= (c − d)²/12
2. “Memorylessness” is a property of the exponential distribution. For example, suppose that at a specific
crossroad, traffic accidents occur at a rate of λ per hour. (a) What is the probability that no accident
happens in the next t hours? (b) Given that no accident has happened in the past s hours, what is the
probability that no accident happens in the next t hours?

(a) P (T > t) = 1 − (1 − e^(−λt)) = e^(−λt);
(b) P (T > s + t|T > s) = [1 − (1 − e^(−λ(s+t)))]/[1 − (1 − e^(−λs))] = e^(−λ(s+t))/e^(−λs) = e^(−λt).

We find that P (T > t) ≡ P (T > s + t|T > s), which means that how long we have already waited
without an accident does not matter for the probability of an accident in the future – in other
words, the distribution is memoryless. (A similar property holds for the geometric distribution.)
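Memorylessness can also be seen by simulation: among exponential draws that exceed s, the probability of exceeding s + t matches the unconditional probability of exceeding t. A sketch; the rate, s, t, and seed are illustrative choices:

```python
import random

random.seed(3)
lam = 0.5  # rate parameter of the exponential distribution
draws = [random.expovariate(lam) for _ in range(200_000)]

s, t = 1.0, 2.0
# Unconditional: P(T > t)
p_t = sum(1 for d in draws if d > t) / len(draws)
# Conditional: P(T > s + t | T > s), estimated among the draws exceeding s
survivors = [d for d in draws if d > s]
p_cond = sum(1 for d in survivors if d > s + t) / len(survivors)

print(round(p_t, 3), round(p_cond, 3))  # both near e^{-lambda t} = e^{-1} = 0.368
```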
3. Refer to 1.
4. We may reasonably guess that the joint density of the bivariate uniform distribution is constant,
i.e., f(x, y) ≡ 1/[(b − a)(d − c)] for x ∈ (a, b), y ∈ (c, d). Then we can compute the expectation:

E(X) = E[E(X|Y )]
= E[∫_{a}^{b} x f(x|y) dx]
= E[(a + b)/2]
= ∫_{c}^{d} [(a + b)/2] f(y) dy
= (a + b)/2

since the conditional distribution of X given Y = y is uniform on (a, b) and ∫_{c}^{d} f(y) dy = 1.
By the same argument as in Solution 1, the variance is

V (X) = (b − a)²/12