Continuous Probability Distributions
Ka-fu WONG
23 August 2007
Abstract
In the previous chapter, we noted that many continuous random variables can be approximated by discrete random variables. At the same time, many discrete random variables may be approximated by continuous random variables. Thus, we study continuous probability distributions not just for the sake of understanding continuous random variables, but also for the sake of understanding their approximation to discrete random variables. In many important cases, it turns out that the approximation is good enough, and continuous probability distributions are easier to work with, provided we know enough about them. Among continuous probability distributions, the normal distribution is the most important. The normal distribution will be used over and over again in later chapters.
The most difficult part about continuous probability distributions is understanding their connection with the discrete ones. Once this is done, many results about discrete probability distributions can be easily extended to the continuous case.
A continuous random variable can assume an infinite, uncountable number of values within a given range
or interval(s). Recall that a discrete random variable can also take an infinite number of values. A continuous
random variable differs in that the number of values it can take is uncountable, i.e., it is impossible to list all the
values it can take. For instance, it is not possible to list all the real numbers within the interval [0, 1]. The
probability distributions associated with continuous random variables are logically called continuous probability
distributions.
Example 1 (Continuous random variables): A continuous random variable is a variable that can
assume any value in an interval. Some variables are continuous in nature. For example:
1. The thickness of our Microeconomics textbook.
2. The time required to complete a homework assignment.
3. The temperature of a patient.
4. The distance travelled from my home to school.
Other variables are continuous because they are averages of discrete random variables.
1. Gini coefficient, a measure of inequality in an economy.
2. Unemployment rate.
3. Inflation rate.
4. A stock market index, such as Hang Seng Index, Dow Jones Averages, Nasdaq, and S&P
500.
5. A student’s grade point average.
6. Average age of students in this class.
7. Average weekly working hours of employees.
8. Average hourly salary of students working part-time.
These variables can potentially take on any value, depending only on our ability to measure
and report them accurately.
The above examples suggest that averages are better characterized as continuous rather than discrete
variables even if the underlying variables used for the mean calculations were discrete.
Conceptually, a continuous probability distribution should be very similar to a discrete probability
distribution. After all, the two types of probability distributions may be viewed as approximations of each other.
The main difference is that the probability that a continuous random variable takes a specific value is
zero. However, the probability that a continuous random variable takes a value in an interval [a, b]
can be positive. This difference leads to changes in the definitions and calculations.
1 Features of a Continuous Probability Distribution
Probability distributions may be classified according to the number of random variables they describe.

Number of random variables   Name of the distribution
1                            Univariate probability distribution
2                            Bivariate probability distribution
3                            Trivariate probability distribution
...                          ...
n                            Multivariate probability distribution
These distributions have similar characteristics. We will discuss these characteristics for the univariate
and the bivariate distributions. The extension to the multivariate case is straightforward.
Theorem 1 (Characteristics of a Univariate Continuous Distribution): Suppose the random
variable X is defined on the interval between a and b, i.e., X ∈ [a, b]. That is, X can take any
value in [a, b].

1. The probability that X takes a value in an interval [c, d] is

   P(X ∈ [c, d]) = ∫_c^d f(x) dx

   where f(x) denotes the probability density function of X. Note that the expression has an
   interpretation parallel to the discrete case: f(x)dx may be interpreted as the probability, or
   the area under the density curve, defined in the neighborhood of x, say [x, x + dx].

2. The probability density function f(x) is non-negative and may be larger than 1.

3. P(X ∈ [c, d]) = ∫_c^d f(x) dx is between 0 and 1.00, i.e., in [0, 1].

4. The sum of the probabilities of the various outcomes is 1.00. That is,

   P(X ∈ (−∞, ∞)) = P(X ∈ [a, b]) = ∫_a^b f(x) dx = 1

5. Let the events defined on two non-overlapping intervals, [c1, d1] and [c2, d2], be X ∈
   [c1, d1] and X ∈ [c2, d2]. These two events are mutually exclusive. That is,

   P(X ∈ [c1, d1] and X ∈ [c2, d2]) = 0,
   P(X ∈ [c1, d1] or X ∈ [c2, d2]) = P(X ∈ [c1, d1]) + P(X ∈ [c2, d2]).
The use of integration (∫) in the definition could be terrifying, especially for students who have never
studied it before. We are terrified only because we do not know there is a simple connection between the
discrete probability distribution and the continuous probability distribution. Let us make the connection
below and leave the introduction of integration to a mathematical appendix.
2 Making a connection between discrete and continuous distributions
To understand the difference between the two types of distributions, let’s start with a series of questions
(from simple to complicated), of which we can easily derive good answers.
2.1 Imagine throwing a dart at [0, 1]
Consider a continuous random variable that is defined over a segment of line [0, 1]. We can imagine throwing
a dart at the segment and each point in the segment has equal chance of being hit by the dart.
1. What is the probability that a dart randomly thrown will end up in the segment [0, 1]? The answer is
simple. Since the dart has to land somewhere on [0, 1], the probability of the dart landing on the
segment [0, 1] is 1.
2. What is the probability that a dart randomly thrown will end up in the segment [0, 1/2]? The answer is
slightly more difficult. Since the dart has equal chance to land on any point on [0, 1], the probability
of having the dart landing on half of the line [0, 1], i.e., the segment [0, 1/2], has to be 1/2.
3. What is the probability that a dart randomly thrown will end up in the segment [0, 1/4]? The answer is
slightly more difficult. Since the dart has equal chance to land on any point on [0, 1], the probability
of having the dart landing on a quarter of the line [0, 1], i.e., the segment [0, 1/4], has to be 1/4.
4. What is the probability that a dart randomly thrown will end up in the segment [0, 1/8]? The answer is
slightly more difficult. Since the dart has equal chance to land on any point on [0, 1], the probability
of having the dart landing on an eighth of the line [0, 1], i.e., the segment [0, 1/8], has to be 1/8.
5. What is the probability that a dart randomly thrown will end up in the segment [2/8, 3/8]? The answer is
slightly more difficult. The length of the segment is 1/8, the same as in the last question. Since the dart
has equal chance to land on any point on [0, 1], the probability of having the dart landing on an eighth
of the line [0, 1], i.e., the segment [2/8, 3/8], has to be 1/8.
6. What is the probability that a dart randomly thrown will end up in the segment [2/8, 3/8] AND [5/8, 6/8]?
The answer is not difficult. We have two non-overlapping segments of equal length. It is impossible for
any throw to land on both of these two non-overlapping segments. Thus, the probability should be 0.
7. What is the probability that a dart randomly thrown will end up in the segment [2/8, 3/8] OR [5/8, 6/8]?
The answer is slightly more difficult. We have two non-overlapping segments of equal length. The
probability should equal the sum of the probability of landing on the segment [2/8, 3/8] and the
probability of landing on the segment [5/8, 6/8]. That is, it should be 1/8 + 1/8 = 2/8.
8. What is the probability that a dart randomly thrown will end up in the segment [0, k], where 0 < k < 1? The
answer is slightly more difficult. As we learned from the previous discussions, there is a 1/2 probability
of having the dart landing on the segment [0, 1/2], 1/4 on the segment [0, 1/4], and 1/8 on the segment [0, 1/8].
It is not too difficult to induce that the probability of having the dart landing on the segment [0, k] is
simply k.
9. What is the probability that a dart randomly thrown will end up in the segment [k1, k2], where k1 < k2?
The answer is slightly more difficult. It is not too difficult to induce that the probability of having the
dart landing on the segment [k1, k2] is simply k2 − k1.
10. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] AND [k3, k4],
where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. It is
impossible for any throw to land on both of these two non-overlapping segments. Thus, the probability
should be 0.
11. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] OR [k3, k4],
where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. The
probability should equal the sum of the probability of landing on the segment [k1, k2] and the
probability of landing on the segment [k3, k4]. That is, it should be (k2 − k1) + (k4 − k3).
12. What is the probability that a dart randomly thrown will end up exactly at a single point 2/3? The
answer is slightly more difficult. There are infinite uncountable points on the entire segment [0, 1]. The
probability of the dart ending up at the point 2/3 is like the probability of the dart ending up in an
interval with zero length. That is, it should be zero.
13. What is the probability that a dart randomly thrown will end up exactly at one of the two points, 1/3
or 2/3? The answer is slightly more difficult. Since the two points are non-overlapping, the probability
should be the sum of the individual ones, i.e., 0 = 0 + 0.
14. What is the probability that a dart randomly thrown will end up exactly at one of the 99 points, 1/100,
2/100, ..., or 99/100? The answer is simple. Since the 99 points are non-overlapping, the probability
should be the sum of the individual ones, i.e., 0.
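These answers can be checked by simulation. The sketch below (an illustration added here, not part of the original notes) throws a large number of darts at [0, 1] using Python's `random.random` and counts where they land:

```python
import random

random.seed(0)
n = 1_000_000
darts = [random.random() for _ in range(n)]  # each throw lands uniformly on [0, 1]

# P(dart lands in [2/8, 3/8]) should be close to 1/8 = 0.125
p_segment = sum(2/8 <= x <= 3/8 for x in darts) / n

# P(dart lands exactly on the single point 2/3) should be 0
p_point = sum(x == 2/3 for x in darts) / n
```

With a million throws the segment frequency comes out very close to 1/8, while no throw lands exactly on the single point 2/3.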
2.2 Imagine throwing a dart at [a, b]
Let’s repeat the above questions and answers with a slight change. Consider a continuous random variable
that is defined over a segment of line [a, b], where a < b. We can imagine throwing a dart at the segment
and each point in the segment has equal chance of being hit by the dart.
1. What is the probability that a dart randomly thrown will end up in the segment [a, b]? The answer is
simple. Since the dart has to land somewhere on [a, b], the probability of the dart landing on the
segment [a,b] is 1.
2. What is the probability that a dart randomly thrown will end up in the segment [a, a + (1/2)(b − a)]? The
answer is slightly more difficult. Since the dart has equal chance to land on any point on [a, b], the
probability of having the dart landing on half of the line [a, b], i.e., the segment [a, a + (1/2)(b − a)], has
to be 1/2.
3. What is the probability that a dart randomly thrown will end up in the segment [a, a + (1/4)(b − a)]? The
answer is slightly more difficult. Since the dart has equal chance to land on any point on [a, b], the
probability of having the dart landing on a quarter of the line [a, b], i.e., the segment [a, a + (1/4)(b − a)],
has to be 1/4.
4. What is the probability that a dart randomly thrown will end up in the segment [a, a + (1/8)(b − a)]? The
answer is slightly more difficult. Since the dart has equal chance to land on any point on [a, b], the
probability of having the dart landing on an eighth of the line [a, b], i.e., the segment [a, a + (1/8)(b − a)],
has to be 1/8.
5. What is the probability that a dart randomly thrown will end up in the segment [a + (2/8)(b − a), a + (3/8)(b − a)]?
The answer is slightly more difficult. The length of the segment is one eighth of [a, b], the same as in the last question.
Since the dart has equal chance to land on any point on [a, b], the probability of having the dart landing
on an eighth of the line [a, b], i.e., the segment [a + (2/8)(b − a), a + (3/8)(b − a)], has to be 1/8.
6. What is the probability that a dart randomly thrown will end up in the segment [a + (2/8)(b − a), a + (3/8)(b − a)]
AND [a + (5/8)(b − a), a + (6/8)(b − a)]? The answer is not difficult. We have two non-overlapping segments
of equal length. It is impossible for any throw to land on both of these two non-overlapping segments.
Thus, the probability should be 0.
7. What is the probability that a dart randomly thrown will end up in the segment [a + (2/8)(b − a), a + (3/8)(b − a)]
OR [a + (5/8)(b − a), a + (6/8)(b − a)]? The answer is slightly more difficult. We have two non-overlapping
segments of equal length. The probability should equal the sum of the probability of landing on the
segment [a + (2/8)(b − a), a + (3/8)(b − a)] and the probability of landing on the segment [a + (5/8)(b − a), a + (6/8)(b − a)].
That is, it should be 1/8 + 1/8 = 2/8.
8. What is the probability that a dart randomly thrown will end up in the segment [a, k], where a < k < b?
The answer is slightly more difficult. As we learned from the previous discussions, it is not too difficult
to induce that the probability of having the dart landing on the segment [a, k] is simply (k − a)/(b − a).
9. What is the probability that a dart randomly thrown will end up in the segment [k1, k2], where k1 < k2?
The answer is slightly more difficult. It is not too difficult to induce that the probability of having the
dart landing on the segment [k1, k2] is simply (k2 − k1)/(b− a).
10. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] AND [k3, k4],
where k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. It is
impossible for any throw to land on both of these two non-overlapping segments. Thus, the probability
should be 0.
11. What is the probability that a dart randomly thrown will end up in the segment [k1, k2] OR [k3, k4], where
k1 < k2 < k3 < k4? The answer is not difficult. We have two non-overlapping segments. The
probability should equal the sum of the probability of landing on the segment [k1, k2] and the
probability of landing on the segment [k3, k4]. That is, it should be [(k2 − k1) + (k4 − k3)]/(b − a).
12. What is the probability that a dart randomly thrown will end up exactly at a single point k1? The
answer is slightly more difficult. There are infinite uncountable points on the entire segment [a, b]. The
probability of the dart ending up at the point k1 is like the probability of the dart ending up in an
interval with zero length. That is, it should be zero.
13. What is the probability that a dart randomly thrown will end up exactly at one of the two points, k1
or k2? The answer is slightly more difficult. Since the two points are non-overlapping, the probability
should be the sum of the individual ones, i.e., 0 = 0 + 0.
14. What is the probability that a dart randomly thrown will end up exactly at one of the 99 points, k1,
k2, ..., or k99? The answer is simple. Since the 99 points are non-overlapping, the probability should
be the sum of the individual ones, i.e., 0.
2.3 Deriving the probability density function (pdf)
Based on what we know about discrete probability distributions, it might appear that there is an inconsistency
between the notions P(x ∈ [k1, k2]) = (k2 − k1)/(b − a) and P(x = k) = 0. Can we write P(x ∈ [k1, k2])
as a sum of probabilities of individual events? Yes, with the introduction of the density concept. That is,
we would like to define a density c such that P(x ∈ [a, k]) = (k − a) × c. What would c be? We know
that c must also satisfy P(x ∈ [a, b]) = (b − a) × c = 1, implying c = 1/(b − a). It is not too difficult
to check that the density so defined will generate results consistent with the discussions above.
In particular, P(x ∈ [a, k]) = (k − a) × c = (k − a)/(b − a), P(x ∈ [k1, k2]) = (k2 − k1)/(b − a), and
P(x = k) = P(x ∈ [k, k]) = (k − k)/(b − a) = 0.
What we have just discussed is the so-called uniform distribution over the interval [a, b].
Definition 1 (Uniform distribution): If a and b are numbers on the real line, the random variable
X ∼ U(a, b), i.e., has a uniform distribution, if the density function is

f(x) = 1/(b − a)   for a ≤ x ≤ b
f(x) = 0           otherwise

and the cumulative distribution function (cdf) is

F(x) = Prob(X < x) = 0                 for x ≤ a
F(x) = Prob(X < x) = (x − a)/(b − a)   for a ≤ x ≤ b
F(x) = Prob(X < x) = 1                 for b ≤ x
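Definition 1 translates directly into code. The sketch below is an illustration (the function names are ours, not the text's); it implements the pdf and cdf of U(a, b) and reproduces the dart-throwing answers:

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): constant 1/(b - a) inside [a, b], zero outside."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """F(x) = P(X < x) for X ~ U(a, b)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def uniform_prob(k1, k2, a, b):
    """P(X in [k1, k2]) = F(k2) - F(k1) = (k2 - k1)/(b - a) inside [a, b]."""
    return uniform_cdf(k2, a, b) - uniform_cdf(k1, a, b)
```

For instance, `uniform_prob(2/8, 3/8, 0, 1)` gives 1/8, and `uniform_prob(k, k, a, b)` gives 0, matching the dart answers.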
Recall that in the discrete case, the expected value of a random variable X with probability mass function
P(X) is

E(X) = Σ_X X P(X)

In the continuous case, P(X = k) = 0, so how do we compute the expected value? We can use
an approximation. Since P(X ∈ [k, k + dx]) > 0, it is possible to imagine the relevant segment divided
into many small intervals of length dx ([a, a + dx], [a + dx, a + 2dx], ..., [a + ((b−a)/dx − 1)dx, a + ((b−a)/dx)dx]),
and to obtain an approximation with the formula above by replacing P(X = k) with P(X ∈ [k, k + dx]) and
replacing X with one of three possibilities:
1. the lower bound of [k, k + dx], i.e., k:

   E(X) ≈ Σ_{i=1}^{(b−a)/dx} (a + (i − 1)dx) P(X ∈ [a + (i − 1)dx, a + i dx]) = Σ_{i=1}^{(b−a)/dx} (a + (i − 1)dx) × c × dx

2. the upper bound of [k, k + dx], i.e., k + dx:

   E(X) ≈ Σ_{i=1}^{(b−a)/dx} (a + i dx) P(X ∈ [a + (i − 1)dx, a + i dx]) = Σ_{i=1}^{(b−a)/dx} (a + i dx) × c × dx

3. the mid-point of [k, k + dx], i.e., k + dx/2:

   E(X) ≈ Σ_{i=1}^{(b−a)/dx} (a + (i − 1/2)dx) P(X ∈ [a + (i − 1)dx, a + i dx]) = Σ_{i=1}^{(b−a)/dx} (a + (i − 1/2)dx) × c × dx
The accuracy of such an approximation depends on dx. Generally, the smaller dx is, the more accurate the
approximation. To obtain a more accurate approximation, we can imagine dx shrinking towards zero. In
this case the number of terms being added expands to infinity:

E(X) = lim_{dx→0} Σ_{i=1}^{(b−a)/dx} (a + (i − 1)dx) × c × dx = lim_{dx→0} Σ_{i=1}^{(b−a)/dx} x_i × c × dx = ∫_a^b x c dx
[Figure: the uniform density y = f(x), with height c = 1/(b − a) over [a, b], subdivided into intervals [x_i, x_i + dx] of width dx; each rectangle corresponds to one term of the sum.]
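The shrinking-dx argument can be watched numerically. The sketch below (an illustration, using the uniform density c = 1/(b − a)) evaluates the lower-bound sum for smaller and smaller dx; the values approach the integral ∫_a^b x c dx = (a + b)/2:

```python
def expected_value_approx(a, b, n):
    """Lower-bound Riemann sum for E(X) of U(a, b), with n intervals of width dx."""
    dx = (b - a) / n
    c = 1 / (b - a)  # uniform density
    return sum((a + i * dx) * c * dx for i in range(n))

# For U(0, 1) the sums approach (a + b)/2 = 0.5 as dx shrinks (n grows)
approx = [expected_value_approx(0.0, 1.0, n) for n in (10, 100, 1000)]
```

The three sums work out to roughly 0.45, 0.495, and 0.4995, creeping up toward 0.5 exactly as the limiting argument predicts.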
In the expression above, we are really using the integral sign to stand for the limit of the sum:

lim_{dx→0} Σ_{i=1}^{(b−a)/dx} = ∫_a^b
In the above discussion, we considered only the case in which the dart lands anywhere on the interval [a, b] with
equal chance, i.e., the uniform distribution. The discussion can be easily extended to other situations, with
the interval [a, b] subdivided into many sub-intervals and the density held constant within each sub-interval.
Let each sub-interval be of length dx as before, so that we have the intervals [a, a + dx], [a + dx, a + 2dx], ...,
[a + ((b−a)/dx − 1)dx, a + ((b−a)/dx)dx], and the corresponding densities in the intervals can be labelled
f(x_1), ..., f(x_n), where n = (b − a)/dx. With this information, we can answer many questions similar to
those discussed earlier. Again, in the limiting case with dx approaching zero, the probability is defined
as

P(x ∈ [k1, k2]) = lim_{dx→0} Σ_{[k1,k2]} f_i dx = ∫_{k1}^{k2} f(x) dx
More generally, we can define the cumulative distribution function (cdf) and use the cdf to compute
probabilities.

Definition 2 (Cumulative distribution function (cdf)): Let f(x) be the pdf of a continuous
random variable X. The cumulative distribution function (cdf) is

F(x) = P(X < x) = ∫_{−∞}^{x} f(t) dt

so that

P(x ∈ [k1, k2]) = ∫_{k1}^{k2} f(x) dx = ∫_{−∞}^{k2} f(x) dx − ∫_{−∞}^{k1} f(x) dx = F(k2) − F(k1)

Given the cdf F(x), we can also derive the pdf f(x):

f(x) = (d/dx) F(x)

where d/dx means “differentiate with respect to x”.
We can check that such definitions are consistent at least with the uniform distribution. In fact, they hold for
any continuous probability distribution.
The above discussions suggest that the density function defines the probability distribution in the con-
tinuous case.
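The same limiting sum works for any density, not just the uniform one. As an illustration (the density f(x) = 2x on [0, 1] is our own example, chosen because it integrates to 1 and has the simple cdf F(x) = x²), a midpoint sum approximates P(x ∈ [k1, k2]) well:

```python
def prob_interval(f, k1, k2, n=10_000):
    """Approximate P(X in [k1, k2]) = integral of f over [k1, k2] by a midpoint sum."""
    dx = (k2 - k1) / n
    return sum(f(k1 + (i + 0.5) * dx) * dx for i in range(n))

# Assumed density f(x) = 2x on [0, 1]; its cdf is F(x) = x**2
p = prob_interval(lambda x: 2 * x, 0.2, 0.6)  # exact value: F(0.6) - F(0.2) = 0.32
```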
Example 2 (Part-time Work on Campus): A student has been offered part-time work in a
laboratory. The professor says that the work will vary from week to week. The number of hours
will be between 10 and 20 with a uniform probability density function:
1. How tall is the rectangle?
2. What is the probability of getting less than 15 hours in a week?
3. Given that the student gets at least 15 hours in a week, what is the probability that more
than 17.5 hours will be available?
[Figure: rectangular pdf y = f(x) with height c = 0.1 over the interval from 10 to 20.]
Because the probability is uniformly distributed, the pdf can be illustrated as a rectangle, as above,
and the height of the rectangle is the uniform density, i.e., 1/(20 − 10) = 0.1. If we
employ the same letters we used previously, we have a = 10, b = 20, and c = 0.1. The pdf is

f(x) = 1/(20 − 10) = 0.1   for 10 ≤ x ≤ 20
f(x) = 0                   otherwise
and the cdf is

F(x) = Prob(X < x) = 0                     for x ≤ 10
F(x) = Prob(X < x) = (x − 10)/(20 − 10)    for 10 ≤ x ≤ 20
F(x) = Prob(X < x) = 1                     for 20 ≤ x

Thus Prob(X < 15) = (15 − 10)/(20 − 10) = 0.5, and

Prob(X > 17.5 | X > 15) = Prob(X > 17.5 & X > 15)/Prob(X > 15) = Prob(X > 17.5)/Prob(X > 15) = (20 − 17.5)/(20 − 15) = 0.5.
Remember that the probability of a continuous random variable being equal to an exact value
is always zero. Therefore we are relieved from worrying about the boundaries of
intervals, i.e., whether to use “>” or “≥”.
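The arithmetic of this example can be reproduced with the cdf of U(10, 20). A minimal sketch (an illustration, not part of the original):

```python
def F(x, a=10, b=20):
    """cdf of the U(10, 20) weekly-hours distribution in Example 2."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

height = 1 / (20 - 10)                # 1. height of the rectangle: 0.1
p_less_15 = F(15)                     # 2. P(X < 15) = 0.5
# 3. P(X > 17.5 | X > 15) = P(X > 17.5) / P(X > 15) = 0.25 / 0.5
p_cond = (1 - F(17.5)) / (1 - F(15))
```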
Example 3 (Customer Complaints): You are the manager of the complaint department for a
large mail order company. Your data and experience indicate that the time it takes to handle a
single call, denoted T, ranges from 0 to 15 minutes and has a right-triangle
probability density function with a height of 2/15.
1. Show that the area under the triangle is 1.
2. Find the probability that a call will take longer than 10 minutes. That is, find P (T > 10).
3. Given that the call takes at least 5 minutes, what is the probability that it will take longer
than 10 minutes? That is, find P (T > 10|T > 5).
4. Find P (T < 10).
We know the area under the pdf curve is the probability, and it must be 1 because the outcomes from 0 to 15
cover all possibilities. The pdf should look as follows if we believe the
company tries to handle all calls as soon as possible:
[Figure: triangular pdf y = f(t), with height 2/15 at t = 0, declining linearly to 0 at t = 15.]
P(T > 10) should be the area under the pdf curve where 10 < t < 15. Basic geometry gives
that area as 1/9, i.e., P(T > 10) = 1/9 = 0.1111.

P(T > 10 | T > 5) = P(T > 10 & T > 5)/P(T > 5) = P(T > 10)/P(T > 5) = (1/9)/(4/9) = 0.25.

P(T < 10) = 1 − P(T > 10) = 1 − 1/9 = 8/9 = 0.8889.
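The triangle areas can be recomputed directly. The sketch below is an illustration; it assumes the density is the descending line f(t) = (2/15)(1 − t/15) implied by the figure:

```python
def f(t):
    """Assumed descending triangular density on [0, 15], height 2/15 at t = 0."""
    return (2 / 15) * (1 - t / 15) if 0 <= t <= 15 else 0.0

def tail_prob(t):
    """P(T > t): area of the similar triangle with base (15 - t) and height f(t)."""
    return 0.5 * (15 - t) * f(t)

total_area = tail_prob(0)              # 1. whole triangle: 0.5 * 15 * (2/15) = 1
p_gt_10 = tail_prob(10)                # 2. P(T > 10) = 1/9
p_cond = tail_prob(10) / tail_prob(5)  # 3. P(T > 10 | T > 5) = (1/9)/(4/9) = 0.25
p_lt_10 = 1 - tail_prob(10)            # 4. P(T < 10) = 8/9
```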
3 Normal distributions
The most popular distribution we will use after this chapter is the normal distribution. It pays to understand it
very well.
Definition 3 (Normal distribution): The random variable X ∼ N(µ, σ²), i.e., has a normal
distribution, if the random variable is defined on the whole real line (−∞, ∞) and has the
density function

f(x) = (1/(σ√(2π))) e^(−z²/2),   where   z = (x − µ)/σ

and µ and σ are the mean and standard deviation of the random variable (hence σ² is the
variance), π = 3.14159..., and e = 2.71828... is the base of natural or Napierian logarithms.
Normal distributions are characterized by their mean and variance. Often, a normal random
variable, X, distributed as normal with mean µ and variance σ² will be denoted as

X ∼ N(µ, σ²)

where “∼” reads “distributed as”.
The normal distribution has several main characteristics:

1. It is bell-shaped and single-peaked (unimodal) at the exact center of the distribution, µ.

2. It is symmetrical about its mean. The arithmetic mean, median, and mode of the distribution
   are equal and located at the peak. Thus half the area under the curve is above the mean
   and half is below it.

3. The normal probability distribution is asymptotic. That is, the curve gets closer and closer
   to the X-axis but never actually touches it.
[Figure: normal density curves for N(0, 0.3), N(0, 0.5), N(0, 1), N(0, 2), N(−1, 0.5), and N(3, 0.5), plotted for x from −5 to 5.]
Given the density function described above, and unlike the uniform distribution, it is not easy to integrate to
obtain the probability of a normal random variable lying within a segment. It is not too difficult to get Excel
to do the calculation, even for different combinations of mean and variance. However, many years ago,
computational power was limited. Statisticians worked very hard to come up with a table from which people can easily
read off the cumulative distribution function of a normal random variable, i.e., P(X < x). Do we
need a table for each combination of µ and σ²? It turns out that all normal random variables
can be transformed to standard normal random variables easily. Thus, for those who know how to do the
transformation, we only need one table: the standard normal table.
Definition 4 (Standard Normal distribution): A standard normal random variable is a normal
random variable with zero mean (µ = 0) and unit standard deviation (σ = 1). Its probability
density is defined on the real line (from −∞ to ∞):

f(x) = (1/√(2π)) e^(−x²/2)
[Figure: pdf and cdf of the standard normal distribution, plotted for x from −2.5 to 2.5.]
The standard normal distribution is sometimes known as z-distribution.
Theorem 2 (Transform to Standard Normal Distribution): A linear transformation of a normal
random variable will remain normal. In particular, any normal random variable with mean µ
and standard deviation σ can be transformed to a standard normal random variable:

Z = (X − µ)/σ
Thus, P(X ∈ [a, b]) = P(Z ∈ [(a − µ)/σ, (b − µ)/σ]). With this property, we do not need separate
probability tables for different µ and σ. Instead, we only need one table: the standard normal table.
Although we can easily calculate the probability of a normal random variable with any combination of
mean and variance lying in an interval with the help of a computer, we need to learn how to use the tables
in exams.
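For readers with a computer rather than a table, the standardization in Theorem 2 reduces every normal probability to the standard normal cdf Φ, which can be built from the Python standard library's `math.erf`. This sketch is an illustration (the helper names are ours), not the method the notes use:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf: P(Z < z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2), via the transformation Z = (X - mu)/sigma."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

# Reproduce a table entry: P(0 < Z < 0.84) = phi(0.84) - phi(0), about 0.2995
p_table = phi(0.84) - phi(0)
```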
Example 4 (How to read the standard normal distribution table): Suppose we have the following
part of a standard normal distribution table, accompanied by a graph:
z    ...  0.03    0.04    0.05    0.06    0.07    ...
...  ...  ...     ...     ...     ...     ...     ...
0.7  ...  0.2673  0.2704  0.2734  0.2764  0.2794  ...
0.8  ...  0.2967  0.2995  0.3023  0.3051  0.3078  ...
0.9  ...  0.3238  0.3264  0.3289  0.3315  0.3340  ...
1.0  ...  0.3485  0.3508  0.3531  0.3554  0.3577  ...
1.1  ...  0.3708  0.3729  0.3749  0.3770  0.3790  ...
...  ...  ...     ...     ...     ...     ...     ...
[Figure: standard normal curve with the area between 0 and z shaded.]
Let’s first make clear what this table means. The leftmost column and the top
row combine to make the number we want, i.e., the z in the graph. The inner part of the table
is the resulting probability, i.e., the shaded area in the graph. For example, suppose we want to
know Prob(0 < X < 0.84); we just find the row of 0.8 and the column of 0.04, and then read
the number in the cell, i.e., 0.2995, so we have Prob(0 < X < 0.84) = 0.2995.
Try the following:
1. Prob(0 < X < 1.163)
This is similar to our previous example. Because the table gives z only at two decimal places,
we round 1.163 to the nearest hundredth, 1.16 (i.e., we take Prob(0 < X <
1.163) ≈ Prob(0 < X < 1.16)). Then we look the value up in the table
at row 1.1 and column 0.06, which is simply 0.3770.
2. Prob(X > 0.77)
First remember that one of the very nice properties of the normal distribution is its symmetry,
i.e., Prob(X > µ) = Prob(X < µ) = 0.5. This tells us that to get Prob(X > 0.77)
we need just one more step. We first find that Prob(0 < X < 0.77) = 0.2794. Then
Prob(X > 0.77) = 0.5 − 0.2794 = 0.2206.
3. Prob(0.77 < X < 1.16)
We should have no trouble getting Prob(0 < X < 0.77) = 0.2794 and Prob(0 < X < 1.16) =
0.3770, as we have done already. Then Prob(0.77 < X < 1.16) is simply their difference:
Prob(0.77 < X < 1.16) = 0.3770 − 0.2794 = 0.0976.
4. Prob(−0.85 < X < 0)
Symmetry helps again here, telling us that Prob(−0.85 < X < 0) = Prob(0 < X < 0.85). We
have no problem finding Prob(0 < X < 0.85), which sits at row 0.8 and column 0.05:
it is 0.3023.
5. Prob(X < −1.16)
We apply symmetry again to get Prob(X < −1.16) = Prob(X > 1.16) = 0.5 − Prob(0 <
X < 1.16) = 0.5 − 0.3770 = 0.123.
6. Prob(−0.85 < X < 0.77)
Since the events of X falling into (−0.85, 0) and into (0, 0.77) are mutually exclusive, we know
Prob(−0.85 < X < 0.77) = Prob(−0.85 < X < 0) + Prob(0 < X < 0.77) = 0.3023 +
0.2794 = 0.5817.
7. Sometimes we want to use the table in reverse, to find a particular z instead of the probability.
   Suppose Prob(−k < X < k) = 0.75, (k > 0). What is k here?
   Considering the symmetry of the normal distribution, 0.75 = Prob(−k < X < k) = 2 ×
   Prob(0 < X < k), i.e., Prob(0 < X < k) = 0.375. Therefore we look for 0.375 (or the
   number closest to 0.375) in the table. We find at row 1.1 and column 0.05 the
   probability 0.3749. Therefore a reasonably precise result would be k = 1.15.
Recall that all normal distributions with different µ and σ, i.e., different X ∼ N(µ, σ²), can be
transformed to standard normal through a simple linear transformation:

Z = (X − µ)/σ

Thus the standard normal distribution table helps us with much more than the examples above. See
the following further examples:
1. Given X ∼ N(1, 0.25), what is Prob(0 < X < 1.5)?
   Transforming X to the standard normal Z, we have Prob(0 < X < 1.5) = Prob((0 − 1)/0.5 < Z <
   (1.5 − 1)/0.5) = Prob(−2 < Z < 1). We are able to calculate this through the above exercises.
   Note that the table given in this example is not enough; a full table covering −2 and 1
   must be used. The answer should be close to Prob(0 < X < 1.5) = 0.8185, as different
   tables report at different accuracy.
2. Given X ∼ N(−2, 4), and Prob(−1 < X < k) = 0.10. What is k?
Again we do the standard normal transformation first. The lower boundary −1 is transformed
to (−1 − (−2))/2 = 0.5; the upper boundary k becomes (k + 2)/2 after the linear transformation. Thus

Prob(−1 < X < k) = Prob(0.5 < Z < (k + 2)/2) = Prob(0 < Z < (k + 2)/2) − Prob(0 < Z < 0.5) = 0.1

or

Prob(0 < Z < (k + 2)/2) = 0.1 + Prob(0 < Z < 0.5)
= 0.1 + 0.1915
= 0.2915

Then we look for the number closest to 0.2915 in a full standard normal table. At ordinary
accuracy requirements, (k + 2)/2 = 0.81 would be good enough, or k = 2 × 0.81 − 2 = −0.38.
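The reverse lookup in item 7 can also be done without a table by inverting Φ numerically. The bisection sketch below is an illustration, using an erf-based standard normal cdf from the Python standard library:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Find z with phi(z) = p by bisection (phi is increasing)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Item 7: Prob(-k < Z < k) = 0.75 means phi(k) = 0.5 + 0.75/2 = 0.875, so k is about 1.15
k = phi_inv(0.875)
```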
Example 5 (Normal distribution): Suppose you work in Quality Control for GE. Light bulb life
has a normal distribution with µ = 2000 hours and σ = 200 hours.
1. What is the probability that a bulb will last between 2000 and 2400 hours?
P (2000 < X < 2400) = P [(2000− µ)/σ < (X − µ)/σ < (2400− µ)/σ]
= P [0 < (X − µ)/σ < (2400− µ)/σ]
= P [0 < Z < 2]
= 0.4772
2. What is the probability that a bulb will last less than 1470 hours?

P(X < 1470) = P[(X − µ)/σ < (1470− µ)/σ]
= P[Z < −2.65]
= P[Z > 2.65]
= 0.5 − P[0 < Z < 2.65]
= 0.5 − 0.4960
= 0.0040
Example 6 (Normal distribution): The daily water usage per person in New Providence, New
Jersey is normally distributed with a mean of 20 gallons and a standard deviation of 5 gallons.
1. About 68 percent of those living in New Providence will use how many gallons of water?
Note that we know that P (µ − σ < X < µ + σ) = 0.6826. Thus, about 68% of the daily
water usage will lie between 15 (µ− σ) and 25 (µ + σ) gallons.
2. What is the probability that a person from New Providence selected at random will use
between 20 and 24 gallons per day?
P (20 < X < 24) = P [(20− 20)/5 < (X − 20)/5 < (24− 20)/5] = P [0 < Z < 0.8]
The area under a normal curve between a z-value of 0 and a z-value of 0.80 is 0.2881. We
conclude that 28.81 percent of the residents use between 20 and 24 gallons of water per day.
3. What percent of the population use between 18 and 26 gallons of water per day?
P (18 < X < 26) = P [(18− 20)/5 < (X − 20)/5 < (26− 20)/5]
= P (−0.4 < Z < 1.2)
= P (−0.4 < Z < 0) + P (0 < Z < 1.2)
= P (0 < Z < 0.4) + P (0 < Z < 1.2)
= 0.1554 + 0.3849
= 0.5403
Example 7 (Normal distribution): Professor Wong has determined that the scores in his statis-
tics course are approximately normally distributed with a mean of 72 and a standard deviation
of 5. He announces to the class that the top 15 percent of the scores will earn an A. What is the
lowest score a student can earn and still receive an A?
To begin, let k be the score that separates an A from a B. If 15 percent of the students score more
than k, then 35 percent must score between the mean of 72 and k.
1. Write down the relation between k and the probability: P (X > k) = 0.15 and P (X < k) =
1− P (X > k) = 0.85
2. Transform X into z:
P [(X − 72)/5 < (k − 72)/5] = P [Z < (k − 72)/5]
3. We look for s = (k − 72)/5 such that
P [0 < Z < s] = 0.85− 0.5 = 0.35
From the standard normal table, we find s = 1.04:
P [0 < Z < 1.04] = 0.35
4. Compute k: (k − 72)/5 = 1.04 implies k = 77.2
Thus, those with a score of 77.2 or more earn an A.
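The same cutoff can be found with the inverse normal CDF, which replaces steps 2–4; a minimal sketch, with the mean and standard deviation taken from the example:

```python
from statistics import NormalDist

# Scores: normal with mean 72 and standard deviation 5
scores = NormalDist(mu=72, sigma=5)

# Top 15% earn an A, so the cutoff k satisfies P(X < k) = 0.85
k = scores.inv_cdf(0.85)
print(round(k, 1))  # about 77.2
```

The small difference from the text's 77.2 vs. the unrounded 77.18 comes from reading z = 1.04 off a table instead of using the exact inverse CDF.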
Example 8 (Stock returns): Let X be daily stock returns (percentage change of daily stock
prices). Suppose X is distributed as normal with mean 0 and variance 4, i.e., X ∼ N(0, 4).
What is the probability that the daily stock returns will lie between -2 and +2?
First make the linear transformation from X ∼ N(0, 4) to standard normal Z:
P (−2 < X < 2) = P [(−2 − 0)/√4 < Z < (2 − 0)/√4] = P (−1 < Z < 1)
and because of the symmetry of normal distribution,
P (−2 < X < 2) = 2P (0 < Z < 1)
Then we can easily find the probability form the standard normal table:
P (−2 < X < 2) = 2× 0.3413 = 0.6826
The probability that the daily stock returns will lie between -2 and +2 is 0.6826.
Example 9 (Personal income): Let X be monthly personal income (in dollars). Suppose
log(X) is distributed as normal with mean 9 and variance 16, i.e., log(X) ∼ N(9, 16). What is
the probability that the monthly personal income of a randomly drawn person will be less than
5000? Is there a reason we assume log(X) instead of X to follow a normal distribution?
For simplicity, let us denote Y = log(X) as a new random variable; we know Y follows a
normal distribution N(9, 16). Because Y = log(X) is a monotonically increasing function
(i.e., Y increases as X increases), the event X < 5000 is the same as the event
Y < log(5000) = 8.5172. As for why we assume log(X) rather than X to be normal: income is
non-negative and typically right-skewed, so the normal distribution (which is symmetric and
takes negative values) fits log(X) much better than X itself.
Next we can transform Y to standard normal, Z:
P (Y < 8.5172) = P [Z < (8.5172 − 9)/√16 = −0.12] = 0.5 − P (0 < Z < 0.12)
Looking up the standard normal table, the probability is 0.5 − 0.0478 = 0.4522.
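Numerically, the same answer follows by applying the normal CDF of Y = log(X) at log(5000); a sketch with the parameters from the example:

```python
from math import log
from statistics import NormalDist

# Y = log(X) ~ N(9, 16), i.e. mean 9 and standard deviation sqrt(16) = 4
y = NormalDist(mu=9, sigma=4)

# P(X < 5000) = P(Y < log(5000)) because log is monotonically increasing
p = y.cdf(log(5000))
print(round(p, 4))  # about 0.452 (the text, using table rounding, gets 0.4522)
```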
3.1 Checking for normality
The normal distribution is one of the most important distributions. But how do we know whether data are,
at least approximately, normally distributed? There are at least two ways to check.
First, we can check the moments. We know that a normally distributed random variable has zero skewness
(due to the symmetry of the distribution) and zero excess kurtosis.¹ If the sample skewness and excess kurtosis
are close to zero, we have evidence that the data are likely from a normally distributed population.
Simulation 1 (Checking the skewness and kurtosis for normality):
1. Generate 50 observations from a standard normal distribution N(0, 1), and another 50 observations
from a uniform distribution U(0, 1). Compute their sample skewness and excess kurtosis:²

skewness = [√n × Σ(xi − x̄)³] / [Σ(xi − x̄)²]^(3/2)

excess kurtosis = [n × Σ(xi − x̄)⁴] / [Σ(xi − x̄)²]² − 3

where the sums run over i = 1, ..., n.
2. Repeat the last step 1000 times. Report the skewness and excess kurtosis calculations, and the
average skewness and excess kurtosis over these 1000 repetitions.

¹ For instance, refer to Section 16.7, Tests for Skewness and Excess Kurtosis, p. 567, of Estimation and Inference in Econometrics by Davidson and MacKinnon [1].
² Note that a slight adjustment is needed to obtain unbiased estimators of these statistics.
Below is one set of possible results we have generated:

Distribution   Observations   Average Skewness   Average Excess Kurtosis
U(0,1)         50             0.0114             -1.1496
U(-2,2)        50             -0.0077            -1.1375
N(0,1)         50             -0.0008            0.0031
N(-2,5)        50             0.0077             -0.0290
The table shows that the normal distribution has much closer-to-zero skewness and excess kurtosis
than the uniform distribution, and that this holds regardless of the means and variances.
[Refer to sim1.xls]
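Simulation 1 can also be sketched in Python rather than Excel. This is our illustration: the seed and the biased (unadjusted) moment formulas from the text are the assumptions, so the averages will differ slightly from the table above:

```python
import random

def skew_kurt(xs):
    """Sample skewness and excess kurtosis as defined in the text
    (no small-sample bias adjustment)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs)
    m3 = sum((x - m) ** 3 for x in xs)
    m4 = sum((x - m) ** 4 for x in xs)
    return (n ** 0.5) * m3 / m2 ** 1.5, n * m4 / m2 ** 2 - 3

random.seed(1)
reps, n = 1000, 50
results = {}
for name, draw in [("N(0,1)", lambda: random.gauss(0, 1)),
                   ("U(0,1)", random.random)]:
    sims = [skew_kurt([draw() for _ in range(n)]) for _ in range(reps)]
    results[name] = (sum(s for s, _ in sims) / reps,
                     sum(k for _, k in sims) / reps)
    print(name, results[name])
```

For N(0,1), both averages should be close to zero; for U(0,1), the average excess kurtosis is clearly negative (the population value is −1.2), echoing the table above.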
Alternatively, we can use a normal probability plot. Suppose we have n observations in the sample.
1. Sort them in ascending order.
2. Compute the empirical z value (i.e., (x− x̄)/σ).
3. Generate a column 0.5, 1.5, ..., [0.5 + (n− 1)]. Call this column U .
4. Generate another column p(z) = U/n.
5. Generate another column theoretical z = NORMSINV (p(z)).
6. Plot empirical z against the theoretical z.
If the data come from a normal distribution, the plot should be approximately a straight line. We illustrate
the steps in the following example.
Example 10 (normal probability plot):
1. Generate 1000 observations from N(0, 1).
2. Sort them in ascending order.
3. Compute the empirical z value (i.e., (x− x̄)/σ).
4. Generate a column 0.5, 1.5, ..., [0.5 + (n− 1)]. Call this column U .
5. Generate another column p(z) = U/n.
6. Generate another column theoretical z = NORMSINV (p(z)).
7. Plot empirical z against the theoretical z.
8. Repeat with 1000 observations drawn from U(0, 1).
The two plots are shown below.
[Figure: Normal probability plot (underlying population = normal); horizontal axis: theoretical z value; vertical axis: z value from data]

[Figure: Normal probability plot (underlying population = uniform); horizontal axis: theoretical z value; vertical axis: z value from data]
As is evident from the two plots, the points lie more or less on a straight line when the underlying
population is normal, but not when the underlying population is uniform.
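The steps of the normal probability plot can be sketched in Python, using `statistics.NormalDist` in place of Excel's NORMSINV. This sketch only computes the coordinate pairs rather than drawing the plot; the seed and sample are illustrative:

```python
import random
from statistics import NormalDist, mean, stdev

def probability_plot_points(data):
    """Return (theoretical z, empirical z) pairs for a normal probability plot."""
    n = len(data)
    xs = sorted(data)                      # step: sort ascending
    m, s = mean(xs), stdev(xs)
    empirical = [(x - m) / s for x in xs]  # step: empirical z
    # steps: plotting positions p = (i + 0.5)/n and theoretical z = inverse CDF at p
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, empirical))

random.seed(0)
pts = probability_plot_points([random.gauss(0, 1) for _ in range(1000)])
print(pts[0], pts[-1])  # for normal data, the pairs lie near the 45-degree line
```

Plotting the pairs (with any charting tool) reproduces the figure above; repeating with `random.random()` draws produces the S-shaped deviation typical of uniform data.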
4 Bivariate distributions
Except for the use of integration, the bivariate distributions of continuous random variables are not very
different from those of discrete random variables.
Theorem 3 (Characteristics of a Bivariate Continuous Distribution): Let X and Y be continuous
random variables defined on the intervals [a, b] and [c, d], respectively, with joint probability
density function f(x, y).
1. The probability density function f(x, y) is non-negative and may be larger than 1.
2. The probability that X takes a value in an interval [m,n] and Y in [k, l] is

P (X ∈ [m,n], Y ∈ [k, l]) = ∫_{x=m}^{x=n} ∫_{y=k}^{y=l} f(x, y) dy dx

3. P (X ∈ [m,n], Y ∈ [k, l]) = ∫_{x=m}^{x=n} ∫_{y=k}^{y=l} f(x, y) dy dx is between 0 and 1.00.
4. The sum of the probabilities of the various outcomes is 1.00. That is,

P (X ∈ [−∞,∞], Y ∈ [−∞,∞]) = P (X ∈ [a, b], Y ∈ [c, d]) = ∫_{x=a}^{x=b} ∫_{y=c}^{y=d} f(x, y) dy dx = 1
5. Let two events be defined on the non-overlapping regions ([m1, n1], [k1, l1]) and ([m2, n2], [k2, l2]):
the event X ∈ [m1, n1] and Y ∈ [k1, l1], and the event X ∈ [m2, n2] and Y ∈ [k2, l2]. These two
events are mutually exclusive. That is,

P ((X ∈ [m1, n1] and Y ∈ [k1, l1]) and (X ∈ [m2, n2] and Y ∈ [k2, l2])) = 0

P ((X ∈ [m1, n1] and Y ∈ [k1, l1]) or (X ∈ [m2, n2] and Y ∈ [k2, l2]))
= P (X ∈ [m1, n1] and Y ∈ [k1, l1]) + P (X ∈ [m2, n2] and Y ∈ [k2, l2])
6. The marginal density function of X is

f(x) = ∫_{y=−∞}^{∞} f(x, y) dy = ∫_y f(x, y) dy.

Note that the marginal density function of X is used when we do not care about the values
Y takes. Similarly, the marginal density function of Y is

f(y) = ∫_{x=−∞}^{∞} f(x, y) dx = ∫_x f(x, y) dx.
7. The conditional density function of X given Y is

f(x|y) = f(x, y)/f(y) if f(y) > 0, and f(x|y) = 0 if f(y) = 0.

Definition 5 (Independent bivariate uniform distribution): If a, b, c and d are numbers on the
real line, the random variable (X1, X2) ∼ U(a, b, c, d), i.e., has an independent bivariate uniform
distribution, if

f(x1, x2) = 1/[(b − a)(d − c)] for a ≤ x1 ≤ b and c ≤ x2 ≤ d
f(x1, x2) = 0 otherwise
Definition 6 (Bivariate normal distribution): The two random variables (X, Y ) ∼ N(µx, µy, σ²x, σ²y, ρ),
i.e., have a bivariate normal distribution with a correlation coefficient of ρ, if the two random
variables are defined jointly on the whole real line (−∞,∞) and have the following density function:

f(x, y) = [1/(2π σx σy √(1 − ρ²))] exp{−[εx² + εy² − 2ρ εx εy] / [2(1 − ρ²)]}

where εx = (x − µx)/σx and εy = (y − µy)/σy.
5 Expectations
The expectation plays a central role in statistics and economics. The expectation (often known as the mean)
reports the central location of the data. It is also known as the long-run average value of the random
variable, i.e., the average of the outcomes over many repetitions of the experiment.
Definition 7 (Expectation (mean)): Let X be a continuous random variable defined over [a, b],
with probability density function f(x). The expectation of X is

E(X) = ∫_{x=−∞}^{∞} x f(x) dx = ∫_x x f(x) dx

The expectation, E(X), is often denoted by the Greek letter µ (pronounced "mu").

Thus, the expectation of a random variable is a weighted average of all the possible values of the random
variable, weighted by its probability density function.
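The integral in Definition 7 can be checked numerically with a midpoint Riemann sum. As an illustration (our choice, not from the text), take X ∼ U(2, 6), whose mean should be (2 + 6)/2 = 4:

```python
# Approximate E(X) = integral of x f(x) dx by a midpoint Riemann sum,
# for X ~ U(2, 6) with density f(x) = 1/(6 - 2) on [2, 6]
a, b = 2.0, 6.0
f = lambda x: 1.0 / (b - a)

n = 100_000
dx = (b - a) / n
ex = sum((a + (i + 0.5) * dx) * f(a + (i + 0.5) * dx) * dx for i in range(n))
print(round(ex, 4))  # about 4.0, matching (a + b)/2
```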
Definition 8 (Conditional Expectation): For a bivariate probability distribution, the conditional
expectation or conditional mean E(X|Y = y) is computed by the formula

E(X|Y = y) = ∫_x x f(x|y) dx

The unconditional expectation or mean of X is related to the conditional mean:

E(X) = ∫_y E(X|Y = y) f(y) dy = E[E(X|Y )]
Theorem 4 (Expectation of a linear transformed random variable): If a and b are constants and
X is a random variable, then
1. E(a) = a
2. E(bX) = bE(X)
3. E(a + bX) = a + bE(X)
Proof: We will only show the most general case, E(a + bX) = a + bE(X).

E(a + bX) = ∫_x (a + bx) f(x) dx
= ∫_x a f(x) dx + ∫_x b x f(x) dx
= a ∫_x f(x) dx + b ∫_x x f(x) dx
= a × 1 + bE(X)
= a + bE(X)
Definition 9 (Variance): Let X be a continuous random variable defined over [a, b], with
probability density function f(x). The variance of X is

V (X) = ∫_x (x − E(X))² f(x) dx

The variance, V (X), is often denoted by σ² (pronounced "sigma squared").

Note that the variance of a random variable is the expectation of the squared deviation of the random
variable from its mean. That is, if we define a transformed variable Z = (X − E(X))², then V (X) = E(Z).
Thus, we expect results about the variance of a transformed variable to resemble those about the
expectation of a transformed variable.
Example 11 (Variance of a random variable): Suppose X and Y are jointly distributed random
variables with probability density function f(x, y). The variance of X is

V (X) = ∫_y ∫_x (x − E(X))² f(x, y) dx dy
Definition 10 (Conditional Variance): For a bivariate probability distribution, the conditional
variance V (X|Y = y) is computed by the formula

V (X|Y = y) = ∫_x (x − E(X|Y = y))² f(x|y) dx
Theorem 5 (Variance of a linear transformed random variable): If a and b are constants and
X is a random variable, then
1. V (a) = 0
2. V (bX) = b²V (X)
3. V (a + bX) = b²V (X)

Proof: We will only show the most general case, V (a + bX) = b²V (X).

V (a + bX) = E[(a + bX) − (a + bE(X))]²
= E[bX − bE(X)]²
= E[b(X − E(X))]²
= b²E[(X − E(X))²]
= b²V (X)
Definition 11 (Covariance): The covariance between two random variables X and Y is

C(X, Y ) = E[(X − E(X))(Y − E(Y ))]
= ∫_x ∫_y (x − E(X))(y − E(Y )) f(x, y) dy dx
Note that the covariance can be written as
C(X, Y ) = E[(X − E(X))(Y − E(Y ))]
= E[XY − E(X)Y −XE(Y ) + E(X)E(Y )]
= E[XY ]− E[E(X)Y ]− E[XE(Y )] + E[E(X)E(Y )]
= E[XY ]− E(X)E(Y )− E(X)E(Y ) + E(X)E(Y )
= E[XY ]− E(X)E(Y )
Theorem 6 (Covariance of a linear transformed random variable): If a and b are constants and
X is a random variable, then
1. C(a, b) = 0
2. C(a, bX) = 0
3. C(a + bX, Y ) = bC(X, Y )
Proof: We will only show the most general case, C(a + bX, Y ) = bC(X, Y ).

C(a + bX, Y ) = E{[(a + bX) − (a + bE(X))][Y − E(Y )]}
= E{[bX − bE(X)][Y − E(Y )]}
= E{[b(X − E(X))][Y − E(Y )]}
= bE{[X − E(X)][Y − E(Y )]}
= bC(X, Y )
Theorem 7 (Variance of a sum of random variables): If a and b are constants, X and Y are
random variables, then
1. V (X + Y ) = V (X) + V (Y ) + 2C(X, Y )
2. V (aX + bY ) = a²V (X) + b²V (Y ) + 2abC(X, Y )

Proof: We will only show the most general case, V (aX + bY ) = a²V (X) + b²V (Y ) + 2abC(X, Y ).

V (aX + bY ) = E[(aX + bY ) − (aE(X) + bE(Y ))]²
= E[aX − aE(X) + bY − bE(Y )]²
= E[a(X − E(X)) + b(Y − E(Y ))]²
= E[a²(X − E(X))² + b²(Y − E(Y ))² + 2ab(X − E(X))(Y − E(Y ))]
= a²E[(X − E(X))²] + b²E[(Y − E(Y ))²] + 2abE[(X − E(X))(Y − E(Y ))]
= a²V (X) + b²V (Y ) + 2abC(X, Y )
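Theorem 7 can be checked by simulation; a sketch, where the distributions, constants, and seed are all illustrative choices (the draws are independent, so C(X, Y) ≈ 0):

```python
import random

random.seed(42)
n = 200_000
a, b = 2.0, -1.0

# Independent draws, so C(X, Y) = 0 and V(aX + bY) should equal a^2 V(X) + b^2 V(Y)
xs = [random.gauss(0, 1) for _ in range(n)]   # V(X) = 1
ys = [random.gauss(0, 3) for _ in range(n)]   # V(Y) = 9

def var(zs):
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

lhs = var([a * x + b * y for x, y in zip(xs, ys)])
rhs = a ** 2 * var(xs) + b ** 2 * var(ys)
print(round(lhs, 2), round(rhs, 2))  # both close to 4*1 + 1*9 = 13
```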
Definition 12 (Independence): Consider two random variables X and Y with joint probability
density f(x, y), marginal probability f(x), f(y), conditional probability f(x|y) and f(y|x).
1. They are said to be independent of each other if and only if
f(x, y) = f(x)× f(y) for all x and y.
X and Y are independent if the joint density, f(x, y), is the product of the corresponding
marginal probability density functions.
2. X is said to be independent of Y if and only if
f(x|y) = f(x) for all x and y.
3. Y is said to be independent of X if and only if
f(y|x) = f(y) for all x and y.
Theorem 8 (Consequence of Independence): If X and Y are independent random variables, we
will have
E(XY ) = E(X)E(Y )
Note, however, that E(XY ) = E(X)E(Y ) does not imply that the random variables X and Y are independent.
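A standard counterexample for this last remark (our illustration, not from the text): take X standard normal and Y = X². Then Y is completely determined by X, yet E(XY) = E(X³) = 0 = E(X)E(Y):

```python
import random

random.seed(7)
n = 200_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x * x for x in xs]  # Y = X^2 is a function of X, so X and Y are dependent

e = lambda zs: sum(zs) / len(zs)
exy = e([x * y for x, y in zip(xs, ys)])   # E(XY) = E(X^3) = 0 by symmetry
ex_ey = e(xs) * e(ys)                      # E(X)E(Y) = 0 * 1 = 0
print(round(exy, 2), round(ex_ey, 2))  # both near 0, despite the dependence
```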
6 The Normal Approximation to the Binomial
The normal distribution (a continuous distribution) yields a good approximation of the binomial distribution
(a discrete distribution) for large values of n.
Recall for the binomial experiment:
1. There are only two mutually exclusive outcomes (success or failure) on each trial.
2. A binomial distribution results from counting the number of successes.
3. Each trial is independent.
4. The probability of success is the same from trial to trial, and the number of trials n is fixed.
The normal probability distribution is generally a good approximation to the binomial probability distribution
when nπ and n(1 − π) are both greater than 5 – because of the Central Limit Theorem (to be discussed
in the next chapter). However, because the normal distribution can take all real numbers (is continuous)
while the binomial distribution can only take integer values (is discrete), we need to correct for continuity.
A normal approximation to the binomial should identify the binomial event “8” with the normal interval
“(7.5, 8.5)” (and similarly for other integer values).
Binomial Event Normal Interval
0 (-0.5,0.5)
1 (0.5,1.5)
2 (1.5,2.5)
3 (2.5,3.5)
... ...
x (x− 0.5, x + 0.5)
... ...
[Figure: Probability histogram of B(p, n) = B(.3, 100) and its normal approximation N(np, np(1 − p)) = N(30, 21)]
Example 12 (Continuity correction in normal approximation of binomial): If n = 20 and
π = .25, what is the probability that X is greater than or equal to 8?
The normal approximation without the continuity correction factor yields

Z = (8 − 20 × .25)/√(20 × .25 × .75) = 1.55

so P (X ≥ 8) ≈ P (Z ≥ 1.55) = .0606 (from the standard normal table).
The continuity correction factor requires us to use 7.5 in order to include 8 since the inequality
is weak and we want the region to the right.
Z = (7.5 − 20 × .25)/√(20 × .25 × .75) = 1.29

P (X ≥ 7.5) = P (Z ≥ 1.29) = .0985. The exact solution from the binomial distribution function is
.1019. Thus, the normal approximation with the continuity correction yields a good approximation to the
binomial.
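The exact binomial tail and both normal approximations can be compared directly; a sketch using only the standard library, with n and π from the example:

```python
from math import comb
from statistics import NormalDist

n, p = 20, 0.25
mu = n * p                        # 5
sigma = (n * p * (1 - p)) ** 0.5  # sqrt(3.75)

# Exact binomial tail P(X >= 8)
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(8, n + 1))

z = NormalDist()
no_correction = 1 - z.cdf((8 - mu) / sigma)       # uses 8 directly
with_correction = 1 - z.cdf((7.5 - mu) / sigma)   # continuity correction: 7.5

print(round(exact, 4), round(no_correction, 4), round(with_correction, 4))
```

The corrected approximation (about 0.098) lands much closer to the exact tail (about 0.102) than the uncorrected one (about 0.061).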
Example 13 (Continuity correction in normal approximation of binomial): A recent study by
a marketing research firm showed that 15% of American households owned a video camera. For
a sample of 200 homes, how many of the homes would you expect to have video cameras, and
what is the probability that fewer than 40 of them do?
1. Compute the mean: µ = nπ = 200× .15 = 30.
2. Compute the variance: σ² = nπ(1 − π) = 200 × .15 × (1 − .15) = 25.5. The standard deviation
is σ = √25.5 = 5.0498.
3. “Less than 40” means “less than or equal to 39”. With the continuity correction factor, we use X = 39.5. Hence,
P (X < 39.5) = P [(X − 30)/5.0498 < (39.5 − 30)/5.0498] = P [Z < 1.88] = P [Z < 0] + P [0 <
Z < 1.88] = .5 + .4699 = .9699.
Thus, the likelihood that fewer than 40 of the 200 homes have a video camera is about 97%.
7 The exponential distribution
The exponential distribution is often used to model the length of time between the occurrences of two events or
between two occurrences of the same event (the time between arrivals).
Example 14 (Exponential distribution):
1. Time taken for your instructor to respond to your email.
2. Time between the birth of two babies.
3. Time taken to find a new job since layoff – the so-called unemployment spell.
4. Time to complete a wage bargaining between a company and a labor union.
5. Time to complete the accession to WTO.
6. Time between two major floods in Sichuan, China.
7. Time taken for the police to solve a crime case.
8. Time taken to obtain a job promotion.
Definition 13 (Exponential distribution):
The exponential random variable T (T > 0) has probability density function

f(t) = λe^(−λt) for t > 0

where λ is the mean number of occurrences per unit time, t is the length of time until the next
occurrence, and e = 2.71828....

The cumulative distribution function (the probability that an arrival time is less than some
specified time t) is

F (t) = P (T < t) = 1 − e^(−λt)

The mean and variance of an exponential random variable are

E(T ) = 1/λ
V ar(T ) = 1/λ²

Note that the exponential distribution requires only one parameter, the rate λ (lambda); its mean is 1/λ.
What are the other distributions that are completely characterized by one parameter?
Example 15 (Exponential distribution): Customers arrive at the service counter at the rate of
15 per hour. What is the probability that the arrival time between consecutive customers is less
than three minutes?
Let T be the arrival time between consecutive customers. The mean number of arrivals per hour
is 15, so λ = 15. Three minutes is .05 hours. Hence, we have
P (T < .05) = 1 − e^(−λt) = 1 − e^(−(15)(.05)) = 0.5276
So there is a 52.76% probability that the arrival time between successive customers is less than
three minutes.
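The exponential CDF is simple enough to evaluate directly; a sketch with λ and t from the example:

```python
from math import exp

lam = 15.0          # arrival rate: 15 customers per hour
t = 3 / 60          # three minutes expressed in hours

# P(T < t) = 1 - e^{-lambda * t}
p = 1 - exp(-lam * t)
print(round(p, 4))  # about 0.5276
```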
Example 16 (Exponential distribution): On average, it takes about seven years to accede to
the GATT/WTO. What is the probability that a country takes more than 15 years to accede to
GATT/WTO?
Let T be the time taken. Since E(T ) = 1/λ = 7, we have λ = 1/7. Hence,

P (T < 15) = 1 − e^(−λt) = 1 − e^(−(1/7)(15)) = 0.883
P (T ≥ 15) = 1 − P (T < 15) = e^(−(1/7)(15)) = 0.117

Thus, there is an 11.7% probability that it takes more than 15 years to accede to the GATT/WTO.³
[Still need to write a lot more numerical examples for different sections!!]
[Plot some figures associated with the numerical examples!!]
³ It took China 15 years to accede to the GATT/WTO.
References
[1] Davidson, Russell, and James G. MacKinnon (1993): Estimation and Inference in Econometrics. Oxford
University Press.
Mathematical appendix (A brief introduction to integration)
Integration is a useful tool for computing the area under a curve and above zero. Suppose we have a curve
defined by the function f(x), and we want to compute the area under f(x) and above 0 between a and b.
Let's consider several cases.

1. Let's start with the simplest case, where f(x) is a constant, i.e., f(x) = k. In this case, the area is
a rectangle. It is easy to compute the area as (b − a) × k.

2. Suppose f(x) = k1 for x ∈ [a, c1] and f(x) = k2 for x ∈ [c1, b]. In this case, the area is a sum
of two rectangles. It is still easy to compute the area as (c1 − a) × k1 + (b − c1) × k2.

3. Suppose f(x) = k1 for x ∈ [a, c1], f(x) = k2 for x ∈ [c1, c2], and f(x) = k3 for x ∈ [c2, b]. In
this case, the area is a sum of three rectangles. It is still easy to compute the area as
(c1 − a) × k1 + (c2 − c1) × k2 + (b − c2) × k3.

The three examples illustrate how one can compute the area when it is a combination of rectangles.
For a general f(x), we can approximate the area by a sum of rectangles. Let's define

g(x) = k1 for x ∈ [a, a + dx]
g(x) = k2 for x ∈ [a + dx, a + 2dx]
g(x) = k3 for x ∈ [a + 2dx, a + 3dx]
...

or, in general,

g(x) = km for x ∈ [a + (m − 1)dx, a + m dx]

If g(x) approximates f(x) well, the approximated area is a sum of rectangles:

k1 × dx + k2 × dx + k3 × dx + ...

What are the values of k1, k2, ...? One possibility is to take km = f(a + (m − 1)dx). That is,

g(x) = f(a + (m − 1)dx) for x ∈ [a + (m − 1)dx, a + m dx]
When the width of each rectangle is small, it does not matter if we take f(a + (m− 1)dx) or f(a + mdx) or
any point in [a + (m− 1)dx, a + mdx].
Thus, the area may be approximated by

f(a) × dx + f(a + dx) × dx + f(a + 2dx) × dx + ...

The approximation improves as dx approaches zero, in which case the area becomes a sum of infinitely
many very small rectangles. We write the area as

∫_{a}^{b} f(x) dx

Mathematicians have developed neat ways to compute such areas exactly. All we need is to borrow those
formulas from them.
Theorem 9 (Some integration formulas):

1. Let f(x) = k, a constant (such as the density of a uniform distribution). The area under
the curve f(x) over the interval [a, b] is

∫_{a}^{b} f(x) dx = k ∫_{a}^{b} dx = k(x|_b − x|_a) = k(b − a)

where g(x)|_a denotes g(a).

2. Let f(x) = kx, a linear function (such as the product of the random variable and the density of
a uniform distribution). The area under the curve f(x) over the interval [a, b] is

∫_{a}^{b} f(x) dx = k ∫_{a}^{b} x dx = k(x²/2 |_b − x²/2 |_a) = k(b²/2 − a²/2)

where g(x)|_a denotes g(a).
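Both formulas can be verified with the rectangle approximation described above; a sketch with illustrative values k = 2, a = 1, b = 4 (our choices):

```python
# Midpoint Riemann sums for the two integrals in Theorem 9
k, a, b = 2.0, 1.0, 4.0
n = 100_000
dx = (b - a) / n
mids = [a + (i + 0.5) * dx for i in range(n)]  # midpoint of each small rectangle

area_const = sum(k * dx for _ in mids)         # integral of k dx over [a, b]
area_linear = sum(k * x * dx for x in mids)    # integral of k x dx over [a, b]

print(round(area_const, 4))   # k(b - a) = 6.0
print(round(area_linear, 4))  # k(b^2/2 - a^2/2) = 15.0
```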
Problem sets
We have tried to include some of the most important examples in the text. To get a good understanding of
the concepts, it is most useful to re-do the examples and simulations in the text. Work on the following
problems only if you need extra practice or if your instructor assigns them. Of course, the
more you work on these problems, the more you learn.
Challenge 1 (mean and variance of a uniform distribution): Show that the mean and
variance of a uniform random variable X on [c, d] are

E(X) = (c + d)/2

and

V (X) = (c − d)²/12
Challenge 2 (memorylessness of the exponential distribution): Show that the exponential
distribution is memoryless, i.e., P (T > s + t|T > s) = P (T > t) for all s, t ≥ 0.
Challenge 3 (mean and variance of a univariate uniform distribution): Let X be
uniformly distributed on the interval [a, b]. Find E(X) and V ar(X).
Challenge 4 (mean and variance of a bivariate uniform distribution): Let (X, Y ) be
jointly uniformly distributed on the rectangle [a, b, c, d]. Find E(X) and V ar(X).
Solutions to problem set
1. It is straightforward to guess that the mean of a uniform distribution is simply (c + d)/2, because
the density is the same at every point of the interval (c, d), i.e., f(x) ≡ 1/(d − c) for x ∈ (c, d).
However, we can still prove it mathematically:

E(X) = ∫_{c}^{d} x f(x) dx
= ∫_{c}^{d} [x/(d − c)] dx
= [1/(d − c)] ∫_{c}^{d} x dx
= [1/(d − c)] × x²/2 |_{c}^{d}
= [1/(d − c)] × (d − c)(d + c)/2
= (c + d)/2
and the variance:

V (X) = ∫_{c}^{d} (x − µ)² f(x) dx
= [1/(d − c)] ∫_{c}^{d} (x² − 2µx + µ²) dx
= [1/(d − c)] × (x³/3 − µx² + µ²x)|_{c}^{d}
= (d² + cd + c²)/3 − µ(c + d) + µ²
= (d² + cd + c²)/3 − (c + d)²/2 + (c + d)²/4
= [(4d² + 4cd + 4c²) − (6c² + 12cd + 6d²) + (3c² + 6cd + 3d²)]/12
= (c − d)²/12
2. “Memorylessness” is a property of the exponential distribution. For example, suppose that at a specific
crossroad, traffic accidents occur at a rate of λ per hour. (a) What is the probability that no accident
happens in the next t hours? (b) Given that no accident has happened in the past s hours, what is the
probability that no accident happens in the next t hours?

(a) P (T > t) = 1 − (1 − e^(−λt)) = e^(−λt);
(b) P (T > s + t|T > s) = [1 − (1 − e^(−λ(s+t)))]/[1 − (1 − e^(−λs))] = e^(−λ(s+t))/e^(−λs) = e^(−λt).

We find that P (T > t) ≡ P (T > s + t|T > s), which means that how long we have already waited
without an accident does not matter for the probability of an accident in the future – in other
words, the distribution is memoryless. (A similar property holds for the geometric distribution.)
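Memorylessness can also be seen by simulation: among exponential draws that exceed s, the probability of exceeding s + t matches the unconditional probability of exceeding t. A sketch; the rate, s, t, and seed are illustrative choices:

```python
import random

random.seed(3)
lam = 0.5  # rate parameter of the exponential distribution
draws = [random.expovariate(lam) for _ in range(200_000)]

s, t = 1.0, 2.0
# Unconditional: P(T > t)
p_t = sum(1 for d in draws if d > t) / len(draws)
# Conditional: P(T > s + t | T > s), estimated among the draws exceeding s
survivors = [d for d in draws if d > s]
p_cond = sum(1 for d in survivors if d > s + t) / len(survivors)

print(round(p_t, 3), round(p_cond, 3))  # both near e^{-lambda t} = e^{-1} = 0.368
```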
3. Refer to 1.
4. We may reasonably guess that the joint density of the bivariate uniform distribution is constant,
i.e., f(x, y) ≡ 1/[(b − a)(d − c)] for x ∈ (a, b), y ∈ (c, d). Then we can compute the expectation:

E(X) = E[E(X|Y )]
= E[∫_{a}^{b} x f(x|y) dx]
= E[(a + b)/2]
= ∫_{c}^{d} [(a + b)/2] f(y) dy
= (a + b)/2

since the conditional distribution of X given Y = y is uniform on (a, b) and ∫_{c}^{d} f(y) dy = 1.
By the same argument as in Solution 1, the variance is

V (X) = (b − a)²/12