
Review Handout 1

1 Probability and Random Variables

1.1 Random Variables, Probability Distributions, and Probability Densities

1.1.1 Fundamentals of Probability

The sample space S of an experiment or an action is the set of all possible outcomes. Each possible outcome is called a sample point. An event is a set of outcomes, or a subset of the sample space. For an event E, we shall use Pr{E} to denote the probability of E. We first present the axioms that a probability measure must satisfy.²

Axioms of probability: Let S be the sample space and let E, F ⊆ S be events.

1. Pr{S} = 1.

2. 0 ≤ Pr{E} ≤ 1.

3. If E and F are disjoint, i.e. E ∩ F = ∅, then Pr{E ∪ F} = Pr{E}+ Pr{F}.

The above axioms yield the following basic identities.

• If E^c = S − E, then

Pr{E^c} = 1 − Pr{E}

Proof: Since E and E^c are disjoint and their union is S, from statement 3 of the axioms,

Pr{S} = Pr{E} + Pr{E^c}.

From statement 1, we obtain the desired property, i.e.

1 = Pr{E} + Pr{E^c}. □

• If E and F are not disjoint, then

Pr{E ∪ F} = Pr{E}+ Pr{F} − Pr{E ,F}

Proof: Since (F ∩ E) ∪ (F − E) = F and (F ∩ E) ∩ (F − E) = ∅,

Pr{F} = Pr{F, E} + Pr{F − E}  ⇒  Pr{F − E} = Pr{F} − Pr{F, E}.

Since E ∪ (F − E) = E ∪ F and E ∩ (F − E) = ∅,

Pr{E ∪ F} = Pr{E} + Pr{F − E} = Pr{E} + Pr{F} − Pr{E, F}. □

Course notes were prepared by Prof. R.M.A.P. Rajatheva and revised by Dr. Poompat Saengudomlert.

² It is common to write Pr{E ∩ F} as Pr{E, F}. We shall adopt this notation.


• If F1, . . . ,Fn are disjoint, then

Pr{⋃_{i=1}^{n} Fi} = ∑_{i=1}^{n} Pr{Fi}

Proof: The statement follows by induction. For example, consider n = 3. Since F1 ∪ F2 and F3 are disjoint, we can write

Pr{F1 ∪ F2 ∪ F3} = Pr{F1 ∪ F2} + Pr{F3}.

Since F1 and F2 are disjoint, we can write

Pr{F1 ∪ F2 ∪ F3} = Pr{F1} + Pr{F2} + Pr{F3}. □

The conditional probability of event E given that event F happens (or, in short, given event F), denoted by Pr{E|F}, is defined as

Pr{E|F} = Pr{E, F} / Pr{F}

Alternatively, we can write

Pr{E ,F} = Pr{E|F}Pr{F}

A partition of E is a set of disjoint subsets of E whose union is equal to E. Let F1, . . . , Fn be a partition of S. From the definition of conditional probability, we can obtain the total probability theorem, which is written as

Pr{E} = ∑_{i=1}^{n} Pr{E|Fi} Pr{Fi}

Proof: Write

E = ⋃_{i=1}^{n} (E ∩ Fi).

Since F1, . . . , Fn are disjoint, so are the sets E ∩ F1, . . . , E ∩ Fn. Hence,

Pr{E} = ∑_{i=1}^{n} Pr{E, Fi}.

Using the definition of conditional probability, we can write

Pr{E} = ∑_{i=1}^{n} Pr{E|Fi} Pr{Fi}. □

Bayes' theorem states that

Pr{Fi|E} = Pr{E|Fi} Pr{Fi} / ∑_{j=1}^{n} Pr{E|Fj} Pr{Fj}


Proof: Write Pr{Fi|E} as

Pr{Fi|E} = Pr{Fi, E} / Pr{E} = Pr{E|Fi} Pr{Fi} / Pr{E}

and use the total probability theorem for the denominator. □

The conditional probability can be defined based on multiple events. In particular, we define

Pr{E|F1, . . . , Fn} = Pr{E, F1, . . . , Fn} / Pr{F1, . . . , Fn}

It follows that we can write

Pr{F1, . . . , Fn} = Pr{Fn|F1, . . . , Fn−1} Pr{F1, . . . , Fn−1}
                  = Pr{Fn|F1, . . . , Fn−1} Pr{Fn−1|F1, . . . , Fn−2} Pr{F1, . . . , Fn−2}
                  = · · ·
                  = Pr{Fn|F1, . . . , Fn−1} · · · Pr{F3|F1, F2} Pr{F2|F1} Pr{F1},

yielding

Pr{F1, . . . , Fn} = Pr{F1} ∏_{i=2}^{n} Pr{Fi|F1, . . . , Fi−1}

Events E and F are independent if

Pr{E ,F} = Pr{E}Pr{F}

or equivalently

Pr{E|F} = Pr{E}

In addition, events E and F are conditionally independent given event G if

Pr{E ,F|G} = Pr{E|G}Pr{F|G}

1.1.2 Random Variables

A random variable is a mapping that assigns a real number X(s) to each sample point s in the sample space S.

• If S is countable, then X(s) is a discrete random variable.

• If S is uncountable, so that X(s) can take any real value in its range, then X(s) is a continuous random variable.

The basic idea behind a random variable is that we can consider probabilistic events as numerical-valued events, which leads us to a probability function. With this function, we can neglect the underlying mapping from s to X and consider a random variable X as a direct numerical outcome of a probabilistic experiment or action.


1.1.3 Probability Functions

By using a random variable X, we can define numerical-valued events such as X = x and X ≤ x for x ∈ R. The probability function

FX(x) = Pr{X ≤ x}

is known as the cumulative distribution function (CDF) or simply the distribution function. Note that the CDF is defined for all x ∈ R.

• It is customary to denote a random variable by an upper-case letter, e.g. X, and denote its specific value by a lower-case letter, e.g. x.

• The nature of the function FX(x) is determined by random variable X, which is identified in the subscript. When the associated random variable X is clear from the context, we often write F(x) instead of FX(x).

• Since FX(x) indicates a probability value, it is dimensionless.

Some Properties of a CDF

1. FX(−∞) = 0

2. FX(∞) = 1

3. If x1 < x2, then FX(x1) ≤ FX(x2).

4. Pr{X > x} = 1− FX(x)

5. Pr{x1 < X ≤ x2} = FX(x2)− FX(x1)

An alternative description of the probability distribution of random variable X is provided by the probability density function (PDF) defined as

fX(x) = dFX(x)/dx

NOTE: A common mistake is to think that fX(x) = Pr{X = x}; it is not always true.

Some Properties of a PDF

1. ∫_{−∞}^{∞} fX(x) dx = 1

2. fX(x) ≥ 0

3. FX(x) = ∫_{−∞}^{x} fX(u) du

4. Pr{x1 < X ≤ x2} = ∫_{x1}^{x2} fX(x) dx

Overall, the PDF fX(x) or the CDF FX(x) provides a complete description of random variable X.
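To make the PDF–CDF relationship concrete, the short Python/NumPy sketch below (using the zero-mean unit-variance Gaussian density as an assumed example) integrates a PDF to recover the CDF and a probability of the form Pr{x1 < X ≤ x2}.

```python
import numpy as np

# Assumed example PDF: zero-mean unit-variance Gaussian.
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

x = np.linspace(-8, 8, 20001)
pdf = f(x)
dx = x[1] - x[0]

# PDF property 1: total area under the PDF is 1.
print(np.trapz(pdf, x))                                 # ~1.0

# PDF property 3: the CDF is the running integral of the PDF.
cdf = np.cumsum(pdf) * dx

# CDF property 5 / PDF property 4: Pr{x1 < X <= x2} = FX(x2) - FX(x1).
x1, x2 = -1.0, 1.0
print(cdf[np.searchsorted(x, x2)] - cdf[np.searchsorted(x, x1)])   # ~0.683
```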


1.1.4 Continuous vs. Discrete Random Variables

Roughly speaking, a continuous random variable has a continuous CDF. A discrete random variable has a staircase CDF. A mixed-type random variable has a CDF containing discontinuities, but the CDF is not necessarily constant between discontinuities. Figure 1.1 illustrates different types of CDFs.

Figure 1.1: CDFs and PDFs of different types of random variables (continuous, discrete, and mixed-type).

Since the PDF is the derivative of the CDF, a continuous random variable has an ordinary (impulse-free) PDF. However, the PDF of a discrete or mixed-type random variable contains impulses due to the discontinuities in the CDF.

PMF

For a discrete random variable, let 𝒳 denote the countable set of all possible values of X(s). We can then define a probability mass function (PMF) as

fX(x) = Pr{X = x}

where x ∈ 𝒳. Note that a PMF is only meaningful for a discrete random variable. The same notation fX(x) is used for both the PDF and the PMF; it is usually clear from the context which type of function is referred to by fX(x).

Example 1.1: Consider the roll of a die. The set of sample points of this probabilistic experiment is S = {1, 2, 3, 4, 5, 6}. The natural definition of an associated random variable is

X(s) = s, s ∈ S.


The corresponding PMF is

fX(x) = 1/6, x ∈ {1, . . . , 6}.

The corresponding PDF is

fX(x) = (1/6) ∑_{i=1}^{6} δ(x − i).

The corresponding CDF is

FX(x) = (1/6) ∑_{i=1}^{6} u(x − i).

Figure 1.2 illustrates the PDF and the CDF for this example. □

Figure 1.2: PDF and CDF of the result of a die roll.
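A quick way to check Example 1.1 is to simulate the die and compare relative frequencies with the PMF and the staircase CDF; the Python/NumPy sketch below is one such check.

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)      # simulate the die of Example 1.1

# Empirical PMF: relative frequency of each face (should approach 1/6).
values, counts = np.unique(rolls, return_counts=True)
print(dict(zip(values, counts / rolls.size)))

# Empirical CDF at a few points; compare with FX(x) = (1/6) * sum_i u(x - i).
for x in (0.5, 2.0, 3.7, 6.0):
    print(x, np.mean(rolls <= x))             # expect 0, 2/6, 3/6, 1
```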


Review Handout 2

1.1.5 Joint and Conditional CDFs and PDFs

The joint CDF of random variables X and Y is defined as

FXY(x, y) = Pr{X ≤ x, Y ≤ y}

Their joint PDF is defined as

fXY(x, y) = ∂²FXY(x, y) / (∂x ∂y)

It follows that

Pr{x1 < X ≤ x2, y1 < Y ≤ y2} = ∫_{y1}^{y2} ∫_{x1}^{x2} fXY(x, y) dx dy.

The PDF for X (or Y) alone is called a marginal PDF of X (or Y) and can be found from the joint PDF by integrating over the other random variable, i.e.

fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy,    fY(y) = ∫_{−∞}^{∞} fXY(x, y) dx.

X and Y are statistically independent (or in short independent) if

fXY (x, y) = fX(x)fY (y) for all pairs (x, y)

The conditional PDF of Y given X is defined as

fY|X(y|x) = fXY(x, y) / fX(x)

Note that, if X and Y are independent, then fY |X(y|x) = fY (y).

Example 1.2: Suppose that fXY(x, y) = (1/4) e^{−|x|−|y|}. The marginal PDF of X is

fX(x) = ∫_{−∞}^{∞} (1/4) e^{−|x|−|y|} dy = (1/4) e^{−|x|} ∫_{−∞}^{∞} e^{−|y|} dy
      = (1/2) e^{−|x|} ∫_{0}^{∞} e^{−y} dy = (1/2) e^{−|x|} · [−e^{−y}]_{0}^{∞} = (1/2) e^{−|x|},

since [−e^{−y}]_{0}^{∞} = 1.

Suppose that we want to evaluate Pr{X ≤ 1, Y ≤ 0}. It can be done as follows.

Pr{X ≤ 1, Y ≤ 0} = ∫_{−∞}^{0} ∫_{−∞}^{1} (1/4) e^{−|x|−|y|} dx dy
                 = (1/4) (∫_{−∞}^{1} e^{−|x|} dx) (∫_{−∞}^{0} e^{−|y|} dy)
                 = (1/4) (e^{x}|_{−∞}^{0} − e^{−x}|_{0}^{1}) (e^{y}|_{−∞}^{0})
                 = (1/4) (2 − e^{−1}) (1) = (1/4)(2 − 1/e). □
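As a sanity check on Example 1.2, the Python/SciPy sketch below evaluates the double integral numerically and compares it with (1/4)(2 − 1/e), and also checks the marginal PDF at one arbitrarily chosen point.

```python
import numpy as np
from scipy import integrate

# Joint PDF of Example 1.2: f_XY(x, y) = (1/4) exp(-|x| - |y|).
f_xy = lambda y, x: 0.25 * np.exp(-abs(x) - abs(y))    # dblquad expects f(y, x)

# Pr{X <= 1, Y <= 0}: x over (-inf, 1], y over (-inf, 0].
p, _ = integrate.dblquad(f_xy, -np.inf, 1.0, lambda x: -np.inf, lambda x: 0.0)
print(p, 0.25 * (2 - 1 / np.e))                         # both ~0.408

# Marginal PDF of X at x0 = 0.7, compared with (1/2) exp(-|x0|).
x0 = 0.7
m, _ = integrate.quad(lambda y: 0.25 * np.exp(-abs(x0) - abs(y)), -np.inf, np.inf)
print(m, 0.5 * np.exp(-abs(x0)))
```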


1.2 Functions of Random Variables

Consider a random variable Y that is obtained as a function of another random variable X. In particular, suppose that Y = g(X). We first consider the case where g is monotonic (either increasing or decreasing).

Monotonic Functions

If g is monotonic, each value y of Y has a unique inverse denoted by g⁻¹(y), as illustrated in figure 1.3.

Figure 1.3: Monotonic function of random variable X.

When g is monotonically increasing,

FY(y) = Pr{Y ≤ y} = Pr{X ≤ g⁻¹(y)} = ∫_{−∞}^{g⁻¹(y)} fX(x) dx,

yielding

fY(y) = dFY(y)/dy = fX(g⁻¹(y)) · dg⁻¹(y)/dy.

Similarly, when g is monotonically decreasing,

fY(y) = −fX(g⁻¹(y)) · dg⁻¹(y)/dy.

It follows that, for a monotonic function g, we have

fY(y) = fX(g⁻¹(y)) · |dg⁻¹(y)/dy|

Example 1.3: Let Y = g(X), where g(x) = ax + b with a ≠ 0. Then, g⁻¹(y) = (y − b)/a, yielding dg⁻¹(y)/dy = 1/a. It follows that

fY(y) = fX((y − b)/a) · |1/a| = (1/|a|) fX((y − b)/a). □


Figure 1.4: Nonmonotonic function of random variable X.

Nonmonotonic Functions

If g is not monotonic, then several values of x can correspond to a single value of y, as illustrated in figure 1.4.

We can view g as having multiple monotonic components g1, . . . , gK, where K is the number of monotonic components, and sum the PDFs from these components, i.e.

fY(y) = ∑_{k=1}^{K} fX(gk⁻¹(y)) · |dgk⁻¹(y)/dy|

Example 1.4: Let Y = g(X), where g(x) = ax² with a > 0, as illustrated in figure 1.5.

Figure 1.5: Y = aX² with a > 0.

Each value of y > 0 corresponds to two values of x, i.e.

x = g1⁻¹(y) = −√(y/a)   or   x = g2⁻¹(y) = √(y/a).

Note that |dg1⁻¹(y)/dy| = |dg2⁻¹(y)/dy| = |(1/(2a)) (y/a)^{−1/2}| = 1/(2√(ay)).

It follows that

fY(y) = (1/(2√(ay))) [fX(−√(y/a)) + fX(√(y/a))]   for y ≥ 0. □


Review Handout 3

1.3 Expected Values

While the PDF or CDF is a complete statistical description of a random variable, we often do not need the whole statistical information. More specifically, it is often sufficient to talk about the mean, the variance, the covariance, and so on, as will be described next.

Mean (Expected Value) of a Random Variable

The mean or expected value of random variable X is defined as

E[X] = ∫_{−∞}^{∞} x fX(x) dx

where E[·] denotes the operator for taking the expected value of a random variable. For convenience, we also denote E[X] by X̄.

Suppose that Y = g(X), i.e. Y is a function of X. One way to find E[Y] is to first compute fY(y) and then compute E[Y] = ∫_{−∞}^{∞} y fY(y) dy. However, it is often easier to use the following identity.

E[Y] = E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx

Another useful property in taking the expectation is the linearity property, which follows directly from the linearity of the integration operation. In particular, for any random variables X1, . . . , XN and any real numbers a1, . . . , aN,

E[∑_{n=1}^{N} an Xn] = ∑_{n=1}^{N} an E[Xn]

The kth moment of random variable X is defined as E[X^k]. The kth central moment of X is defined as E[(X − X̄)^k]. Some of the commonly used parameters are listed below.

• Mean of X, denoted by E[X] or X̄: the mean of X is equal to the 1st moment of X.

• Mean square of X, denoted by E[X²]: the mean square of X is equal to the 2nd moment of X. More specifically,

E[X²] = ∫_{−∞}^{∞} x² fX(x) dx


• Variance of X, denoted by var[X] or σX²: the variance of X is equal to the 2nd central moment of X. More specifically,

var[X] = ∫_{−∞}^{∞} (x − X̄)² fX(x) dx

where var[·] denotes the operator for taking the variance of a random variable.

• Standard deviation of X, denoted by σX: the standard deviation of X is equal to the positive square root of the variance.

Note that the mean E[X] can be thought of as the best guess of X in terms of the mean square error. In particular, consider the problem of finding a number a that minimizes the mean square error MSE = E[(X − a)²]. We show below that the error is minimized by setting a = E[X]. In particular, solving dMSE/da = 0 yields

0 = d/da (E[X² − 2aX + a²]) = d/da (E[X²] − 2aE[X] + a²) = −2E[X] + 2a,

or equivalently a = E[X].

Roughly speaking, the variance σX² measures the effective width of the PDF around the mean. We next provide a more quantitative discussion of the variance.

Theorem (Markov inequality): For a nonnegative random variable X,

Pr{X ≥ a} ≤ E[X]/a.

Proof: Pr{X ≥ a} = ∫_{a}^{∞} fX(x) dx ≤ ∫_{a}^{∞} (x/a) fX(x) dx ≤ (1/a) ∫_{0}^{∞} x fX(x) dx = E[X]/a. □

Theorem (Chebyshev inequality): For a random variable X,

Pr{|X − E[X]| ≥ b} ≤ σX²/b².

Proof: Take |X − E[X]|² as the random variable in the Markov inequality. □

Figure 1.6 illustrates how Pr{|X − E[X]| ≥ b} in the Chebyshev inequality is equal to the area under the "tails" of the PDF. In particular, for b = 2σX, we have

Pr{|X − E[X]| ≥ 2σX} ≤ 1/4,

which means that we can expect at least 75% of observations on random variable X to be within the range E[X] ± 2σX. Thus, the smaller the variance, the smaller the spread of likely values.


Figure 1.6: Area under the PDF tails for the Chebyshev inequality.

To help compute the variance σX², the following identity is sometimes useful.

σX² = E[X²] − X̄²

The above identity can be obtained by writing

σX² = E[(X − X̄)²] = E[X² − 2X̄X + X̄²] = E[X²] − 2X̄² + X̄² = E[X²] − X̄².

Example 1.5: Consider the Laplace PDF

fX(x) = (1/2) e^{−|x|},

which is an even function. The mean, mean square, and standard deviation are computed as follows.

E[X] = ∫_{−∞}^{∞} x · (1/2) e^{−|x|} dx = 0

E[X²] = 2 ∫_{0}^{∞} x² · (1/2) e^{−x} dx = 2

σX = √(E[X²] − X̄²) = √(2 − 0) = √2

Finally, we compute Pr{|X − E[X]| < 2σX} below.

Pr{|X − E[X]| < 2σX} = 2 ∫_{0}^{2√2} (1/2) e^{−x} dx ≈ 0.94

Note that the value 0.94 is higher than the lower bound of 0.75 given by the Chebyshev inequality. □

Multivariate Expectations

Consider a function g(X,Y ) of two random variables X and Y . Then,

E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y(x, y) dx dy

More generally, for a function g(X1, . . . , XN) of N random variables X1, . . . , XN ,

E[g(X1, . . . , XN)] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(x1, . . . , xN) fX1,...,XN(x1, . . . , xN) dx1 · · · dxN


When g(X, Y ) = XY , we have

E[XY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y fX,Y(x, y) dx dy.

In addition, if X and Y are independent, i.e. fX,Y (x, y) = fX(x)fY (y), we can write

E[XY] = (∫_{−∞}^{∞} x fX(x) dx) (∫_{−∞}^{∞} y fY(y) dy) = E[X] E[Y].

Thus, for independent random variables X and Y ,

E[XY ] = E[X]E[Y ] for independent X and Y

Sum of Random Variables

Let Z = X + Y, where X and Y are random variables. The mean and the variance of Z are computed below.

E[Z] = E[X] + E[Y] = X̄ + Ȳ

σZ² = E[(X + Y − (X̄ + Ȳ))²] = E[((X − X̄) + (Y − Ȳ))²]
    = E[(X − X̄)² + (Y − Ȳ)² + 2(X − X̄)(Y − Ȳ)]
    = σX² + σY² + 2E[(X − X̄)(Y − Ȳ)]
    = σX² + σY² + 2(E[XY] − X̄ Ȳ)

For random variables X and Y , the correlation between X and Y is defined as

RXY = E[XY ]

The covariance between X and Y is defined as

CXY = E[(X − X̄)(Y − Ȳ)] = E[XY] − X̄ Ȳ

The covariance normalized by the respective standard deviations is called the correlation coefficient of X and Y, which is written as

ρXY = CXY / (σX σY).

It is left as an exercise to show that −1 ≤ ρXY ≤ 1.

Random variables X and Y are uncorrelated if CXY = 0. Note that, if X and Y are uncorrelated, then the variance of Z = X + Y is

σZ² = σX² + σY².

In general, for a sum of N random variables X1 + . . . + XN,

E[∑_{n=1}^{N} Xn] = ∑_{n=1}^{N} E[Xn]

In addition,

var[∑_{n=1}^{N} Xn] = ∑_{n=1}^{N} var[Xn]   for uncorrelated X1, . . . , XN

Finally, recall that E[XY] = E[X]E[Y] for independent X and Y. It follows that independent random variables are uncorrelated. However, the converse is not true in general.
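A standard illustration of the last remark (an assumed construction, not taken from the notes): let X be zero-mean Gaussian and Y = X². Then CXY = E[XY] − X̄ Ȳ = E[X³] = 0, so X and Y are uncorrelated, yet Y is completely determined by X. The Python/NumPy sketch below checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2                                   # uncorrelated with x, but clearly dependent on x

c_xy = np.mean(x * y) - np.mean(x) * np.mean(y)        # sample covariance C_XY
print(c_xy)                                # ~0: uncorrelated

# Dependence shows up beyond second-order statistics, e.g. E[X^2 Y] != E[X^2] E[Y].
print(np.mean(x**2 * y), np.mean(x**2) * np.mean(y))   # ~3 vs ~1
```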


1.4 Real and Complex Random Vectors and Their Functions

1.4.1 Real Random Vectors

A real random vector is a vector of random variables. In particular, let X = (X1, . . . , XN), where X1, . . . , XN are random variables. By convention, a real random vector is a column vector. The statistics of X are fully described by the joint CDF of X1, . . . , XN, i.e.

FX(x) = FX1,...,XN(x1, . . . , xN) = Pr{X1 ≤ x1, . . . , XN ≤ xN},

or the joint PDF of X1, . . . , XN , i.e.

fX(x) = fX1,...,XN(x1, . . . , xN) = ∂^N FX1,...,XN(x1, . . . , xN) / (∂x1 · · · ∂xN).

The mean vector of a real random vector X is defined as

X̄ = (X̄1, . . . , X̄N)

The correlation matrix of X is defined as

RX = E[XXᵀ] =
    [ RX1X1  · · ·  RX1XN ]
    [   ⋮      ⋱      ⋮   ]
    [ RXNX1  · · ·  RXNXN ]

The covariance matrix of X is defined as

CX = E[(X − X̄)(X − X̄)ᵀ] =
    [ CX1X1  · · ·  CX1XN ]
    [   ⋮      ⋱      ⋮   ]
    [ CXNX1  · · ·  CXNXN ]

Note that the diagonal entries of CX are the variances of X1, . . . , XN. In addition, if X1, . . . , XN are uncorrelated, then CX is a diagonal matrix.

1.4.2 Complex Random Variables

A complex random variable Z is defined in terms of two random variables X and Y as

Z = X + iY.

The mean of Z is

E[Z] = Z̄ = X̄ + iȲ,

while the variance of Z is

σZ² = E[|Z − Z̄|²].

The covariance of two complex random variables Z1 and Z2 is defined as

CZ1Z2 = E[(Z1 − Z̄1)(Z2 − Z̄2)*]


1.4.3 Functions of Random Vectors

Consider N random variables X1, . . . , XN. Let Y1, . . . , YN be functions of X1, . . . , XN. In particular,

Yn = gn(X1, . . . , XN), n = 1, . . . , N.

Let X = (X1, . . . , XN) and Y = (Y1, . . . , YN). In addition, let g(x) = (g1(x), . . . , gN(x)). Assuming that g is invertible, the joint PDF of Y can be written in terms of the joint PDF of X as

fY(y) = |J(y)| fX(g⁻¹(y))

where J(y) is the Jacobian determinant

J(y) = det
    [ ∂g1⁻¹(y)/∂y1  · · ·  ∂g1⁻¹(y)/∂yN ]
    [      ⋮          ⋱         ⋮       ]
    [ ∂gN⁻¹(y)/∂y1  · · ·  ∂gN⁻¹(y)/∂yN ]

and g⁻¹(y) is the inverse function vector

g⁻¹(y) = (g1⁻¹(y), . . . , gN⁻¹(y)).

Suppose that there are multiple solutions of x for y = g(x). We can view g as having multiple components g1, . . . , gK. It follows that

fY(y) = ∑_{k=1}^{K} |Jk(y)| fX(gk⁻¹(y))

where Jk(y) is the Jacobian determinant

Jk(y) = det
    [ ∂gk,1⁻¹(y)/∂y1  · · ·  ∂gk,1⁻¹(y)/∂yN ]
    [       ⋮            ⋱          ⋮       ]
    [ ∂gk,N⁻¹(y)/∂y1  · · ·  ∂gk,N⁻¹(y)/∂yN ]

and gk⁻¹(y) is the inverse function vector

gk⁻¹(y) = (gk,1⁻¹(y), . . . , gk,N⁻¹(y)).

Example 1.6: Suppose that we know the joint PDF fX1,X2(x1, x2) for random variables X1 and X2. Define

Y1 = g1(X1, X2) = X1 + X2

Y2 = g2(X1, X2) = X1

We find fY1,Y2(y1, y2) in terms of fX1,X2(·, ·) as follows. First, we write g1⁻¹(y1, y2) = y2 and g2⁻¹(y1, y2) = y1 − y2. Then,

J(y1, y2) = det [ 0  1 ; 1  −1 ] = −1,

yielding

fY1,Y2(y1, y2) = fX1,X2(y2, y1 − y2). □


Review Handout 4

1.4.3 Functions of Random Vectors (Continued)

Linear Transformation

Consider random variables Y1, . . . , YN obtained from linear transformations of X1, . . . , XN, i.e.

Ym = ∑_{n=1}^{N} αmn Xn,   m = 1, . . . , N,

where the αmn's are constant coefficients. By defining X = (X1, . . . , XN), Y = (Y1, . . . , YN), and

A =
    [ α11  · · ·  α1N ]
    [  ⋮     ⋱    ⋮   ]
    [ αN1  · · ·  αNN ],

we can write

Y = AX

Assuming that A is invertible, we have X = A⁻¹Y. In addition, the Jacobian determinant for the transformation is

J(y) = det A⁻¹ = 1/det A.

It follows that

fY(y) = (1/|det A|) · fX(A⁻¹y)

Jointly Gaussian Random Vectors

A set of random variables X1, . . . , XN are zero-mean jointly Gaussian if there is a set of independent and identically distributed (IID) zero-mean unit-variance Gaussian random variables Z1, . . . , ZM such that we can write

Xn = ∑_{m=1}^{M} αn,m Zm

for all n = 1, . . . , N. For convenience, define random vectors X = (X1, . . . , XN) and Z = (Z1, . . . , ZM). In addition, define a matrix

A =
    [ α1,1  · · ·  α1,M ]
    [  ⋮      ⋱     ⋮   ]
    [ αN,1  · · ·  αN,M ]


so that we can write X = AZ. We shall derive the PDF fX(x) in what follows. For simplicity, we focus on the case with M = N. However, the resultant PDF expressions are also valid for M ≠ N.

We begin with the marginal PDF of Zm, which is the zero-mean unit-variance Gaussian PDF, i.e.

fZ(z) = (1/√(2π)) e^{−z²/2}.

Since Zm’s are IID, we can write

fZ(z) = ∏_{m=1}^{N} (1/√(2π)) e^{−zm²/2} = (1/(2π)^{N/2}) e^{−zᵀz/2}.

Using the identity fX(x) = (1/|det A|) fZ(A⁻¹x), we can write

fX(x) = (1/((2π)^{N/2} |det A|)) e^{−(A⁻¹x)ᵀ(A⁻¹x)/2} = (1/((2π)^{N/2} |det A|)) e^{−xᵀ(AAᵀ)⁻¹x/2},

where the last equality follows from the fact that

(A⁻¹x)ᵀ(A⁻¹x) = xᵀ(A⁻¹)ᵀA⁻¹x = xᵀ(Aᵀ)⁻¹A⁻¹x = xᵀ(AAᵀ)⁻¹x.

Let CX be the covariance matrix for random vector X. It is easy to see that X̄ = AZ̄ = 0, yielding

CX = E[XXᵀ] = E[AZ(AZ)ᵀ] = E[AZZᵀAᵀ] = A E[ZZᵀ] Aᵀ = AAᵀ,

where the last equality follows from the fact that E[ZZᵀ] = I. Since CX = AAᵀ,

det CX = det(AAᵀ) = det A · det Aᵀ = |det A|²,

yielding |det A| = √(det CX). In conclusion, we can write

fX(x) = (1/((2π)^{N/2} √(det CX))) e^{−xᵀCX⁻¹x/2}   (zero-mean jointly Gaussian)

More generally, a random vector X is jointly Gaussian if X = X′ + µ, where X′ is zero-mean jointly Gaussian and µ is a constant vector in R^N. Note that X̄ = µ. For a jointly Gaussian random vector X, the joint PDF is given by

fX(x) = (1/((2π)^{N/2} √(det CX))) e^{−(x − X̄)ᵀCX⁻¹(x − X̄)/2}   (jointly Gaussian)

The proof is similar to the zero-mean jointly Gaussian case and is omitted.

Some important properties of a jointly Gaussian random vector X are listed below.

1. A linear transformation of X yields another jointly Gaussian random vector.

2. The PDF of X is fully determined by the mean X̄ and the covariance matrix CX, which are the first-order and second-order statistics.

3. Jointly Gaussian random variables that are uncorrelated are independent.


Example 1.7: Recall that the Gaussian PDF has the form

fX(x) = (1/√(2πσX²)) e^{−(x−X̄)²/(2σX²)}.

We now show that two jointly Gaussian random variables are independent if they are uncorrelated. Let X1 and X2 be jointly Gaussian and uncorrelated. It follows that the covariance matrix of X = (X1, X2) has the form

CX = [ σ1²  0 ; 0  σ2² ],

where σ1² and σ2² are the variances of X1 and X2, respectively. By substituting

√(det CX) = σ1σ2   and   CX⁻¹ = [ 1/σ1²  0 ; 0  1/σ2² ]

into the joint PDF expression of X, we can write

fX1,X2(x1, x2) = (1/(2πσ1σ2)) e^{−[(x1−X̄1)²/σ1² + (x2−X̄2)²/σ2²]/2}
              = ((1/√(2πσ1²)) e^{−(x1−X̄1)²/(2σ1²)}) ((1/√(2πσ2²)) e^{−(x2−X̄2)²/(2σ2²)})
              = fX1(x1) fX2(x2),

which implies that X1 and X2 are independent. The argument can in fact be extended in a straightforward manner to show that uncorrelated jointly Gaussian random variables X1, . . . , XN are independent. □
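The construction X = AZ + µ is also how jointly Gaussian vectors are typically generated in practice. The Python/NumPy sketch below (with an assumed mean vector and covariance matrix) takes A as a Cholesky factor of CX, so that AAᵀ = CX, and checks the sample mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])                 # assumed mean vector
C_X = np.array([[2.0, 0.6],
                [0.6, 1.0]])               # assumed covariance matrix (positive definite)

A = np.linalg.cholesky(C_X)                # A @ A.T == C_X
Z = rng.standard_normal((2, 100_000))      # IID zero-mean unit-variance Gaussians
X = A @ Z + mu[:, None]                    # X = A Z + mu is jointly Gaussian

print(X.mean(axis=1))                      # ~mu
print(np.cov(X))                           # ~C_X
```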

1.5 Common Probability Models

Below are some probability models commonly used in engineering applications.

Continuous Random Variables

• Uniform on [a, b]: PDF 1/(b − a); CDF (x − a)/(b − a); mean (a + b)/2; variance (b − a)²/12.

• Gaussian, x ∈ R: PDF (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}; CDF 1/2 + (1/2) erf((x − µ)/(√2 σ)); mean µ; variance σ².

• Exponential (λ > 0), x ≥ 0: PDF λ e^{−λx}; CDF 1 − e^{−λx}; mean 1/λ; variance 1/λ².

• Rayleigh, x ≥ 0: PDF (x/β²) e^{−x²/(2β²)}; CDF 1 − e^{−x²/(2β²)}; mean β√(π/2); variance (4 − π)β²/2.

Discrete Random Variables

• Binomial (0 < p < 1, q = 1 − p), k = 0, . . . , n: PDF ∑_{k=0}^{n} (n choose k) p^k q^{n−k} δ(x − k); CDF ∑_{k=0}^{n} (n choose k) p^k q^{n−k} u(x − k); mean np; variance npq.

• Poisson (λ > 0), k = 0, 1, . . .: PDF ∑_{k=0}^{∞} (λ^k e^{−λ}/k!) δ(x − k); CDF ∑_{k=0}^{∞} (λ^k e^{−λ}/k!) u(x − k); mean λ; variance λ.


Example Engineering Applications

• Uniform: modeling of quantization error.

• Gaussian: amplitude distribution of thermal noise; approximation of other distributions.

• Exponential: message length and interarrival time in data communications.

• Rayleigh: fading in communication channels; envelope of bandpass Gaussian noise.

• Binomial: number of random transmission errors in a transmitted block of n digits.

• Poisson: traffic model, e.g. number of message arrivals in a given time interval.

NOTE: The error function is defined as

erf(x) = (2/√π) ∫_{0}^{x} e^{−u²} du,   x > 0
erf(−x) = −erf(x),   x < 0

Note that the above integral cannot be evaluated in closed form. Hence, in practice, the error function is evaluated using a table lookup. In typical computational software, e.g. MATLAB, there is a command to evaluate the error function.

In digital communications, it is customary to use the Q function, where

Q(x) = ∫_{x}^{∞} (1/√(2π)) e^{−u²/2} du.

In words, Q is the complementary CDF of the zero-mean unit-variance Gaussian random variable. It is useful to note the relationship

Q(x) = 1/2 − (1/2) erf(x/√2)
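In Python, for example, scipy.special provides erf and erfc, so the Q function can be evaluated directly; a minimal sketch:

```python
import numpy as np
from scipy.special import erf, erfc

def q_func(x):
    # Q(x) = 1/2 - (1/2) erf(x / sqrt(2)), equivalently (1/2) erfc(x / sqrt(2)).
    return 0.5 * erfc(x / np.sqrt(2))

for x in (0.0, 1.0, 2.0, 3.0):
    print(x, q_func(x), 0.5 - 0.5 * erf(x / np.sqrt(2)))   # the two forms agree
```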

Appendix: Central Limit Theorem

The Gaussian distribution plays an important role in statistics due to the central limit theorem stated below.

Central Limit Theorem: Consider IID random variables X1, . . . , XN with mean X̄ and variance σX². Define the sample mean SN = (1/N) ∑_{n=1}^{N} Xn. Then,

lim_{N→∞} Pr{ (SN − X̄)/(σX/√N) ≤ a } = ∫_{−∞}^{a} (1/√(2π)) e^{−x²/2} dx.

Roughly speaking, the CLT states that, as N gets large, the CDF of (SN − X̄)/(σX/√N) approaches that of a zero-mean unit-variance Gaussian random variable.
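A quick empirical illustration of the CLT (an assumed setup, using uniform Xn so that the summands are clearly non-Gaussian): normalize the sample mean and compare its empirical CDF with the Gaussian CDF at a few points.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
N, trials = 50, 200_000

# X_n ~ uniform[0, 1]: mean 1/2, variance 1/12.
x = rng.random((trials, N))
s_n = x.mean(axis=1)                                  # sample mean S_N for each trial
w = (s_n - 0.5) / (sqrt(1 / 12) / sqrt(N))            # (S_N - mean) / (sigma / sqrt(N))

# Compare Pr{W <= a} with the standard Gaussian CDF.
for a in (-2.0, -1.0, 0.0, 1.0, 2.0):
    gaussian_cdf = 0.5 * (1 + erf(a / sqrt(2)))
    print(a, np.mean(w <= a), gaussian_cdf)
```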


Handout 5

1.6 Characteristic Functions

The characteristic function of a random variable X is defined as

ΦX(ν) = E[e^{jνX}] = ∫_{−∞}^{∞} e^{jνx} fX(x) dx

Since the integration in the above definition resembles the inverse Fourier transform, it follows that ΦX(ν) and fX(x) are a Fourier transform pair. More explicitly, if we substitute x by f and ν by 2πt, then we can write

ΦX(2πt) = ∫_{−∞}^{∞} e^{j2πft} fX(f) df,

which implies that

ΦX(2πt) ↔ fX(f)

It follows that

fX(x) = (1/(2π)) ∫_{−∞}^{∞} ΦX(ν) e^{−jνx} dν

since ΦX(ν) and fX(x) form a Fourier transform pair. The characteristic function can be used instead of the PDF as a complete statistical description of a random variable. By using characteristic functions, we can exploit properties of Fourier transform pairs to compute several quantities of interest, as indicated below.

PDF of a Sum of Independent Random Variables

Consider two independent random variables X and Y with PDFs fX(x) and fY(y), respectively. Let Z = X + Y. The PDF of Z, i.e. fZ(z), can be found through the use of characteristic functions as follows.

ΦZ(ν) = E[e^{jνZ}] = E[e^{jν(X+Y)}] = E[e^{jνX} e^{jνY}] = E[e^{jνX}] E[e^{jνY}] = ΦX(ν) ΦY(ν).

Note that the second-to-last equality follows from the independence between X and Y. Since multiplication in the time domain corresponds to convolution in the frequency domain, having ΦZ(ν) = ΦX(ν)ΦY(ν) is equivalent to having

fZ(z) = fX(z) ∗ fY (z)

More generally, suppose that Z = ∑_{n=1}^{N} Xn, where X1, . . . , XN are independent with PDFs fX1(x1), . . . , fXN(xN). We can find the PDF fZ(z) through its characteristic function as follows.

ΦZ(ν) = E[e^{jν ∑_{n=1}^{N} Xn}] = E[e^{jνX1} · · · e^{jνXN}] = ∏_{n=1}^{N} E[e^{jνXn}] = ∏_{n=1}^{N} ΦXn(ν)


Hence, for independent X1, . . . , XN and Z = ∑_{n=1}^{N} Xn,

ΦZ(ν) = ∏_{n=1}^{N} ΦXn(ν),    fZ(z) = fX1(z) ∗ · · · ∗ fXN(z)

In the special case where X1, . . . , XN are IID,

ΦZ(ν) = (ΦX(ν))^N,    fZ(z) = fX(z) ∗ · · · ∗ fX(z)   (N-fold convolution)
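The convolution rule is easy to verify numerically. The Python/NumPy sketch below (assuming two independent exponential summands with the same rate λ) convolves their PDFs on a grid and compares the result with a histogram of Z = X1 + X2.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.5
x_grid = np.linspace(0.0, 15.0, 3001)
dx = x_grid[1] - x_grid[0]

f = lam * np.exp(-lam * x_grid)               # exponential PDF sampled on the grid

# f_Z = f * f by numerical convolution (rectangle rule), truncated to the same grid.
f_z = np.convolve(f, f)[: x_grid.size] * dx

# Monte Carlo check: histogram of Z = X1 + X2.
z = rng.exponential(1 / lam, 500_000) + rng.exponential(1 / lam, 500_000)
hist, _ = np.histogram(z, bins=x_grid, density=True)

for i in (100, 400, 1000, 2000):
    # convolution vs histogram (the closed form here is lam^2 * x * exp(-lam * x))
    print(x_grid[i], f_z[i], hist[i])
```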

Finding the nth Moment of a Random Variable

The characteristic function ΦX(ν) is also related to the nth moment of random variable X, i.e. E[X^n], as will be described next. Consider taking the first derivative of ΦX(ν), i.e.

d/dν ΦX(ν) = d/dν [∫_{−∞}^{∞} e^{jνx} fX(x) dx] = j ∫_{−∞}^{∞} x e^{jνx} fX(x) dx.

Notice that setting ν = 0 will make the last integral equal to the mean of X, yielding

E[X] = −j · (d/dν) ΦX(ν)|_{ν=0}.

The above argument can be extended to obtain the nth moment (as long as the characteristic function is differentiable up to the nth order), i.e.

E[X^n] = (−j)^n · (d^n/dν^n) ΦX(ν)|_{ν=0}

Finally, suppose that ΦX(ν) can be expressed as a Taylor series expansion around ν = 0. Then, ΦX(ν) can be written in terms of the moments of X as follows.

ΦX(ν) = ∑_{n=0}^{∞} (ν^n/n!) · (d^n/dν^n) ΦX(ν)|_{ν=0} = ∑_{n=0}^{∞} ((jν)^n/n!) · E[X^n]

Characteristic Functions of Gaussian Random Variables

Consider a Gaussian random variable X with mean X̄ and variance σX². We first write ΦX(ν) as follows.

ΦX(ν) = ∫_{−∞}^{∞} e^{jνx} fX(x) dx = ∫_{−∞}^{∞} e^{jνx} (1/√(2πσX²)) e^{−(x−X̄)²/(2σX²)} dx

Recall the following Fourier transform pair for the Gaussian pulse.

A e^{−πt²/τ²} ↔ Aτ e^{−πτ²f²}

Using the frequency shifting property, we can write

A e^{−πt²/τ²} e^{j2πX̄t} ↔ Aτ e^{−πτ²(f−X̄)²}


By setting A = 1 and τ = 1/√(2πσX²), we can write

e^{−σX²(2πt)²/2} e^{jX̄(2πt)} ↔ (1/√(2πσX²)) e^{−(f−X̄)²/(2σX²)}

Since the right-hand side is equal to fX(f), the left-hand side is equal to ΦX(2πt). It follows that

ΦX(ν) = e^{jX̄ν − σX²ν²/2}

Finally, note that it is possible to obtain the above expression through direct integration; however, more computation would be involved.

Let X1, . . . , XN be independent Gaussian random variables with means X̄1, . . . , X̄N and variances σ1², . . . , σN². In addition, let Z = ∑_{n=1}^{N} Xn. We find ΦZ(ν) as follows. Since X1, . . . , XN are independent,

ΦZ(ν) = ∏_{n=1}^{N} ΦXn(ν) = ∏_{n=1}^{N} e^{jX̄nν − σn²ν²/2} = e^{j(∑_{n=1}^{N} X̄n)ν − (∑_{n=1}^{N} σn²)ν²/2}.

Note that ΦZ(ν) is the characteristic function of a Gaussian random variable with mean ∑_{n=1}^{N} X̄n and variance ∑_{n=1}^{N} σn². Hence, we have just shown that a sum of independent Gaussian random variables is another Gaussian random variable, with mean equal to the sum of the individual means and variance equal to the sum of the individual variances.

Moment Generating Function

The moment generating function of a random variable X is defined as

ΨX(s) = E[e^{sX}]

Note that the moment generating function is equivalent to the characteristic function ΦX(ν) when s = jν.

As the name suggests, there is a close relationship between ΨX(s) and the nth moment of X. In particular,

E[X^n] = (d^n/ds^n) ΨX(s)|_{s=0}

The proof is quite similar to using the characteristic function and is thus omitted.

Example 1.8: Consider the exponential PDF

fX(x) = λ e^{−λx},   x ≥ 0,

where λ > 0. The characteristic function is computed below.

ΦX(ν) = ∫_{0}^{∞} e^{jνx} · λ e^{−λx} dx = λ ∫_{0}^{∞} e^{(jν−λ)x} dx
      = λ · e^{(jν−λ)x}/(jν − λ)|_{0}^{∞} = −λ/(jν − λ) = jλ/(ν + jλ)


The mean or first moment is computed below.

E[X] = −j · (d/dν) ΦX(ν)|_{ν=0} = −j · (−jλ/(ν + jλ)²)|_{ν=0} = 1/λ

The second moment is computed below.

E[X²] = −(d²/dν²) ΦX(ν)|_{ν=0} = −(2jλ/(ν + jλ)³)|_{ν=0} = 2/λ²

The variance is computed below.

var[X] = E[X²] − X̄² = 2/λ² − (1/λ)² = 1/λ². □

Appendix: Central Moments of Gaussian Random Variables

We show in this appendix that, for a Gaussian random variable X with mean X̄ and variance σX²,

E[(X − X̄)^n] = 1 · 3 · 5 · · · (n − 1) σX^n for n even, and E[(X − X̄)^n] = 0 for n odd.

For convenience, let Y = X − X̄. Note that E[Y^n] = E[(X − X̄)^n]. Then, Y is Gaussian with zero mean and variance σX². It follows that ΦY(ν) = e^{−σX²ν²/2}. We use the fact that e^t = ∑_{m=0}^{∞} t^m/m! to write

ΦY(ν) = ∑_{m=0}^{∞} (−σX²ν²/2)^m / m! = ∑_{m=0}^{∞} (−1)^m σX^{2m} ν^{2m} / (2^m m!) = ∑_{m=0}^{∞} ((jν)^{2m}/(2m)!) · ((2m)! σX^{2m} / (2^m m!))
      = ∑_{n=0, n even}^{∞} ((jν)^n/n!) · (n! σX^n / (2^{n/2} (n/2)!)) = ∑_{n=0, n even}^{∞} ((jν)^n/n!) · (1 · 2 · 3 · · · n / (2 · 4 · 6 · · · n)) σX^n
      = ∑_{n=0, n even}^{∞} ((jν)^n/n!) · 1 · 3 · 5 · · · (n − 1) σX^n.

By comparing term by term with the Taylor series expansion mentioned previously, i.e.

ΦY(ν) = ∑_{n=0}^{∞} ((jν)^n/n!) · E[Y^n],

the desired expression follows. □


Review Handout 6

2.7 Upper Bounds on the Tail Probabilities

2.7.1 Another Look at the Chebyshev Inequality

Recall that, for a random variable X with mean X̄ and variance σX²,

Pr{|X − X̄| ≥ δ} ≤ σX²/δ²

where δ > 0.

We now provide an alternative derivation of this inequality. Consider the function g(y) defined as follows.

g(y) = 1 if |y| ≥ δ, and g(y) = 0 if |y| < δ.

Figure 2.7 illustrates that g(y) ≤ y²/δ², which implies that

E[g(Y)] ≤ E[Y²/δ²]

for an arbitrary random variable Y . Note that

E[g(Y)] = 0 · Pr{|Y| < δ} + 1 · Pr{|Y| ≥ δ} = Pr{|Y| ≥ δ}.

In addition, if Y has zero mean, then E[Y²/δ²] = σY²/δ², yielding

Pr{|Y| ≥ δ} ≤ σY²/δ².

Finally, let Y = X − X̄. Since σX² = σY², we can write the desired expression, i.e.

Pr{|X − X̄| ≥ δ} ≤ σX²/δ².

Figure 2.7: Bound on the function g(y) for the Chebyshev inequality.

The Chebyshev bound is found to be "loose" for a large number of practical applications. One reason is the looseness of the function y²/δ² as an upper bound on the function g(y).


2.7.2 Chernoff Bound

Tighter upper bounds can often be obtained using the Chernoff bound, which is derived as follows. First, define the function g(x) as

g(x) = 1 if x ≥ δ, and g(x) = 0 if x < δ.

Figure 2.8 illustrates that g(x) ≤ e^{s(x−δ)} for any s > 0, which implies that

E[g(X)] ≤ E[e^{s(X−δ)}]

for an arbitrary random variable X. Note that

E[g(X)] = 0 · Pr{X < δ} + 1 · Pr{X ≥ δ} = Pr{X ≥ δ}.

It follows that

Pr{X ≥ δ} ≤ e^{−sδ} E[e^{sX}],   s > 0

Figure 2.8: Bound on the function g(x) for the Chernoff bound.

The above expression gives an upper bound on the "upper tail" of the PDF. The tightest bound can be obtained by minimizing the upper bound expression with respect to s, i.e. solving for s from

0 = (d/ds) E[e^{s(X−δ)}] = E[(X − δ) e^{s(X−δ)}] = e^{−sδ} (E[X e^{sX}] − δ E[e^{sX}]).

Thus, the tightest bound is obtained by setting s = s∗, where

E[X e^{s*X}] = δ E[e^{s*X}],   s* > 0

An upper bound on the “lower tail” of the PDF can be derived similarly, yielding

Pr{X ≤ δ} ≤ e^{−sδ} E[e^{sX}],   s < 0

The tightest bound is obtained by setting s = s∗, where

E[X e^{s*X}] = δ E[e^{s*X}],   s* < 0


Another Look at the Chernoff Bound

Recall that the Chebyshev bound can be derived from the Markov inequality, i.e. Pr{X ≥ a} ≤ X̄/a for a nonnegative random variable X. Similarly, we can derive the Chernoff bound from the Markov inequality, as stated formally below.

Theorem (Chernoff bound): For a random variable X,

Pr{X ≥ δ} ≤ e^{−sδ} E[e^{sX}],   s > 0
Pr{X ≤ δ} ≤ e^{−sδ} E[e^{sX}],   s < 0

Proof: Take e^{sX} as the random variable in the Markov inequality. In addition, view the event X ≥ δ as being equivalent to e^{sX} ≥ e^{sδ} for s > 0. Finally, view the event X ≤ δ as being equivalent to e^{sX} ≥ e^{sδ} for s < 0. □

Example 2.9: Consider the Laplace PDF

fX(x) = (1/2) e^{−|x|}.

It is left as an exercise to verify that X̄ = 0, σX² = 2, E[e^{sX}] = 1/(1 − s²) for |s| < 1, and

Pr{X ≥ δ} = (1/2) e^{−δ}   (exact)

for any δ > 0. The Chebyshev bound is

Pr{|X| ≥ δ} ≤ 2/δ².

Since fX(x) is even, we can write

Pr{X ≥ δ} ≤ 1/δ²   (Chebyshev)

The Chernoff bound is given by

Pr{X ≥ δ} ≤ e^{−sδ} E[e^{sX}] = e^{−sδ}/(1 − s²).

The bound can be optimized by setting s = (−1 + √(1 + δ²))/δ, yielding

Pr{X ≥ δ} ≤ (δ²/(2(−1 + √(1 + δ²)))) e^{1−√(1+δ²)} ≈ (δ/2) e^{−δ}   (Chernoff)

for δ ≫ 1. Thus, the Chernoff bound (with exponential decrease) is much tighter than the Chebyshev bound (with polynomial decrease) for large δ. □
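The Python/NumPy sketch below tabulates the three quantities in Example 2.9 for a few values of δ, which makes the difference in decay rates visible.

```python
import numpy as np

for delta in (2.0, 4.0, 8.0, 16.0):
    exact = 0.5 * np.exp(-delta)                     # Pr{X >= delta} for the Laplace PDF
    chebyshev = 1.0 / delta**2
    s = (-1.0 + np.sqrt(1.0 + delta**2)) / delta     # optimizing s in the Chernoff bound
    chernoff = np.exp(-s * delta) / (1.0 - s**2)
    print(f"delta={delta:5.1f}  exact={exact:.2e}  Chebyshev={chebyshev:.2e}  Chernoff={chernoff:.2e}")
```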


2.7.3 Tail Probabilities for a Sum of IID Random Variables

Let X1, X2, . . . be IID random variables with finite mean X̄ and finite variance σX². Define the sample mean

SN = (1/N) ∑_{n=1}^{N} Xn.

Note that the mean of SN is

E[SN] = E[(1/N) ∑_{n=1}^{N} Xn] = (1/N) ∑_{n=1}^{N} E[Xn] = (1/N) · N X̄ = X̄.

In addition, since X1, X2, . . . are IID,

var[SN] = var[(1/N) ∑_{n=1}^{N} Xn] = (1/N²) ∑_{n=1}^{N} var[Xn] = (1/N²) · N σX² = σX²/N.

Since var[SN] goes to 0 as N → ∞, we expect that SN approaches X̄. The following weak law of large numbers states that this is the case with probability approaching 1 as N → ∞.

Theorem (Weak law of large numbers): For any δ > 0,

lim_{N→∞} Pr{|SN − X̄| ≥ δ} = 0.

Proof: Take SN as the random variable in the Chebyshev inequality and consider the limit as N → ∞. □

As discussed above, the weak law of large numbers results from applying the Chebyshev inequality to the sample mean SN. Let us now consider applying the Chernoff bound to SN. We start by writing, for any s > 0,

Pr{SN ≥ δ} = Pr{N SN ≥ Nδ} ≤ e^{−sNδ} E[e^{sN SN}] = e^{−sNδ} E[e^{sX1} · · · e^{sXN}].

Since X1, . . . , XN are IID, we have

Pr{SN ≥ δ} ≤ (e^{−sδ} E[e^{sX1}])^N.

The bound is minimized by choosing s = s* such that the derivative of the bound is equal to zero, i.e.

E[X1 e^{s*X1}] = δ E[e^{s*X1}],   s* > 0


Example 2.10 : Consider IID random variables X1, X2, . . . with

Xn = 1 with probability p, and Xn = −1 with probability 1 − p,

where we assume that p < 1/2. We shall use the Chernoff bound to show that

Pr{∑_{n=1}^{N} Xn ≥ 0} ≤ (4p(1 − p))^{N/2}.

First, note that having ∑_{n=1}^{N} Xn ≥ 0 is equivalent to having SN ≥ 0. Hence,

Pr{∑_{n=1}^{N} Xn ≥ 0} = Pr{SN ≥ 0} ≤ (E[e^{sX1}])^N,   s > 0.

From the given probabilities, E[e^{sX1}] = p e^{s} + (1 − p) e^{−s}, yielding

Pr{∑_{n=1}^{N} Xn ≥ 0} ≤ (p e^{s} + (1 − p) e^{−s})^N,   s > 0.

The bound can be minimized by setting e^{s} = √((1 − p)/p), yielding the desired expression. □

Appendix: Partial Justification of Central Limit Theorem

Consider IID random variables X1, X2, . . . with mean X̄ and variance σX². Let SN = (1/N) ∑_{n=1}^{N} Xn. We shall show below that the characteristic function of (SN − X̄)/(σX/√N) approaches that of the zero-mean unit-variance Gaussian PDF as N → ∞.

For convenience, let Un = (Xn − X̄)/σX. Note that Un has zero mean and unit variance. In addition, note that

W = (SN − X̄)/(σX/√N) = (1/√N) ∑_{n=1}^{N} Un

has zero mean and unit variance. Since X1, X2, . . . are IID, so are U1, U2, . . .. Let ΦU(ν) denote the characteristic function of each Un. Assume that ΦU(ν) can be expressed using the Taylor series expansion around ν = 0, i.e.

ΦU(ν) = ∑_{m=0}^{∞} (ν^m/m!) · (d^m/dν^m) ΦU(ν)|_{ν=0} = ∑_{m=0}^{∞} ((jν)^m/m!) · E[U^m]

Since U1, U2, . . . are IID,

ΦW(ν) = E[e^{jνW}] = E[e^{(jν/√N) ∑_{n=1}^{N} Un}] = E[e^{jνU1/√N} · · · e^{jνUN/√N}] = (ΦU(ν/√N))^N

Applying the Taylor series expansion of ΦU(ν/√N) around ν = 0 and the facts that E[U] = Ū = 0 and E[U²] = var[U] = 1,

ΦU(ν/√N) = 1 + ((jν/√N)/1!) E[U] + ((jν/√N)²/2!) E[U²] + RN(ν)
         = 1 − ν²/(2N) + RN(ν),


where RN(ν) is a remainder term that goes to 0 (faster than 1/N) as N → ∞.

It follows that

ln ΦW(ν) = N ln(1 − ν²/(2N) + RN(ν)).

We now use the fact that ln(1 + x) ≈ x for small x to write

lim_{N→∞} ln ΦW(ν) = −ν²/2,

or equivalently, in the limit,

ΦW(ν) = e^{−ν²/2},

which is the characteristic function of a zero-mean unit-variance Gaussian random variable. Thus, in the limit as N → ∞, W becomes a zero-mean unit-variance Gaussian random variable.

In general, the PDF of W may not approach the Gaussian PDF. However, the CDF of W will approach the Gaussian CDF, as stated previously in the central limit theorem.


Review Handout 7

1.8 Additional Discussions on Commonly Used PDFs

Chi-Square PDF

Consider a zero-mean Gaussian random variable X with variance σ². Let Y = X². Using

fY(y) = ∑_{k=1}^{K} fX(gk⁻¹(y)) · |dgk⁻¹(y)/dy|,

the PDF of Y can be written as

fY(y) = (1/(2√y)) [fX(−√y) + fX(√y)],   y ≥ 0.

Substituting fX(x) = (1/√(2πσ²)) e^{−x²/(2σ²)} and using the even property of fX(x),

fY(y) = (1/√(2πyσ²)) e^{−y/(2σ²)},   y ≥ 0

With the above PDF, Y is called a chi-square random variable with one degree of freedom. Its characteristic function is written below.

ΦY(ν) = ∫_{0}^{∞} e^{jνy} (1/√(2πyσ²)) e^{−y/(2σ²)} dy = ∫_{0}^{∞} (1/√(2πσ²)) e^{−(1−j2σ²ν)y/(2σ²)} dy/√y.

By substituting u = ((1 − j2σ²ν)y)^{1/2}, we can write

ΦY(ν) = (2/(1 − j2σ²ν)^{1/2}) ∫_{0}^{∞} (1/√(2πσ²)) e^{−u²/(2σ²)} du = 1/(1 − j2σ²ν)^{1/2},

where the last equality follows from the fact that the integral is equal to half the area under a zero-mean Gaussian PDF curve (with variance σ²).

Consider now N IID zero-mean Gaussian random variables X1, . . . , XN with variance σ². Let Z = ∑_{n=1}^{N} Xn². Then, Z is a chi-square random variable with N degrees of freedom. We find the PDF of Z by writing its characteristic function as follows. For convenience, let Yn = Xn². Note that each Yn is a chi-square random variable with one degree of freedom. Since Z is a sum of IID random variables Y1, . . . , YN, we can write ΦZ(ν) = (ΦY1(ν))^N, yielding

ΦZ(ν) = 1/(1 − j2σ²ν)^{N/2}

Recall that ΦZ(2πt) ↔ fZ(f). It can be verified through straightforward computation that the inverse Fourier transform of the following PDF yields the above characteristic function.

fZ(z) = (1/(σ^N 2^{N/2} Γ(N/2))) z^{N/2 − 1} e^{−z/(2σ²)},   z ≥ 0


where Γ(p) is the Gamma function defined as

Γ(p) = ∫_{0}^{∞} x^{p−1} e^{−x} dx,   p > 0

Below are some key properties of the Gamma function. Their proofs are left as exercises.

1. Γ(1) = 1

2. Γ(p) = (p− 1)Γ(p− 1)

3. Γ(n) = (n− 1)!, n = 1, 2, . . .

4. Γ(1/2) = √π

Finally, it should be noted that, for N = 2, a chi-square random variable with two degrees of freedom is equivalent to an exponential random variable. You should be able to verify that

E[Z] = Nσ²,   var[Z] = 2Nσ⁴.

Rayleigh PDF

Let X1 and X2 be two IID zero-mean Gaussian random variables with variance σ². Define R = √(X1² + X2²). Then, R is a Rayleigh random variable. The PDF of R is derived as follows. We first define Y = X1² + X2². It follows that Y has the exponential PDF

fY(y) = (1/(2σ²)) e^{−y/(2σ²)},   y ≥ 0.

Since R = √Y, we can write

fR(r) = fY(r²) · |d(r²)/dr|,

yielding

fR(r) = (r/σ²) e^{−r²/(2σ²)},   r ≥ 0

The mean and variance of a Rayleigh random variable are given by

E[R] = σ√(π/2),   var[R] = ((4 − π)/2) σ².

Bernoulli Distribution

A Bernoulli random variable X has the following probabilities

X = 1 with probability p, and X = 0 with probability 1 − p.

The event that X = 1 is often referred to as a "success". You should be able to verify that

E[X] = p,   var[X] = p(1 − p),   ΦX(ν) = 1 − p + p e^{jν}.


Binomial Distribution

Let X1, . . . , XN be IID Bernoulli random variables with parameter p. Then, Y = ∑_{n=1}^{N} Xn is a binomial random variable whose probabilities are given by

Pr{Y = k} = (N choose k) p^k (1 − p)^{N−k},   k = 0, 1, . . . , N

The value Pr{Y = k} gives the probability that k out of N events are "successful", where each event is successful with probability p. You should be able to verify that

E[Y] = Np,   var[Y] = Np(1 − p),   ΦY(ν) = (1 − p + p e^{jν})^N.

Geometric Distribution

Consider an experiment in which each independent trial is successful with probability p. Let X denote the number of trials required until the first success, i.e. the first X − 1 trials fail. Then, X is a geometric random variable with the following probabilities

Pr{X = k} = (1 − p)^{k−1} p,   k = 1, 2, . . .

You should be able to verify that

E[X] = 1/p,   var[X] = (1 − p)/p²,   ΦX(ν) = p e^{jν} / (1 − (1 − p) e^{jν}).

Alternatively, X can be defined as the number of failures before the first success. In this case, the probabilities of X are

Pr{X = k} = (1 − p)^k p,   k = 0, 1, . . .

You should be able to verify that

E[X] = (1 − p)/p,   var[X] = (1 − p)/p²,   ΦX(ν) = p / (1 − (1 − p) e^{jν}).

Poisson Distribution

A Poisson random variable X with parameter λ has the following probabilities.

Pr{X = k} = e^{−λ} λ^k / k!,   k = 0, 1, . . .

A Poisson random variable represents the number of arrivals in one time unit for an arrival process in which interarrival times are independent exponential random variables. You should be able to verify that

E[X] = λ,   var[X] = λ,   ΦX(ν) = exp(λ(e^{jν} − 1)).


Review Handout 8

2 Random Processes

2.1 Definition of Random Processes

Recall that a random variable is a mapping from the sample space S to the set of real numbers R. In comparison, a stochastic process or random process is a mapping from the sample space S to the set of real-valued functions called sample functions. Figure 2.1 illustrates the mapping for a random process.

Figure 2.1: Mapping from sample points in the sample space to sample functions.

We can denote a random process as {X(t), t ∈ R} to emphasize that it consists of a set of random variables, one for each time t. However, for convenience, we normally write X(t) instead of {X(t), t ∈ R} to denote a random process. The sample function for sample point s ∈ S is denoted as x(t, s). Note that, once we specify s, the process is no longer random (and hence is denoted by x(t, s) instead of X(t, s)).

Finally, a complex random process X(t) is defined as X(t) = XR(t) + jXI(t), where XR(t) and XI(t) are (real) random processes. For the sake of generality, we assume in our discussion that X(t) is a complex random process, unless explicitly stated otherwise.

2.2 Statistics of Random Processes

Recall that the value of a random process X(t) at time instant t is a random variable. We can define the mean of random process X(t) by taking the expectation of X(t) for each t, i.e.

X̄(t) = E[X(t)]


The autocorrelation function of X(t) is defined as

RX(t1, t2) = E [X(t1)X∗(t2)]

The autocovariance function of X(t) is defined as

CX(t1, t2) = E[(X(t1) − X̄(t1))(X(t2) − X̄(t2))*] = RX(t1, t2) − X̄(t1) X̄(t2)*

Similarly, the cross-correlation function of two random processes X(t) and Y(t) is defined as

RXY (t1, t2) = E [X(t1)Y∗(t2)]

The cross-covariance function of X(t) and Y (t) is defined as

CXY(t1, t2) = E[(X(t1) − X̄(t1))(Y(t2) − Ȳ(t2))*] = RXY(t1, t2) − X̄(t1) Ȳ(t2)*

By analogy with random variables, random processes X(t) and Y(t) are uncorrelated if

CXY (t1, t2) = 0 for all t1, t2 ∈ R.

and statistically independent if the joint CDF satisfies

FX(t1),...,X(tm),Y(t′1),...,Y(t′n)(x1, . . . , xm, y1, . . . , yn) = FX(t1),...,X(tm)(x1, . . . , xm) · FY(t′1),...,Y(t′n)(y1, . . . , yn)

for all m, n ∈ Z+, t1, . . . , tm, t′1, . . . , t′n ∈ R, and x1, . . . , xm, y1, . . . , yn ∈ C. As with random variables, independence implies uncorrelatedness, but the converse is not true in general.

Time Averages

The mean X̄(t) as defined above is also referred to as the ensemble average. The time average of a sample function x(t) is denoted and defined as follows.

⟨x(t)⟩ = lim_{T→∞} (1/T) ∫_{−T/2}^{T/2} x(t) dt

Similarly, the time autocorrelation function of sample function x(t) is

⟨x(t) x*(t − τ)⟩ = lim_{T→∞} (1/T) ∫_{−T/2}^{T/2} x(t) x*(t − τ) dt


2.3 Stationary, Ergodic, and Cyclostationary Processes

A random process is strict sense stationary (SSS) if, for all values of n ∈ Z+ and t1, . . . , tn, τ ∈ R, the joint CDF satisfies

FX(t1),...,X(tn)(x1, . . . , xn) = FX(t1+τ),...,X(tn+τ)(x1, . . . , xn)

for all x1, . . . , xn ∈ C. Roughly speaking, the statistics of the random process look the same at all times.

For the purpose of analyzing communication systems, it is usually sufficient to assume a stationarity condition that is weaker than SSS. In particular, a random process X(t) is wide-sense stationary (WSS) if, for all t1, t2 ∈ R,

X̄(t1) = X̄(0)   and   CX(t1, t2) = CX(t1 − t2, 0).

Roughly speaking, for a WSS random process, the first- and second-order statistics look the same at all times. Note that an SSS random process is always WSS, but the converse is not always true.

Since the autocorrelation function RX(t1, t2) of a WSS random process only depends on the time difference t1 − t2, we usually write RX(t1, t2) as a function with one argument, i.e. RX(t1 − t2). Similarly, for a WSS process, we can write the autocovariance function CX(t1, t2) as CX(t1 − t2).

A random process is ergodic if all statistical properties that are ensemble averages are equal to the corresponding time averages. An ergodic process must be SSS, but ergodicity is a stronger condition than the SSS condition, i.e. some SSS processes are not ergodic. Since all statistical properties of an ergodic process can be determined from a single sample function, each sample function of an ergodic process is representative of the entire process.

The randomly phased sinusoid and the stationary Gaussian process are examples of ergodic processes. However, a test of ergodicity for an arbitrary random process is quite difficult in general and is beyond the scope of this course. For analysis, we shall assume that the random process of interest is ergodic, unless explicitly stated otherwise.

Example 2.1 (Randomly phased sinusoid): Consider the random process X(t) defined as

X(t) = A cos(2πf0t + Φ),

where A, f0 > 0 are constants and Φ is a random variable uniformly distributed in the interval [0, 2π]. The mean of X(t) is computed as

X̄(t) = E[A cos(2πf0t + Φ)] = A ∫_{0}^{2π} (1/(2π)) cos(2πf0t + ϕ) dϕ = 0,

where the last equality follows from the fact that the integral is taken over one period of the cosine function and is hence zero.


The autocovariance function CX(t1, t2) is computed as

CX(t1, t2) = E[X(t1)X*(t2)]
           = A² E[cos(2πf0t1 + Φ) cos(2πf0t2 + Φ)]
           = A² E[(1/2) cos(2πf0(t1 − t2)) + (1/2) cos(2πf0(t1 + t2) + 2Φ)]
           = (A²/2) cos(2πf0(t1 − t2)) + (A²/2) E[cos(2πf0(t1 + t2) + 2Φ)]
           = (A²/2) cos(2πf0(t1 − t2)) + (A²/2) ∫_{0}^{2π} cos(2πf0(t1 + t2) + 2ϕ) (1/(2π)) dϕ
           = (A²/2) cos(2πf0(t1 − t2)),

where the last equality follows from the fact that the integral is taken over two periods of the cosine function and is hence zero. Since X̄(t) = 0 and CX(t1, t2) depends only on t1 − t2, X(t) is WSS. For Φ = ϕ, the time average of the sample function x(t) = A cos(2πf0t + ϕ) is

〈x(t)〉 = limT→∞

1

T

∫ T/2

−T/2

A2 cos(2πf0t + ϕ)dt = 0.

Note that the time average is equal to the ensemble average. ¤
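As a quick numerical illustration of this ergodicity (an aside added here, not part of the original notes), the following Python/NumPy sketch compares the ensemble average, taken over many independent phases, with the time average and time-average autocorrelation of a single sample function; the amplitude A, frequency f0, lag, and sample counts are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    A, f0, dt = 2.0, 5.0, 1e-3            # arbitrary amplitude, frequency (Hz), sampling step (s)
    t = np.arange(0.0, 20.0, dt)          # 20 s of samples (an integer number of periods)

    # Ensemble average at a fixed time: average over many independent phases
    phases = rng.uniform(0.0, 2.0 * np.pi, size=100_000)
    ensemble_mean = np.mean(A * np.cos(2 * np.pi * f0 * 0.3 + phases))

    # Time average of a single sample function (one fixed phase)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    x = A * np.cos(2 * np.pi * f0 * t + phi)
    print(ensemble_mean, x.mean())        # both are close to 0

    # Time-average autocorrelation at lag tau versus (A^2/2) cos(2 pi f0 tau)
    tau = 0.03
    lag = int(round(tau / dt))
    print(np.mean(x[lag:] * x[:-lag]), 0.5 * A**2 * np.cos(2 * np.pi * f0 * tau))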

Example 2.2 : Consider the random process X(t) defined as

X(t) = A cos(2πF0t + Φ),

where A > 0 is a constant, Φ is a random variable uniformly distributed in the interval [0, 2π], and F0 is a random variable independent of Φ with PDF fF0(f0). For F0 = f0, the mean of X(t) is computed as

E[X(t)|F0 = f0] = E[A cos(2πf0t + Φ)] = 0,

which follows from the previous example. It follows that

E[X(t)] = ∫_{−∞}^{∞} E[X(t)|F0 = f0] fF0(f0) df0 = 0.

The autocovariance function CX(t1, t2) is computed as

CX(t1, t2) = E[X(t1)X∗(t2)]
= ∫_0^{∞} E[X(t1)X∗(t2)|F0 = f0] fF0(f0) df0
= (A²/2) ∫_{−∞}^{∞} cos(2πf0(t1 − t2)) fF0(f0) df0,

where the last equality follows from the previous example. Since the mean of X(t) is zero and CX(t1, t2) depends only on t1 − t2, X(t) is WSS. ¤


Example 2.3 : Consider the random process defined as

X(t) = 6e^{Φt},

where Φ is a random variable uniformly distributed in [0, 2]. The ensemble average is

E[X(t)] = ∫_{−∞}^{∞} 6e^{ϕt} fΦ(ϕ) dϕ = ∫_0^{2} 6e^{ϕt} (1/2) dϕ = (3/t) e^{ϕt} |_0^2 = (3/t)(e^{2t} − 1).

Since the mean of X(t) depends on t, X(t) is not WSS. The autocorrelation function RX(t1, t2) is computed as

RX(t1, t2) = E[X(t1)X∗(t2)] = E[6e^{Φt1} · 6e^{Φt2}]
= 36 ∫_0^{2} e^{ϕ(t1+t2)} (1/2) dϕ = (18/(t1 + t2)) e^{ϕ(t1+t2)} |_0^2 = (18/(t1 + t2))(e^{2(t1+t2)} − 1). ¤


Review

Handout 9

Autocorrelation Functions of WSS Random Processes

Consider a WSS random process X(t). Recall that its autocorrelation function RX(t1, t2) can be written as RX(τ) with τ = t1 − t2. The autocorrelation function RX(τ) has properties similar to those of the autocorrelation Rx(τ) of a deterministic power or energy signal x(t).

1. RX(−τ) = R∗X(τ)

2. RX(0) ≥ 0

3. |RX(τ)| ≤ RX(0)

Example 2.4 : The first two statements are proven below.²

1. From RX(τ) = E [X(t)X∗(t− τ)],

RX(−τ) = E [X(t)X∗(t + τ)] = (E [X(t + τ)X∗(t)])∗ = R∗X(τ).

2. From RX(τ) = E [X(t)X∗(t− τ)],

RX(0) = E [X(t)X∗(t)] = E[|X(t)|2] ≥ 0,

where the last inequality follows since |X(t)|2 ≥ 0. ¤

Jointly WSS Processes

Two random processes X(t) and Y (t) are jointly WSS if their cross-correlation function satisfies

RXY (t1, t2) = RXY (t1 − t2, 0)

and can be written as RXY (τ) with τ = t1 − t2. Below are some basic properties of RXY (τ). Their proofs are omitted.

1. RXY (−τ) = R∗YX(τ)

2. |RXY (τ)| ≤ √(RX(0)RY (0))

3. |RXY (τ)| ≤ (1/2)(RX(0) + RY (0))


²The third statement is somewhat more difficult to show. To do so, we can justify the following statement (using the same argument as for the derivation of the Schwarz inequality)

|E[U(t)V ∗(t)]| ≤ √(E[|U(t)|²] E[|V (t)|²]),

and use the above inequality to establish the third statement by setting U(t) = X(t) and V (t) = X(t − τ).


2.4 Gaussian Processes

A random process X(t) is a zero-mean Gaussian process if, for all N ∈ Z+ and t1, . . . , tN ∈ R, (X(t1), . . . , X(tN)) is a zero-mean jointly Gaussian random vector. In addition, we say that X(t) is a Gaussian process if it is the sum of a zero-mean Gaussian process and some deterministic function µ(t). Note that the mean of X(t) is then µ(t).

Some important properties of a Gaussian process X(t) are listed below. The proofs are beyond the scope of this course and are omitted.

1. If we pass X(t) through an LTI filter with impulse response h(t), the output X(t)∗h(t) is a Gaussian process.

2. The statistics of X(t) are fully determined by the mean of X(t) and the covariance function CX(t1, t2).

3. We refer to a quantity of the form ∫_{−∞}^{∞} X(t)u(t) dt as an observable or linear functional of X(t). Any set of linear functionals of X(t) is jointly Gaussian.

4. A WSS Gaussian process is also SSS as well as ergodic.

2.5 Spectral Characteristics of Random Signals

An energy signal x(t) must be time-limited in the sense that |x(t)| → 0 as |t| → ∞. Thus, the statistics of such a signal cannot be time invariant. In general, we can conclude that a stationary random signal must be a power signal.

Due to their random nature, random signals may not satisfy the conditions for the existence of Fourier transforms, e.g. being absolutely integrable. However, for a WSS random signal X(t), we can talk about the power spectral density (PSD) or the power spectrum, denoted by GX(f), as the Fourier transform of its autocorrelation function RX(τ), i.e.

RX(τ) ↔ GX(f)

The PSD GX(f) is a real and nonnegative function of f. The average power of a random signal X(t) in the frequency band [f1, f2] can be computed from its PSD as

P[f1,f2] = ∫_{−f2}^{−f1} GX(f) df + ∫_{f1}^{f2} GX(f) df

For τ = 0, we have, through the inverse Fourier transform, the average power of the signal equal to

E[|X(t)|²] = RX(0) = ∫_{−∞}^{∞} GX(f) df,

which is a counterpart of Parseval's theorem for deterministic signals.

Example 2.5 : Consider again the randomly phased sinusoidal signal

X(t) = A cos(2πf0t + Φ),


where A and f0 are positive constants and Φ is uniformly distributed in [0, 2π]. Recall that RX(τ) = (A²/2) cos(2πf0τ). It follows that

GX(f) = F{RX(τ)} = (A²/4) δ(f − f0) + (A²/4) δ(f + f0).

In addition, as another illustration of the ergodicity of X(t), consider computing the time-average autocorrelation for an arbitrary sample function for Φ = ϕ as follows.

⟨x(t)x∗(t − τ)⟩ = lim_{T→∞} (1/T) ∫_{−T/2}^{T/2} A cos(2πf0t + ϕ) A cos(2πf0(t − τ) + ϕ) dt
= lim_{T→∞} (A²/2T) ∫_{−T/2}^{T/2} (cos(2πf0τ) + cos(4πf0t − 2πf0τ + 2ϕ)) dt
= (A²/2) cos(2πf0τ) + lim_{T→∞} (A²/2T) ∫_{−T/2}^{T/2} cos(4πf0t − 2πf0τ + 2ϕ) dt   (the last term is 0 as T → ∞)
= (A²/2) cos(2πf0τ)

Note that the time average is equal to the ensemble average. ¤

PSD of the Sum of Random Signals

Consider the sum of two random signals X(t) and Y (t), each of which is WSS. Let Z(t) = X(t) + Y (t). The mean of Z(t) is computed as

E[Z(t)] = E[X(t) + Y (t)] = E[X(t)] + E[Y (t)] = E[X(0)] + E[Y (0)] = E[X(0) + Y (0)] = E[Z(0)],

where we have applied the WSS properties of X(t) and Y (t) to write E[X(t)] = E[X(0)] and E[Y (t)] = E[Y (0)]. The autocorrelation function of Z(t) is computed below.

RZ(t1, t2) = E[Z(t1)Z∗(t2)]
= E[(X(t1) + Y (t1))(X∗(t2) + Y ∗(t2))]
= E[X(t1)X∗(t2)] + E[Y (t1)Y ∗(t2)] + E[X(t1)Y ∗(t2)] + E[Y (t1)X∗(t2)]
= RX(t1, t2) + RY (t1, t2) + RXY (t1, t2) + RY X(t1, t2).

Since X(t) and Y (t) are WSS as well as jointly WSS,

RZ(t1, t2) = RX(t1 − t2) + RY (t1 − t2) + RXY (t1 − t2) + RY X(t1 − t2).

Note that RZ(t1, t2) depends only on t1 − t2. It follows that Z(t) is also WSS. Consequently, we can write

RZ(τ) = RX(τ) + RY (τ) + RXY (τ) + RY X(τ)

which in the frequency domain becomes

GZ(f) = GX(f) + GY (f) + GXY (f) + GY X(f)

3

Page 41: 1.1 Random Variables, Probability Distributions, and Proba ...rrajathe/web/LN_0.pdf · 1.1.3 Probability Functions By using a random variable X, we can deflne numerical-valued events

where we define the cross PSDs such that

RXY (τ) ↔ GXY (f) and RY X(τ) ↔ GY X(f).

If X(t) and Y (t) have zero mean and are uncorrelated, then RXY (τ) = RY X(τ) = 0 for all τ, yielding

RZ(τ) = RX(τ) + RY (τ).

In terms of the PSD, GZ(f) = GX(f) + GY (f).

Thus, for zero-mean uncorrelated jointly WSS random signals, superposition holds for the autocorrelation function as well as for the PSD.
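The following NumPy sketch (an illustration added here, not from the original notes) checks this superposition numerically with averaged periodograms of two independent zero-mean discrete-time signals; the block length, number of blocks, and the particular signals are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    N, K = 4096, 500                       # block length and number of blocks to average

    def psd_est(blocks):
        # averaged periodogram over the rows: (1/N) |FFT|^2, averaged over K blocks
        return np.mean(np.abs(np.fft.fft(blocks, axis=1)) ** 2, axis=0) / N

    x = rng.normal(0.0, 1.0, size=(K, N))        # a zero-mean white signal
    w = rng.normal(0.0, 2.0, size=(K, N))
    y = 0.5 * (w + np.roll(w, 1, axis=1))        # an independent zero-mean colored signal
    z = x + y

    Gx, Gy, Gz = psd_est(x), psd_est(y), psd_est(z)
    # GZ should equal GX + GY up to estimation error (a few percent here)
    print(np.mean(np.abs(Gz - (Gx + Gy))) / np.mean(Gx + Gy))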

PSD of the Product of Random Signals

We first show that, for independent WSS random signals X(t) and Y (t), the signal Z(t) = X(t)Y (t) is also WSS and satisfies

RZ(τ) = RX(τ)RY (τ)  ↔  GZ(f) = GX(f) ∗ GY (f)

Proof: Using the independence between X(t) and Y (t), the mean of Z(t) is written as

E[Z(t)] = E[X(t)Y (t)] = E[X(t)]E[Y (t)] = E[X(0)]E[Y (0)] = E[X(0)Y (0)] = E[Z(0)].

Using the independence between X(t) and Y (t) and their WSS properties, the autocorrelation function is written as

RZ(t1, t2) = E[X(t1)Y (t1)X∗(t2)Y ∗(t2)] = E[X(t1)X∗(t2)] E[Y (t1)Y ∗(t2)]
= RX(t1, t2)RY (t1, t2) = RX(t1 − t2)RY (t1 − t2).

Note that RZ(t1, t2) depends only on t1 − t2. It follows that Z(t) is also WSS. Consequently, we can write RZ(τ) = RX(τ)RY (τ).

Since multiplication in the time domain corresponds to convolution in the frequency domain, we can write GZ(f) = GX(f) ∗ GY (f). ¤

Example 2.6 (Modulated random signal): Consider the modulated random signal

Y (t) = X(t) cos(2πf0t + Φ),

where X(t) is a WSS random signal while the random phase Φ is uniformly distributed in [0, 2π] and is independent of X(t).

Recall that U(t) = cos(2πf0t + Φ) is WSS with the autocorrelation function RU(τ) = (1/2) cos(2πf0τ). It follows that Y (t) is WSS with the following autocorrelation function.

RY (τ) = RX(τ) · (1/2) cos(2πf0τ) = (1/2) RX(τ) cos(2πf0τ)

In the frequency domain,

GY (f) = (1/2) GX(f) ∗ ((1/2) δ(f − f0) + (1/2) δ(f + f0)) = (1/4) GX(f − f0) + (1/4) GX(f + f0). ¤
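A numerical illustration (added here, not from the original notes): the NumPy sketch below modulates a lowpass WSS signal by a random-phase carrier and checks, via averaged periodograms, that the output spectrum is shifted to ±f0 with one quarter of the input level; the sampling rate, carrier frequency, and smoothing filter are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(2)
    fs, N, K = 1000.0, 4096, 200             # sampling rate (Hz), block length, number of blocks
    f0 = 200.0                               # carrier frequency (Hz)
    t = np.arange(N) / fs
    freqs = np.fft.fftfreq(N, d=1.0 / fs)

    def psd_est(blocks):
        # averaged periodogram, with a 1/(N*fs) normalization
        return np.mean(np.abs(np.fft.fft(blocks, axis=1)) ** 2, axis=0) / (N * fs)

    # Lowpass WSS signal X(t): white noise smoothed by a short moving average
    w = rng.normal(0.0, 1.0, size=(K, N))
    kernel = np.ones(16) / 16.0
    x = np.array([np.convolve(row, kernel, mode="same") for row in w])

    # Independent uniform random phase for each block, as in the example
    phi = rng.uniform(0.0, 2.0 * np.pi, size=(K, 1))
    y = x * np.cos(2 * np.pi * f0 * t + phi)

    Gx, Gy = psd_est(x), psd_est(y)
    k0 = int(round(f0 * N / fs))             # FFT bin closest to the carrier frequency
    print(abs(freqs[np.argmax(Gy)]))         # the output spectrum peaks near f0
    print(Gy[k0], Gx[0] / 4.0)               # GY(f0) is about GX(0)/4 for a lowpass X(t)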


Review

Handout 10

2.6 Random Signals and LTI Systems

Consider passing a random signal X(t) through an LTI filter whose impulse response is h(t). The output, given below, is also a random signal.

Y (t) = ∫_{−∞}^{∞} h(τ)X(t − τ) dτ

Properties of Y (t)

1. Mean, autocorrelation, and PSD: The mean of Y (t) is computed below.

E[Y (t)] = E[∫_{−∞}^{∞} h(τ)X(t − τ) dτ]

If X(t) is WSS, then

E[Y (t)] = ∫_{−∞}^{∞} h(τ)E[X(t − τ)] dτ = E[X(0)] ∫_{−∞}^{∞} h(τ) dτ = H(0) E[X(0)].

Assuming that X(t) is WSS, the autocorrelation function of Y (t) is computed below.

RY (τ) = E[(∫_{−∞}^{∞} h(η)X(τ − η) dη)(∫_{−∞}^{∞} h∗(−ξ)X∗(ξ) dξ)]
= E[∫_{−∞}^{∞} ∫_{−∞}^{∞} h(η)h∗(−ξ)X(τ − η)X∗(ξ) dξ dη]
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(η)h∗(−ξ) E[X(τ − η)X∗(ξ)] dξ dη
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(η)h∗(−ξ) RX(τ − η − ξ) dξ dη
= ∫_{−∞}^{∞} h(η) (∫_{−∞}^{∞} h∗(−ξ) RX(τ − η − ξ) dξ) dη   (the inner integral is z(τ − η), where z(τ) = h∗(−τ) ∗ RX(τ))
= ∫_{−∞}^{∞} h(η) z(τ − η) dη
= h(τ) ∗ z(τ) = h(τ) ∗ h∗(−τ) ∗ RX(τ)

In the frequency domain, the PSD of Y (t) is given below.

GY (f) = H(f)H∗(f)GX(f) = |H(f)|2GX(f)



For an ergodic process, RY (0) yields the average power of a filtered random signal, i.e.

P = RY (0) = ∫_{−∞}^{∞} |H(f)|² GX(f) df.

In summary, for the filtered output process,

RY (τ) = h(τ) ∗ h∗(−τ) ∗ RX(τ)
GY (f) = |H(f)|² GX(f)
P = RY (0) = ∫_{−∞}^{∞} |H(f)|² GX(f) df

2. Stationarity: If the input X(t) is WSS, then the output Y (t) is also WSS. In addition, if X(t) is SSS, so is Y (t).

3. PDF: In general, it is difficult to determine the PDF of the output, even when the PDF of the input signal is completely specified.

However, when the input is a Gaussian process, the output is also a Gaussian process. The statistics of the output process are fully determined by the mean function and the autocovariance function.

Example 2.7 : Consider the LTI system whose input x(t) and output y(t) are related by

y(t) = x(t) + ax(t− T ).

The corresponding impulse response is

h(t) = δ(t) + aδ(t− T ).

The corresponding frequency response is

H(f) = 1 + ae−j2πfT .

If X(t) is WSS with PSD GX(f), then the output PSD GY (f) is

GY (f) = |H(f)|2GX(f) = (1 + a2 + 2a cos(2πfT ))GX(f). ¤
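The relation GY (f) = |H(f)|²GX(f) in this example is easy to check numerically. The NumPy sketch below (added here, not from the original notes) applies a discrete-time version of y(t) = x(t) + a x(t − T) to white noise and compares the averaged periodogram of the output with (1 + a² + 2a cos(2πfT)) times that of the input; the gain a, the delay, and the block sizes are arbitrary choices, and the delay is implemented circularly so the comparison is exact up to floating-point error.

    import numpy as np

    rng = np.random.default_rng(3)
    fs, N, K = 1000.0, 4096, 200               # sampling rate (Hz), block length, number of blocks
    a, D = 0.5, 25                             # gain a and delay T = D/fs
    freqs = np.fft.fftfreq(N, d=1.0 / fs)

    x = rng.normal(0.0, 1.0, size=(K, N))      # white input, so GX(f) is flat
    y = x + a * np.roll(x, D, axis=1)          # circular discrete-time version of y(t) = x(t) + a x(t - T)

    Gx = np.mean(np.abs(np.fft.fft(x, axis=1)) ** 2, axis=0) / (N * fs)
    Gy = np.mean(np.abs(np.fft.fft(y, axis=1)) ** 2, axis=0) / (N * fs)

    H2 = 1 + a**2 + 2 * a * np.cos(2 * np.pi * freqs * D / fs)   # |H(f)|^2 from Example 2.7
    print(np.allclose(Gy, H2 * Gx))            # True: the circular delay makes the identity exact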

†Power Spectrum Estimation

One problem that is often encountered in practice is to estimate the PSD of a random signal x(t) when only a segment of length T of a single sample function is available.

Let us consider a single sample function of an ergodic random process x(t). Its truncated version is given as

xT (t) = x(t) for |t| ≤ T/2, and xT (t) = 0 otherwise.


Since xT (t) is strictly time-limited, its Fourier transform XT (f) exists. An alternative definition of the PSD of X(t) is stated as

GX(f) = lim_{T→∞} (1/T) E[|XT (f)|²].

A “natural” estimate of the PSD can be found by simply omitting the limiting and expectation operations to obtain

GX(f) = (1/T) |XT (f)|².

This spectral estimate is called a periodogram. In practice, spectral estimation based on a periodogram consists of the following steps (a short numerical sketch follows the list).

1. Form a discrete-time version of xT (t) by sampling x(t) with a sufficiently high sampling rate.

2. Compute a discrete-frequency version of XT (f) by using the fast Fourier transform (FFT) algorithm.

3. Compute the spectral estimate by squaring the magnitudes of the samples of XT (f) and dividing them by the number of samples.
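A minimal NumPy sketch of these three steps is given below (added here, not from the original notes); the sampling rate, segment length, and the synthetic test signal are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(4)
    fs = 1000.0                           # assumed sampling rate (Hz)
    N = 8192                              # number of samples in the observed segment

    # Step 1: a sampled segment of a (here synthetic) random signal:
    #         a randomly phased sinusoid at 100 Hz plus white noise
    t = np.arange(N) / fs
    x = np.cos(2 * np.pi * 100.0 * t + rng.uniform(0, 2 * np.pi)) + 0.5 * rng.normal(size=N)

    # Step 2: FFT of the segment
    X = np.fft.fft(x)

    # Step 3: squared magnitudes divided by the number of samples
    G_hat = np.abs(X) ** 2 / N            # the periodogram (up to a 1/fs scale factor)

    freqs = np.fft.fftfreq(N, d=1.0 / fs)
    print(abs(freqs[np.argmax(G_hat)]))   # the peak appears near the 100 Hz component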

2.7 Noise Processes

The term noise is used to designate unwanted signals that corrupt the desired signal in a communication system. According to the source of noise, we can divide noise into external noise (e.g. atmospheric noise, noise from power lines) and internal noise from the communication system itself.

The category of internal noise includes an important class of noise that arises due to spontaneous fluctuations of current or voltage in electrical circuits. This kind of noise is always present in all communication systems, and represents the basic limitation on the detection of transmitted signals. The two most common types of spontaneous fluctuations in electrical circuits are thermal noise and shot noise.

Thermal Noise

• It is due to random motion of electrons in any conductor.

• It has a Gaussian PDF according to the central limit theorem. Note that the number of electrons involved is quite large, with their motions statistically independent from one another.

• The noise voltage (in V) across the terminals of a resistor with resistance R (in Ω) has a Gaussian PDF with mean and variance, denoted by µ and σ², given by

µ = 0,   σ² = 2(πkT)²R/(3h),

where k is Boltzmann's constant ≈ 1.38 × 10⁻²³ J/K, h is Planck's constant ≈ 6.63 × 10⁻³⁴ J·s, and T is the absolute temperature in K.

3

Page 45: 1.1 Random Variables, Probability Distributions, and Proba ...rrajathe/web/LN_0.pdf · 1.1.3 Probability Functions By using a random variable X, we can deflne numerical-valued events

The noise PSD (in V²/Hz) is

GN(f) = 2Rh|f| / (e^{h|f|/kT} − 1).

With |f| ≪ kT/h, we have e^{h|f|/kT} ≈ 1 + h|f|/kT and

GN(f) ≈ 2kTR

For T = 273-373 K (0-100 degrees Celsius), kT/h is on the order of 10¹² Hz. Thus, for all practical purposes, the PSD of thermal noise is constant.
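For a feel for the numbers (an added aside, not from the original notes), the short computation below evaluates kT/h, the flat PSD level 2kTR, and the total variance 2(πkT)²R/(3h) for an illustrative resistor at room temperature.

    import numpy as np

    k = 1.38e-23          # Boltzmann's constant (J/K)
    h = 6.63e-34          # Planck's constant (J*s)
    T = 290.0             # an illustrative absolute temperature (K)
    R = 1e3               # an illustrative resistance (Ohm)

    print(k * T / h)                               # ~6e12 Hz, far above typical signal bands
    print(2 * k * T * R)                           # flat PSD level 2kTR (V^2/Hz)
    print(2 * (np.pi * k * T) ** 2 * R / (3 * h))  # total variance sigma^2 (V^2)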

Shot Noise

• It is associated with the discrete flow of charge carriers across semiconductor junctions or with the emission of electrons from a cathode.

• Shot noise has a Gaussian PDF with zero mean according to the central limit theorem.

• Shot noise has a constant power spectrum, with the noise level being independent of the temperature.

White Noise

Several types of noise sources have constant PSDs over a wide range of frequencies. Such a noise source is called white noise by analogy with white light, which contains all the frequencies of visible light.

In general, we write the PSD of white noise as

GN(f) = N0/2,

where the factor 1/2 is included to indicate that half of the power is associated with positive frequencies while the other half is associated with negative frequencies, so that the power passed by an ideal bandpass filter with bandwidth B is given by N0B. The corresponding autocorrelation function is

RN(τ) = (N0/2) δ(τ).

NOTE: White noise is not necessarily Gaussian noise. Conversely, Gaussian noise is not necessarily white noise.

Consider now a sample of a zero-mean white noise process N(t). The variance of the sample is

E[|N(t)|2] = RN(0) = ∞.

Therefore, white noise has infinite power.


Filtered White Noise

Consider now filtered white noise corresponding to the ideal band-limited filter, i.e.

GN(f) = N0/2 for |f| ≤ B, and GN(f) = 0 otherwise.

The filtered noise has the autocorrelation function

RN(τ) = F⁻¹{GN(f)} = N0B sinc(2Bτ).

It follows that a sample of band-limited zero-mean white Gaussian noise is a zero-mean Gaussian random variable with variance

E[|N(t)|2] = RN(0) = N0B,

which is also equal to the noise power. More generally, if we pass white noise through an LTI filter with frequency response

H(f), then the filtered noise has the PSD

GN(f) = |H(f)|² N0/2,

and is referred to as colored noise, again by analogy with colored light, which contains only some of the frequencies of visible light.
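The fact that ideal band-limiting of white noise gives power N0B can be checked numerically. The NumPy sketch below (added here, not from the original notes) generates discrete-time white Gaussian noise with two-sided PSD N0/2, applies an ideal lowpass mask in the frequency domain, and compares the resulting sample variance with N0B; the values of fs, N0, and B are arbitrary.

    import numpy as np

    rng = np.random.default_rng(5)
    fs, N = 1.0e6, 2 ** 18                 # sampling rate (Hz) and number of samples
    N0 = 4e-6                              # so the two-sided PSD level is N0/2 = 2e-6
    B = 50e3                               # ideal lowpass bandwidth (Hz)

    # Discrete-time white noise with two-sided PSD N0/2 over |f| < fs/2:
    # the per-sample variance is (N0/2) * fs
    n = rng.normal(0.0, np.sqrt(N0 / 2 * fs), size=N)

    # Ideal band-limiting to |f| <= B via an FFT mask
    freqs = np.fft.fftfreq(N, d=1.0 / fs)
    Nf = np.fft.fft(n)
    Nf[np.abs(freqs) > B] = 0.0
    n_bl = np.fft.ifft(Nf).real

    print(n_bl.var(), N0 * B)              # the filtered noise power is approximately N0*B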

2.8 Noise Equivalent Bandwidth

There is no unique definition for the bandwidth of a signal or for the bandwidth of a nonideal filter. One commonly used definition for the bandwidth of a lowpass filter (LPF) is the noise equivalent bandwidth.

When zero-mean white noise with PSD N0/2 is passed through an LTI filter with frequency response H(f), the average power of the filtered noise is

RN(0) = (N0/2) ∫_{−∞}^{∞} |H(f)|² df.

On the other hand, the average output power of an ideal LPF with the same DC gain |H(0)| and bandwidth B is given by

RN(0) = N0B|H(0)|2.

By equating these two noise powers, we can define the noise equivalent bandwidth of an arbitrary LPF as

BN = (∫_{−∞}^{∞} |H(f)|² df) / (2|H(0)|²)

Thus, the noise equivalent bandwidth of an arbitrary LPF is the bandwidth of the ideal LPF (with the same DC gain) that produces the same output power for the same white noise input. The definition can also be extended to bandpass filters in the same fashion.


Example 2.8 : Consider an LPF based on an RC circuit with the frequency response

H(f) = 1 / (1 + jf/f0),

where f0 = 1/(2πRC). Since H(0) = 1,

BN = (1/2) ∫_{−∞}^{∞} |H(f)|² df = (1/2) ∫_{−∞}^{∞} 1/(1 + f²/f0²) df = ∫_0^{∞} 1/(1 + f²/f0²) df.

Setting z = f/f0 yields

BN = f0 ∫_0^{∞} 1/(1 + z²) dz = f0 · arctan z |_0^{∞} = f0 · π/2 = 1/(4RC).

The corresponding noise power, using the thermal noise level N0/2 = 2kTR, is

RN(0) = N0BN = 4kTR/(4RC) = kT/C. ¤
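As a numerical cross-check of this example (added here, not from the original notes), the sketch below integrates |H(f)|² for an RC lowpass filter with illustrative component values and compares the resulting noise equivalent bandwidth with 1/(4RC).

    import numpy as np

    R, C = 10e3, 1e-9                        # illustrative values: R = 10 kOhm, C = 1 nF
    f0 = 1.0 / (2 * np.pi * R * C)           # 3-dB frequency of the RC lowpass filter

    f = np.linspace(-1000 * f0, 1000 * f0, 2_000_001)
    H2 = 1.0 / (1.0 + (f / f0) ** 2)         # |H(f)|^2 for H(f) = 1/(1 + j f/f0)
    df = f[1] - f[0]

    BN_numeric = np.sum(H2) * df / 2.0       # B_N = int |H|^2 df / (2 |H(0)|^2), with H(0) = 1
    print(BN_numeric, 1.0 / (4 * R * C))     # both are approximately 25 kHz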

2.8.1 Baseband Communication Model with Additive Noise

Consider a linear communication system that does not include modulation. Such a system is called a baseband communication system. Noise in signal transmission often has an additive effect.

For modeling purposes, we typically combine all noise sources into a single additive noise source located at the receiver input. Accordingly, we make two assumptions on the noise characteristics.

1. The noise process is an ergodic process with zero mean and PSD equal to N0/2.

2. The noise process is uncorrelated with the transmitted signal.

Accordingly, the received signal Y (t) is the sum of the transmitted signal X(t) and the noise N(t), i.e.

Y (t) = X(t) + N(t).

Since X(t) and N(t) are uncorrelated, we have superposition of signal powers, i.e.

RY (0) = RX(0) + RN(0), or equivalently, E[|Y (t)|²] = E[|X(t)|²] + E[|N(t)|²].

Define the signal power and the noise power at the receiver as

S = E[|X(t)|²] and N = E[|N(t)|²].

In addition, the signal-to-noise ratio (SNR) is defined as

SNR = S/N.

The SNR is an important measure of the degree to which the transmitted signal is contaminated with additive noise.

In the case of white noise with PSD N0/2, the noise power at the receiver output with power gain GR and noise equivalent bandwidth BN is given by

N = GRN0BN .

Typically, for analytical purposes, it is assumed that additive noise is white and Gaussian, and is referred to as additive white Gaussian noise (AWGN).


Review

Handout 11

3 Digital Communication Basics

AWGN Channel

The additive white Gaussian noise (AWGN) channel is the simplest practical mathematical model for describing a communication channel. This model is based on the following assumptions.

1. The channel bandwidth is unlimited.

2. The channel attenuation a is time-invariant and constant over all frequencies of interest.

3. The channel delays the signal by a constant amount td.

4. The channel adds zero-mean white Gaussian noise N(t) to the transmitted signal. In addition, this noise is uncorrelated with the transmitted signal.

The first three assumptions indicate that the channel is distortionless over the message bandwidth W. The response Y (t) of an AWGN channel to a transmitted signal X(t) is given by

Y (t) = aX(t− td) + N(t).

If the transmitted signal X(t) has average power SX and message bandwidth W, and the receiver includes an ideal lowpass filter with bandwidth of exactly W, the power at the output of the receiver filter is given by

E[|Y (t)|²] = E[|aX(t − td)|²] + E[|N(t)|²] = a²SX + N0W.

The corresponding signal-to-noise ratio (SNR) is given by

SNR = a²SX / (N0W).

Matched Filter

Consider the problem of detecting whether a pulse of a known shape p(t) has been transmitted or not. Thus, the output of the AWGN channel is given either by

Y (t) = ap(t− td) + N(t)

or by

Y (t) = N(t).

Without loss of generality, assume that a = 1 and td = 0 in what follows. Assume that the receiver structure in figure 3.1 is used.




Figure 3.1: Receiver structure for pulse detection.

In addition, we base our decision about the presence or the absence of p(t) on the output Y (t) of the receiver filter h(t) sampled at time instant t = t0. More specifically, if the pulse is present,

Y (t0) = ∫_{−∞}^{∞} h(t0 − τ)Y (τ) dτ
= ∫_{−∞}^{∞} h(t0 − τ)p(τ) dτ + ∫_{−∞}^{∞} h(t0 − τ)N(τ) dτ
= p(t0) + N(t0),

where p(t) and N(t) here denote the filtered pulse and the filtered noise, respectively. The key question here is as follows: What is the optimal impulse response of the receiver filter? Intuitively, the optimal filter (in terms of minimizing the decision error probability) should maximize the SNR at t = t0. This SNR can be written as

SNR = |p(t0)|² / E[|N(t0)|²] = |∫_{−∞}^{∞} H(f)P (f)e^{j2πft0} df|² / ∫_{−∞}^{∞} |H(f)|² GN(f) df.

Using the Schwarz inequality, the SNR can be upper-bounded as follows.

SNR = |∫_{−∞}^{∞} H(f)√GN(f) · (P (f)/√GN(f)) e^{j2πft0} df|² / ∫_{−∞}^{∞} |H(f)|² GN(f) df ≤ ∫_{−∞}^{∞} (|P (f)|²/GN(f)) df.

The above inequality becomes equality when

H(f) = K (P ∗(f)/GN(f)) e^{−j2πft0},

where K is an arbitrary constant. Note that the optimal filter amplifies the frequency components of the signal and attenuates the frequency components of the noise.

In the case of white noise with GN(f) = N0/2, we can write

H(f) = K (P ∗(f)/(N0/2)) e^{−j2πft0}.

In the time domain,

h(t) = (2K/N0) p∗(t0 − t).


Thus, the optimal impulse response is determined by the pulse shape. In particular, the optimal impulse response is matched to the pulse shape. For this reason, this optimal filter is called a matched filter.

Assume that the pulse p(t) is nonzero only in the interval [0, T ]. Substituting the expression of h(t) into the expression for Y (t0) yields

Y (t0) = ∫_{−∞}^{∞} h(t0 − τ)Y (τ) dτ = (2K/N0) ∫_0^{T} p∗(τ)Y (τ) dτ.

Note that Y (t0) is the correlation between the transmitted pulse p(t) and the received signal Y (t). The result indicates that we can implement this optimal filtering as a correlation receiver, as illustrated in figure 3.2.


Figure 3.2: Structure of a correlation receiver.

When p(t) is a rectangular pulse of duration T, the correlation filter is equivalent to the integrate-and-dump (I&D) filter, i.e.

Y (t0) = (2K/N0) ∫_0^{T} Y (τ) dτ.

In practical communication systems, we may not transmit information by using this kind of on/off system, where a pulse is present or absent. Instead, we may embed the information in the amplitude of the transmitted pulse; this technique is referred to as pulse amplitude modulation (PAM). An alternative is to use a set of pulses, with the information embedded in the choice of these pulses.

When using multiple pulse shapes, at the receiver we need one matched filter (or correlator) for each possible pulse shape. We can then compute the outputs of these filters and decide that the pulse whose filter output is maximum was sent.

Finally, it should be pointed out that the optimality of the matched filter was derived for a distortionless channel. For distorting channels, the matched filter must follow the distorted pulse shape, which is difficult in practice. Instead of changing the matched filter, we can perform signal processing to remove the effects of channel distortion; such processing is referred to as equalization.
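To make the correlation receiver concrete, the NumPy sketch below (added here, not from the original notes) simulates detection of a known pulse in white Gaussian noise by discretizing the statistic ∫_0^T p∗(τ)Y (τ) dτ; the pulse shape, sampling rate, and noise level are arbitrary choices, and the constant 2K/N0 is dropped because it does not affect the decision.

    import numpy as np

    rng = np.random.default_rng(6)
    fs = 1.0e4                                 # sampling rate used to discretize the integral (Hz)
    T = 0.01                                   # pulse duration (s)
    t = np.arange(int(T * fs)) / fs
    p = np.sin(2 * np.pi * 300.0 * t)          # an arbitrary known pulse shape on [0, T]
    sigma2 = 0.5                               # per-sample noise variance
    E = np.sum(p ** 2) / fs                    # pulse energy, int_0^T |p(t)|^2 dt

    def correlator(pulse_present):
        # received samples on [0, T]; the sum approximates int_0^T p*(tau) Y(tau) d tau
        y = (p if pulse_present else 0.0) + rng.normal(0.0, np.sqrt(sigma2), size=p.size)
        return np.sum(p * y) / fs

    present = np.array([correlator(True) for _ in range(20_000)])
    absent = np.array([correlator(False) for _ in range(20_000)])
    print(present.mean(), E)                   # concentrates around the pulse energy E when the pulse is sent
    print(absent.mean(), absent.std())         # and around 0, with the same spread, when it is not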


Review

Handout 12

Coherent Detection of Binary Signals in AWGN Channel

In digital communications, detection refers to the decision regarding what data symbol has been transmitted. In matched filter based detection, it is assumed that the receiver has complete knowledge of the set of possible transmitted signals, and especially their timing. Such detection is called coherent detection.

Consider a scenario in which one data bit is transmitted using one of two signals p1(t) and p2(t) with finite duration in [0, T ] and with equal energy E, i.e.

E = E1 = E2,  where E1 = ∫_0^{T} p1(t)p∗1(t) dt and E2 = ∫_0^{T} p2(t)p∗2(t) dt.

Define the correlation coefficient ρ between p1(t) and p2(t) as

ρ = (1/√(E1E2)) ∫_0^{T} p1(t)p∗2(t) dt = (1/E) ∫_0^{T} p1(t)p∗2(t) dt.

The receiver consists of two matched filters and is shown in figure 2.3. In particular, the receiver decides that p1(t) was transmitted if the decision parameter Z1 is greater than Z2, and vice versa.


Figure 2.3: Receiver structure for binary detection.

Given that p1(t) is transmitted, the outputs of the two matched filters are

Z1 = ∫_0^{T} Y (t)p∗1(t) dt = ∫_0^{T} p1(t)p∗1(t) dt + ∫_0^{T} N(t)p∗1(t) dt = E + N1,

Z2 = ∫_0^{T} Y (t)p∗2(t) dt = ∫_0^{T} p1(t)p∗2(t) dt + ∫_0^{T} N(t)p∗2(t) dt = ρE + N2,

where N1 = ∫_0^{T} N(t)p∗1(t) dt and N2 = ∫_0^{T} N(t)p∗2(t) dt.

In addition, given that p1(t) is transmitted, a detection error occurs when Z2 > Z1, or equivalently Z = Z2 − Z1 > 0.



When N(t) is zero-mean Gaussian noise, N1 and N2 are jointly Gaussian random variables. We compute the mean and the variance of N1 below.

E[N1] = E[∫_0^{T} N(t)p∗1(t) dt] = ∫_0^{T} E[N(t)]p∗1(t) dt = 0,

var[N1] = E[(∫_0^{T} N(τ)p∗1(τ) dτ)(∫_0^{T} N(η)p∗1(η) dη)∗]
= ∫_0^{T} ∫_0^{T} E[N(τ)N∗(η)] p∗1(τ)p1(η) dτ dη
= ∫_0^{T} ∫_0^{T} (N0/2) δ(τ − η) p∗1(τ)p1(η) dτ dη
= (N0/2) ∫_0^{T} p∗1(η)p1(η) dη = EN0/2.

Similarly, N2 has mean 0 and variance EN0/2. The covariance between N1 and N2 is computed as follows.

E[N1N∗2] = E[(∫_0^{T} N(τ)p∗1(τ) dτ)(∫_0^{T} N(η)p∗2(η) dη)∗]
= ∫_0^{T} ∫_0^{T} E[N(τ)N∗(η)] p∗1(τ)p2(η) dτ dη
= ∫_0^{T} ∫_0^{T} (N0/2) δ(τ − η) p∗1(τ)p2(η) dτ dη
= (N0/2) ∫_0^{T} p∗1(η)p2(η) dη = ρ∗EN0/2.

Since Z1 = E + N1 and Z2 = ρE + N2, we compute E[Z] to be

E[Z] = E[Z2]− E[Z1] = (ρ− 1)E.

We next compute var[Z]. Note that Z − E[Z] = (ρE + N2 − E − N1) − (ρ − 1)E = N2 − N1. It follows that

var[Z] = E[|N2 − N1|²]
= E[|N2|²] + E[|N1|²] − E[N1N∗2] − E[N2N∗1]
= EN0/2 + EN0/2 − ρ∗EN0/2 − ρEN0/2 = (1 − Re{ρ})EN0 = (1 − ρ)EN0,

where the last equality follows from the practical assumption that p1(t) and p2(t) are real, and hence ρ is real.


Therefore, given that p1(t) is transmitted, the probability of detection error is

Pr{Z > 0 | p1(t)} = Pr{ (Z − (ρ − 1)E)/√((1 − ρ)EN0) > (1 − ρ)E/√((1 − ρ)EN0) | p1(t) } = Q(√((1 − ρ)E/N0)),

where (Z − (ρ − 1)E)/√((1 − ρ)EN0) is a zero-mean, unit-variance Gaussian random variable.

By symmetry, given that p2(t) is transmitted, the probability of detection error is the same. In summary, the overall bit error probability is

Pe = Q(√((1 − ρ)E/N0)).

For ergodic systems, the bit error probability is equal to the bit error rate (BER), which is a key performance measure of a digital communication system. The BER is the long-run fraction of bit errors in an indefinitely long sequence of transmitted bits.

It is customary to describe the performance of a digital communication system by plotting the BER against the ratio Eb/N0, where Eb is the average energy used per transmitted bit. Significant comparisons among different communication systems are possible using such plots. As a specific example, we shall compare two scenarios of binary detection discussed above.

1. Antipodal signals : p2(t) = −p1(t). In this case, ρ = −1.

2. Orthogonal signals : ρ = 0

It follows that

Pe^antipodal = Q(√(2E/N0)),   Pe^orthogonal = Q(√(E/N0))

Figure 2.4 indicates that antipodal signals perform better than orthogonal signals. In particular, for the same BER, orthogonal signals require 3 dB more energy per bit than antipodal signals. In other words, there is a 3-dB penalty in terms of the signal energy.
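The two error-probability formulas and the 3-dB gap are easy to reproduce numerically. The Python sketch below (added here, not from the original notes) evaluates Q(x) = (1/2) erfc(x/√2), tabulates both BER expressions for a few Eb/N0 values (here Eb = E, one bit per signal), and runs a small Monte Carlo check for the antipodal case; the chosen Eb/N0 points and sample counts are arbitrary.

    import numpy as np
    from math import erfc, sqrt

    def Q(x):
        # Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))
        return 0.5 * erfc(x / sqrt(2.0))

    for ebn0_db in (0, 4, 8, 12):
        ebn0 = 10.0 ** (ebn0_db / 10.0)
        print(ebn0_db, Q(sqrt(2.0 * ebn0)), Q(sqrt(ebn0)))   # antipodal (rho = -1), orthogonal (rho = 0)

    # The orthogonal curve at Eb/N0 + 3 dB matches the antipodal curve at Eb/N0:
    print(Q(sqrt(2.0 * 10 ** 0.8)), Q(sqrt(10 ** (0.8 + 0.30103))))

    # Quick Monte Carlo check for antipodal signalling at Eb/N0 = 4 dB
    rng = np.random.default_rng(7)
    ebn0 = 10 ** 0.4
    z = np.sqrt(2 * ebn0) + rng.normal(size=2_000_000)   # normalized decision statistic, unit variance
    print(np.mean(z < 0), Q(sqrt(2 * ebn0)))             # empirical versus theoretical BER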

†Appendix: Wiener-Khinchine Theorem

Recall that the PSD is defined as the Fourier transform of the autocorrelation function. The Wiener-Khinchine theorem states that the PSD is indeed equal to the following quantity, which was previously mentioned as an alternative definition of the PSD.

GX(f) = lim_{T→∞} (1/T) E[|XT (f)|²],


Figure 2.4: BERs for antipodal and orthogonal signals (log10 of the BER plotted against Eb/N0 in dB, for both orthogonal and antipodal signalling).

where XT (f) is the Fourier transform of the truncation xT (t) of the sample function x(t), i.e.

xT (t) = x(t) for |t| ≤ T/2, and xT (t) = 0 otherwise.

Wiener-Khinchine theorem: For a WSS process X(t),

GX(f) = F{RX(τ)},

provided that ∫_{−∞}^{∞} |τRX(τ)| dτ < ∞.

Proof: We first write²

E[|XT (f)|²] = E[XT (f)X∗T (f)]
= E[(∫_{−T/2}^{T/2} x(η)e^{−j2πfη} dη)(∫_{−T/2}^{T/2} x(ξ)e^{−j2πfξ} dξ)∗]
= ∫_{−T/2}^{T/2} ∫_{−T/2}^{T/2} E[x(η)x∗(ξ)] e^{−j2πf(η−ξ)} dη dξ
= ∫_{−T/2}^{T/2} ∫_{−T/2}^{T/2} RX(η − ξ) e^{−j2πf(η−ξ)} dη dξ.

Letting τ = η − ξ, we can write

E[|XT (f)|²] = ∫_{−T/2}^{T/2} ∫_{−T/2−ξ}^{T/2−ξ} RX(τ) e^{−j2πfτ} dτ dξ.

²We use x(t) to denote a random process in this section. The capital X(f) is already used to refer to its Fourier transform.


Figure 2.5 shows the region of integration in the (ξ, τ)-plane. By changing the order of integration, we can write

E[|XT (f)|²] = ∫_0^{T} ∫_{−T/2}^{T/2−τ} RX(τ) e^{−j2πfτ} dξ dτ + ∫_{−T}^{0} ∫_{−T/2−τ}^{T/2} RX(τ) e^{−j2πfτ} dξ dτ
= ∫_0^{T} (T − τ) RX(τ) e^{−j2πfτ} dτ + ∫_{−T}^{0} (T + τ) RX(τ) e^{−j2πfτ} dτ
= ∫_{−T}^{T} (T − |τ|) RX(τ) e^{−j2πfτ} dτ.

Figure 2.5: Region of integration for the derivation of the Wiener-Khinchine theorem.

From the definition of GX(f), we have

GX(f) = lim_{T→∞} (1/T) E[|XT (f)|²]
= lim_{T→∞} (1/T) ∫_{−T}^{T} (T − |τ|) RX(τ) e^{−j2πfτ} dτ
= lim_{T→∞} ∫_{−T}^{T} RX(τ) e^{−j2πfτ} dτ − lim_{T→∞} (1/T) ∫_{−T}^{T} |τ| RX(τ) e^{−j2πfτ} dτ
= F{RX(τ)} − lim_{T→∞} (1/T) ∫_{−T}^{T} |τ| RX(τ) e^{−j2πfτ} dτ.

Since ∫ g(τ) dτ ≤ ∫ |g(τ)| dτ for real g(τ), the real part and the imaginary part of the integral in the last equality are each at most ∫_{−∞}^{∞} |τRX(τ)| dτ, which is assumed to be finite.

It follows that the limit in the last equality is equal to zero, yielding GX(f) = F{RX(τ)} as desired. ¤

Appendix: Cyclostationary Processes

A random process X(t) is wide-sense cyclostationary if

E[X(t)] = E[X(t + nT0)],

RX(t, t − τ) = RX(t + nT0, t + nT0 − τ)


for all t, τ ∈ R and n ∈ Z. In other words, for any τ ∈ R, the mean E[X(t)] and RX(t, t − τ), as functions of t, are periodic with period T0.

For a wide-sense cyclostationary process X(t), the PSD is given by

GX(f) = F{〈RX(t, t− τ)〉}

where

⟨RX(t, t − τ)⟩ = (1/T0) ∫_{−T0/2}^{T0/2} RX(t, t − τ) dt

is the average autocorrelation function.

Example 2.9 : Consider Y (t) = X(t) cos(2πf0t), where X(t) is WSS. We compute the mean and the correlation function of Y (t) as follows.

E[Y (t)] = E[X(t) cos(2πf0t)] = E[X(t)] cos(2πf0t)

RY (t, t − τ) = E[X(t) cos(2πf0t) X∗(t − τ) cos(2πf0(t − τ))]
= RX(τ) cos(2πf0t) cos(2πf0(t − τ))
= RX(τ) (cos(2πf0τ) + cos(4πf0t − 2πf0τ)) / 2

Since E[Y (t)] and RY (t, t − τ) are periodic in t with period T0 = 1/f0, it follows that Y (t) is wide-sense cyclostationary.

In addition,

⟨RY (t, t − τ)⟩ = (1/2) RX(τ) cos(2πf0τ),

yielding the PSD

GY (f) = (1/4) GX(f − f0) + (1/4) GX(f + f0).

Note that this is the same PSD as for Y (t) = X(t) cos(2πf0t + Φ), where Φ is uniformly distributed in [0, 2π]. ¤
