chapter 5 probability and statistics


CHAPTER 5 PROBABILITY AND STATISTICS

Definition of statistics: the mathematics of the collection, organization and interpretation of numerical data, especially the analysis of a population by inference from sampling.

Let $P(A)$ denote the probability of an event $A$, which is a subset of a sample space.

5.1 Rules of probability

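1. Complement rule: $P(A^c) = 1 - P(A)$

2. Addition rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

3. For disjoint events, $P(A \cap B) = 0$, thus $P(A \cup B) = P(A) + P(B)$

4. Product rule: if $A$ and $B$ are independent, then $P(A \cap B) = P(A)P(B)$.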

5.2 Conditional probability

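(a) If $A$ and $B$ are any events with $P(B) > 0$ then $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$

(b) If $A$ and $B$ are any events with $P(A) > 0$ then $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$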

5.3 Multiplication rule

If $A$ and $B$ are any events then $P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)$

5.4 Total probability rule

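If $E_1, E_2, \ldots, E_k$ are mutually exclusive and exhaustive events, then

$$P(A) = P(A \cap E_1) + P(A \cap E_2) + \cdots + P(A \cap E_k)$$

$$P(A) = P(A \mid E_1)P(E_1) + P(A \mid E_2)P(E_2) + \cdots + P(A \mid E_k)P(E_k)$$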

5.5 Bayes Theorem

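If $E_1, E_2, \ldots, E_k$ are mutually exclusive events, one of which occurs given that another event $A$ occurs, then

$$P(E_i \mid A) = \frac{P(A \mid E_i)P(E_i)}{P(A \mid E_1)P(E_1) + P(A \mid E_2)P(E_2) + \cdots + P(A \mid E_k)P(E_k)}, \quad i = 1, 2, \ldots, k$$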

Example 5.1 Three machines produce similar car parts. Machine A produces 40% of the total output, and machines B and C produce 25% and 15% respectively. The proportions of the output from each machine that do not conform to the specification are 10% for A, 5% for B and 1% for C. What proportion of the parts that do not conform to the specification are produced by machine A?

Solution

Let D represent the event that a particular part is defective. Then the overall proportion of defective parts is

$$P(D) = P(D \mid A)P(A) + P(D \mid B)P(B) + P(D \mid C)P(C)$$

Using Bayes theorem,

$$P(A \mid D) = \frac{P(D \mid A)P(A)}{P(D)}$$

Example 5.2 Suppose that 0.1% of the people in a certain area have a disease D and that a mass screening test is used to detect cases. The test gives either a positive result or a negative result for each person. In practice the test gives a positive result with probability 99.9% for a person who has D and with probability 0.2% for a person who does not. What is the probability that a person for whom the test is positive actually has the disease?

Solution

Let T represent the event that the test gives a positive result.

Then, with $P(D) = 0.001$, $P(T \mid D) = 0.999$ and $P(T \mid D^c) = 0.002$,

$$P(T) = P(T \mid D)P(D) + P(T \mid D^c)P(D^c) = (0.999)(0.001) + (0.002)(0.999) = 0.002997$$

Using Bayes theorem,

$$P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T)} = \frac{0.000999}{0.002997} = 0.333$$
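The total probability rule and Bayes theorem used in Examples 5.1 and 5.2 can be sketched in a few lines of Python. The function name `posterior` and its argument names are illustrative choices, not notation from these notes; the numbers are those of Example 5.2.

```python
def posterior(prior, likelihood):
    """Return P(E_i | A) for every i, given priors P(E_i) and likelihoods P(A | E_i)."""
    joint = [p * l for p, l in zip(prior, likelihood)]  # P(A | E_i) P(E_i)
    total = sum(joint)                                  # total probability rule: P(A)
    return [j / total for j in joint]                   # Bayes theorem

# Example 5.2: E_1 = has the disease, E_2 = does not; A = positive test
post = posterior([0.001, 0.999], [0.999, 0.002])
p_disease_given_positive = post[0]
print(round(p_disease_given_positive, 3))
```

The same function applies to any number of mutually exclusive and exhaustive events, provided the priors sum to one.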

5.6 Random variables

A random variable (rv) has a sample space of possible numerical values together with a distribution of probabilities.

Examples: (a) the number of defectives in a process (b) number of successful projects.

Random variables can be discrete or continuous.

Discrete random variables and distributions

Definition

If $X$ is a discrete random variable, then $p(x) = P(X = x)$ is called a probability mass function or probability distribution if, for each outcome $x$,

(a) $p(x) \ge 0$

(b) $\sum_{x} p(x) = 1$

Cumulative distribution functions

The cumulative distribution function $F(x)$ for a discrete random variable $X$ with probability distribution $p(x) = P(X = x)$ is

$$F(x) = P(X \le x) = \sum_{t \le x} P(X = t)$$

Properties of the cumulative distribution functions

$F(x)$ satisfies the following properties:

(a) $F(x) = P(X \le x) = \sum_{t \le x} P(X = t)$

(b) $0 \le F(x) \le 1$

(c) If $x \le y$, then $F(x) \le F(y)$

Mean of a discrete random variable

If $X$ is a discrete random variable with probability distribution $p(x) = P(X = x)$, then the mean or expected value of $X$, denoted by $\mu_X$ or $E(X)$, is given by

$$\mu_X = E(X) = \sum_{x} x\,p(x)$$

Variance of a discrete random variable

If $X$ is a discrete random variable with probability distribution $p(x) = P(X = x)$, then the variance of $X$, denoted by $V(X)$ or $\sigma_X^2$, is given by

$$\sigma_X^2 = V(X) = E(X - \mu_X)^2 = \sum_{x} (x - \mu_X)^2\,p(x)$$

Standard deviation of a discrete random variable

The standard deviation of a discrete random variable, denoted by $\sigma_X$, is the positive square root of the variance, $\sigma_X = \sqrt{\sigma_X^2}$.

Example 5.3

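The number of successful projects $X$ per day obtained by a small engineering firm can be described by the following probability distribution:

$$P(X = x) = \begin{cases} \dfrac{x}{10} & \text{for } x = 0, 1, 2, 3, 4 \\ 0 & \text{otherwise} \end{cases}$$

Find the cumulative distribution function for $X$. Find the mean and variance for the number of successful projects per day.

Solution

The cumulative distribution function for $X$ is given by

$$F_X(x) = P(X \le x) = \sum_{t \le x} P(X = t)$$

For $x = 0$, $F(0) = P(X \le 0) = P(0) = \frac{0}{10} = 0$

For $x = 1$, $F(1) = P(X \le 1) = P(0) + P(1) = \frac{0}{10} + \frac{1}{10} = 0.1$

For $x = 2$, $F(2) = P(X \le 2) = P(0) + P(1) + P(2) = \frac{0}{10} + \frac{1}{10} + \frac{2}{10} = 0.3$

For $x = 3$, $F(3) = P(X \le 3) = P(0) + P(1) + P(2) + P(3) = \frac{0}{10} + \frac{1}{10} + \frac{2}{10} + \frac{3}{10} = 0.6$

For $x = 4$, $F(4) = P(X \le 4) = P(0) + P(1) + P(2) + P(3) + P(4) = \frac{0}{10} + \frac{1}{10} + \frac{2}{10} + \frac{3}{10} + \frac{4}{10} = 1.0$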
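Example 5.3 also asks for the mean and variance. A minimal Python sketch, applying the definitions of $F(x)$, $E(X)$ and $V(X)$ from this section to the same distribution, gives them directly; the variable names are illustrative and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import accumulate

# PMF of Example 5.3: P(X = x) = x/10 for x = 0, 1, 2, 3, 4
support = [0, 1, 2, 3, 4]
pmf = [Fraction(x, 10) for x in support]

cdf = list(accumulate(pmf))                      # F(x): running sum of p(t) for t <= x
mean = sum(x * p for x, p in zip(support, pmf))  # E(X) = sum of x p(x)
variance = sum((x - mean) ** 2 * p for x, p in zip(support, pmf))  # V(X)

print([float(F) for F in cdf], float(mean), float(variance))
```

With this distribution the mean is 3 successful projects per day and the variance is 1.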

5.7 Continuous random variables and distributions

Definition

If $X$ is a continuous random variable defined over a set of real numbers, then $f(x)$ is called a probability density function if

(a) $f(x) \ge 0$

(b) $\int_{-\infty}^{\infty} f(x)\,dx = 1$

(c) $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$ where $X$ lies in the interval $[a, b]$

Cumulative distribution functions

The cumulative distribution function $F(x)$ for a continuous random variable $X$ with probability density function $f(x)$ is

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{for } -\infty < x < \infty$$

Properties of the cumulative distribution functions

$F(x)$ satisfies the following properties:

(a) $P(X \le a) = \int_{-\infty}^{a} f(t)\,dt$

(b) $P(X \ge a) = \int_{a}^{\infty} f(t)\,dt$

(c) $P(a \le X \le b) = \int_{a}^{b} f(t)\,dt$

Mean of a continuous random variable

If $X$ is a continuous random variable with probability density function $f(x)$, then the mean or expected value of $X$, denoted by $\mu_X$ or $E(X)$, is given by

$$\mu_X = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$

Variance of a continuous random variable

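If $X$ is a continuous random variable with probability density function $f(x)$, then the variance of $X$, denoted by $V(X)$ or $\sigma_X^2$, is given by

$$\sigma_X^2 = V(X) = E(X - \mu_X)^2 = \int_{-\infty}^{\infty} (x - \mu_X)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu_X^2$$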

Standard deviation of a continuous random variable

The standard deviation of a continuous random variable, denoted by $\sigma_X$, is the positive square root of the variance, $\sigma_X = \sqrt{\sigma_X^2}$.

Example 5.4 Assume that the particle size of an air pollutant (in micrometers) can be described by the following probability function:

$$f_X(x) = \begin{cases} \dfrac{3}{x^4} & \text{for } x > 1 \\ 0 & \text{otherwise} \end{cases}$$

(a) Show that $f(x)$ is a probability density function

(b) Find the cumulative distribution function

(c) Determine the mean and standard deviation

Solution

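(a) $f(x)$ is a probability density function if it is non-negative and satisfies $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

Here $f_X(x) \ge 0$ for all $x$ and

$$\int_{-\infty}^{\infty} f_X(x)\,dx = \int_{1}^{\infty} \frac{3}{x^4}\,dx = \left[ -\frac{1}{x^3} \right]_{1}^{\infty} = 1$$

Therefore $f(x)$ is a probability density function.

(b) The cumulative distribution function for $X$ is given by

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt \quad \text{for } -\infty < x < \infty$$

For $x > 1$,

$$F_X(x) = \int_{1}^{x} \frac{3}{t^4}\,dt = \left[ -\frac{1}{t^3} \right]_{1}^{x} = -\frac{1}{x^3} + \frac{1}{1^3} = 1 - \frac{1}{x^3}$$

and $F_X(x) = 0$ for $x \le 1$.

(c) The mean for $X$ is given by

$$\mu_X = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{1}^{\infty} x \cdot \frac{3}{x^4}\,dx = \int_{1}^{\infty} \frac{3}{x^3}\,dx = \left[ -\frac{3}{2x^2} \right]_{1}^{\infty} = \frac{3}{2} \text{ micrometers}$$

The variance for $X$ is given by

$$\sigma_X^2 = V(X) = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu_X^2 = \int_{1}^{\infty} x^2 \cdot \frac{3}{x^4}\,dx - \left( \frac{3}{2} \right)^2 = \int_{1}^{\infty} \frac{3}{x^2}\,dx - \left( \frac{3}{2} \right)^2$$

$$= \left[ -\frac{3}{x} \right]_{1}^{\infty} - \left( \frac{3}{2} \right)^2 = 3 - \frac{9}{4} = \frac{3}{4} \text{ sq. micrometers}$$

Hence the standard deviation is $\sigma_X = \sqrt{3/4} \approx 0.866$ micrometers.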
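The integrals in Example 5.4 can be checked numerically. The sketch below is one possible approach, not the method of these notes: it maps the infinite range $(1, \infty)$ onto $(0, 1)$ with the substitution $x = 1/u$ and applies the midpoint rule. The helper name `integrate_tail` and the number of subintervals are arbitrary choices.

```python
def integrate_tail(g, n=100_000):
    """Approximate the integral of g over (1, inf) using x = 1/u and the midpoint rule on (0, 1)."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h        # midpoint of the k-th subinterval, never exactly 0
        x = 1.0 / u
        total += g(x) / (u * u)  # dx = du / u^2 after swapping the limits
    return total * h

def pdf(x):
    """Density of Example 5.4, valid for x > 1."""
    return 3.0 / x**4

area = integrate_tail(pdf)                                      # should be close to 1
mean = integrate_tail(lambda x: x * pdf(x))                     # E(X)
variance = integrate_tail(lambda x: x * x * pdf(x)) - mean**2   # E(X^2) - mu^2
std_dev = variance ** 0.5
print(area, mean, variance, std_dev)
```

The numerical values agree with the analytical results above to several decimal places, which is a quick way to catch sign or limit errors in hand integration.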

5.8 Discrete distributions

Bernoulli distribution

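PMF $P(X = x) = p^x (1 - p)^{1 - x}$

Range $x = 0, 1$ and $0 \le p \le 1$

Mean $p$

Variance $p(1 - p)$

Binomial distribution

PMF $P(X = x) = \dbinom{n}{x} p^x (1 - p)^{n - x}$

Range $x = 0, 1, \ldots, n$ and $0 \le p \le 1$

Parameters $n$ and $p$

Mean $np$

Variance $np(1 - p)$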
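A short Python sketch of the binomial PMF follows. It checks numerically that the probabilities sum to one and that the mean and variance match $np$ and $np(1-p)$. The values `n = 5` and `p = 0.1` are hypothetical, chosen only for illustration and not taken from any example in these notes.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 5, 0.1   # hypothetical values, for illustration only
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean = sum(x * q for x, q in enumerate(pmf))                    # should equal n p
variance = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))  # should equal n p (1 - p)
p_at_least_one = 1 - binomial_pmf(0, n, p)                      # complement rule
print(mean, variance, p_at_least_one)
```

The last line shows the usual shortcut for "at least once" questions: subtract the probability of zero occurrences from one.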

Example 5.5 Suppose a road is flooded with probability during a year and not more than one flood occurs

during a year. What is the probability that it will be flooded at least once during a five year period?

Solution Let X be the event a flood occurs in a year.

Then,

Poisson distribution

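PMF $P(X = x) = \dfrac{e^{-\lambda} \lambda^x}{x!}$

Range $x = 0, 1, 2, \ldots$

Parameter $\lambda$

Mean $\lambda$

Variance $\lambda$

If $n$ is large and $p$ is small, the binomial distribution can be approximated by the Poisson distribution with $\lambda = np$.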

Example 5.6 The number of flaws in a thin copper wire follows a Poisson distribution with a mean of 2.3 flaws per mm. (a) Determine the probability of exactly two flaws in 1 mm of wire. (b) Determine the probability of ten flaws in 5 mm of wire.

Solution

(a) Let $X$ be the number of flaws in 1 mm of wire.

Given that $\lambda = 2.3$, thus

$$P(X = 2) = \frac{e^{-2.3} (2.3)^2}{2!} = 0.265$$

(b) Let $X$ be the number of flaws in 5 mm of wire. Then $X$ has a Poisson distribution with $\lambda = 5 \times 2.3 = 11.5$ flaws, so

$$P(X = 10) = \frac{e^{-11.5} (11.5)^{10}}{10!} = 0.113$$
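The two Poisson probabilities of Example 5.6 can be reproduced with a short Python sketch. The function name `poisson_pmf` is an illustrative choice; the only step that needs care is scaling the rate by the length of wire.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

rate_per_mm = 2.3                                  # mean number of flaws per mm
p_two_in_1mm = poisson_pmf(2, rate_per_mm * 1)     # part (a): lambda = 2.3
p_ten_in_5mm = poisson_pmf(10, rate_per_mm * 5)    # part (b): lambda = 11.5
print(round(p_two_in_1mm, 3), round(p_ten_in_5mm, 3))
```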

5.9 Continuous distribution

Normal distribution

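PDF $f(x) = \dfrac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\dfrac{1}{2} \left( \dfrac{x - \mu}{\sigma} \right)^2 \right]$

Range $-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$

Parameters $\mu$: location parameter, $\sigma$: scale parameter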

If $X$ follows a normal distribution then $X \sim N(\mu, \sigma^2)$.

Also, $Z = \dfrac{X - \mu}{\sigma} \sim N(0, 1)$.

5.10 Sample measures and parameter estimates

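Let $X_1, X_2, \ldots, X_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2$. Then the point estimates for $\mu$ and $\sigma^2$ are

$\hat{\mu} = \bar{x}$ where $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ is the sample mean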

[Figure: density curve $f(x)$ plotted against $z$ for $z$ from -6 to 6; vertical axis from 0.0 to 0.5]

and

$\hat{\sigma}^2 = s^2$ where $s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ is the sample variance.

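Thus if $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$ then $\bar{X} \sim N\left( \mu, \dfrac{\sigma^2}{n} \right)$.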

5.11 Confidence interval for the mean based on the normal distribution

(1) Population variance is known

The $100(1 - \alpha)\%$ confidence interval for the mean $\mu$ is given by

$$\bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

where

(a) $\bar{X}$ is the sample mean.

(b) $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th quantile of the standard normal distribution, which is given in Table 1.

Assumptions:

(a) $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from a population which has a normal distribution with mean $\mu$ and variance $\sigma^2$.

(b) The sample size $n$ can be either small or large.

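(2) Population variance is unknown

The $100(1 - \alpha)\%$ confidence interval for the mean $\mu$ is given by

$$\bar{X} - z_{\alpha/2} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \frac{S}{\sqrt{n}}$$

where

(a) $\bar{X}$ is the sample mean and $S$ is the sample standard deviation.

(b) $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th quantile of the standard normal distribution, which is given in Table 1.

Assumptions:

(a) $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from a population which has a normal distribution with mean $\mu$ and variance $\sigma^2$.

(b) The sample size $n$ is large.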

Table 1: Cumulative distribution function for the standard normal distribution


z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359

0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753

0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141

0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517

0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879

0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224

0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549

0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852

0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133

0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389

1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621

1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830

1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015

1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177

1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319

1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441

1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545

1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633

1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706

1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767

2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817

2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857

2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890

2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916

2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936

2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952

2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964

2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974

2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981

2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986

3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990

3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993

3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995

3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997

3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
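Entries of Table 1 can be reproduced from the error function, since $\Phi(z) = \frac{1}{2}\left[1 + \operatorname{erf}(z/\sqrt{2})\right]$. The sketch below uses only the Python standard library; the function name `phi` is an illustrative choice.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Spot checks against Table 1 (row 1.6, column .05; row 1.9, column .06; row 2.5, column .07)
table_checks = {z: round(phi(z), 4) for z in (1.65, 1.96, 2.57)}
print(table_checks)
```

For a general normal variable with mean $\mu$ and standard deviation $\sigma$, the same function is evaluated at $(x - \mu)/\sigma$.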

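[Figure for Table 1: area under the standard normal curve to the left of $z$, $P(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx$]

Example 5.7 A study was carried out to determine the wind speed distribution in Penang. The following monthly wind speed data (measured in m/s) were obtained.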

15.42 12.85 10.28 13.36 15.42 20.56 16.28 25.70 15.42 9.25

10.28 9.25 8.22 11.31 14.91 16.45 13.36 15.42 13.36 12.85

11.31 11.31 12.85 11.82 14.39 15.42 16.96 21.59 15.42 15.42

12.85 12.85 11.82 14.39 12.34 24.67 12.85 20.05 27.24 22.62

Find a 90% confidence interval for the true mean wind speed in Penang.

Solution

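Let $\mu$ be the true mean wind speed (in m/s) in Penang.

Since the sample size is large ($n = 40$), the following confidence interval is used.

Thus the 90% confidence interval for the true population mean is given by

$$\bar{X} - z_{\alpha/2} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \frac{S}{\sqrt{n}}$$

$$\bar{X} - z_{0.05} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + z_{0.05} \frac{S}{\sqrt{n}}$$

$$14.953 - 1.65 \left( \frac{4.489}{\sqrt{40}} \right) \le \mu \le 14.953 + 1.65 \left( \frac{4.489}{\sqrt{40}} \right)$$

$$14.953 - 1.65(0.710) \le \mu \le 14.953 + 1.65(0.710)$$

$$14.953 - 1.172 \le \mu \le 14.953 + 1.172$$

$$13.781 \le \mu \le 16.125$$

Calculations

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_{40}}{40} = \frac{15.42 + 10.28 + \cdots + 15.42 + 22.62}{40} = 14.953$$

$$S^2 = \frac{1}{39} \sum_{i=1}^{40} (X_i - \bar{X})^2 = \frac{(15.42 - 14.953)^2 + (10.28 - 14.953)^2 + \cdots + (22.62 - 14.953)^2}{39} = 20.149, \quad S = 4.489$$

From Table 1, $z_{0.05} = 1.65$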
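The interval of Example 5.7 can be recomputed from the raw data with the Python standard library. This is a sketch under one stated difference: it takes the quantile from `statistics.NormalDist` (about 1.645) instead of the rounded table value 1.65, so the end points differ from the hand calculation in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Monthly wind speed data (m/s) from Example 5.7
wind = [
    15.42, 12.85, 10.28, 13.36, 15.42, 20.56, 16.28, 25.70, 15.42, 9.25,
    10.28, 9.25, 8.22, 11.31, 14.91, 16.45, 13.36, 15.42, 13.36, 12.85,
    11.31, 11.31, 12.85, 11.82, 14.39, 15.42, 16.96, 21.59, 15.42, 15.42,
    12.85, 12.85, 11.82, 14.39, 12.34, 24.67, 12.85, 20.05, 27.24, 22.62,
]

alpha = 0.10                             # 90% confidence level
n = len(wind)
xbar = mean(wind)                        # sample mean
s = stdev(wind)                          # sample standard deviation, divisor n - 1
z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}; Table 1 rounds this to 1.65
half_width = z * s / sqrt(n)
lower, upper = xbar - half_width, xbar + half_width
print(round(xbar, 3), round(s, 3), round(lower, 3), round(upper, 3))
```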

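Example 5.8 The flow discharge of Sungai Kerian (measured in m³/s) was obtained at random. Fifty readings were collected and the mean flow discharge was found to be 3.512 m³/s with a standard deviation of 0.5 m³/s. Construct a 99% confidence interval for the true mean flow discharge of Sungai Kerian.

Solution

Let $\mu$ be the true mean flow discharge of Sungai Kerian.

Since the sample size is large ($n = 50$), the following confidence interval is used.

$$\bar{X} - z_{\alpha/2} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \frac{S}{\sqrt{n}}$$

$$\bar{X} - z_{0.005} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + z_{0.005} \frac{S}{\sqrt{n}}$$

$$3.512 - 2.57 \left( \frac{0.5}{\sqrt{50}} \right) \le \mu \le 3.512 + 2.57 \left( \frac{0.5}{\sqrt{50}} \right)$$

$$3.512 - 2.57(0.071) \le \mu \le 3.512 + 2.57(0.071)$$

$$3.512 - 0.182 \le \mu \le 3.512 + 0.182$$

$$3.330 \le \mu \le 3.694$$

Calculations

$\bar{X} = 3.512$, $n = 50$, $S = 0.5$. From Table 1, $z_{0.005} = 2.57$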

5.12 Confidence intervals for the mean based on the t distribution


n

StX

n

StX

nn 1,2

1,2

where

(a) X is the sample mean.

(b) S is the sample standard deviation.

(c) 1,

2n

t

is the th2

100

quantile of the t distribution with 1n degrees of freedom. The critical

values of the t distribution is given in Table 2.

Assumptions:

(a) n

XXX ,,,21 is the random sample of size n from a population which has a

normal distribution with mean and variance 2 .

(b) The sample size n is small.

Example 5.9

The moisture content (measured in percentage) of clay in Batu Ferringhi was investigated. The following data was

obtained from a random sample.

1.81 2.00 2.74 3.56 2.13

4.64 3.64 4.62 4.47 3.12

Construct a 98% confidence interval for the true moisture content for clay by assuming that the sample is from a

normal distribution.

Solution

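Let $\mu$ be the true mean moisture content (in percentage) for clay.

Since the sample size is small ($n = 10$), the following confidence interval is used.

$$\bar{X} - t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}$$

$$\bar{X} - t_{0.01,\,9} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{0.01,\,9} \frac{S}{\sqrt{n}}$$

$$3.273 - 2.821 \left( \frac{1.091}{\sqrt{10}} \right) \le \mu \le 3.273 + 2.821 \left( \frac{1.091}{\sqrt{10}} \right)$$

$$3.273 - 2.821(0.345) \le \mu \le 3.273 + 2.821(0.345)$$

$$3.273 - 0.973 \le \mu \le 3.273 + 0.973$$

$$2.300 \le \mu \le 4.246$$

Calculations

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_{10}}{10} = \frac{1.81 + 4.64 + \cdots + 2.13 + 3.12}{10} = 3.273$$

$$S^2 = \frac{1}{9} \sum_{i=1}^{10} (X_i - \bar{X})^2 = \frac{(1.81 - 3.273)^2 + (4.64 - 3.273)^2 + \cdots + (3.12 - 3.273)^2}{9} = 1.190, \quad S = 1.091$$

From Table 2, $t_{0.01,\,9} = 2.821$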
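The small-sample interval of Example 5.9 can be checked in the same way. The Python standard library has no t quantile function, so this sketch takes the critical value 2.821 directly from Table 2; everything else is computed from the data.

```python
from math import sqrt
from statistics import mean, stdev

# Moisture content data (%) from Example 5.9
moisture = [1.81, 2.00, 2.74, 3.56, 2.13, 4.64, 3.64, 4.62, 4.47, 3.12]

t_crit = 2.821            # t_{0.01, 9}, read from Table 2 (98% confidence, 9 degrees of freedom)
n = len(moisture)
xbar = mean(moisture)     # sample mean
s = stdev(moisture)       # sample standard deviation, divisor n - 1
half_width = t_crit * s / sqrt(n)
lower, upper = xbar - half_width, xbar + half_width
print(round(xbar, 3), round(s, 3), round(lower, 3), round(upper, 3))
```

Because $t_{0.01,9} = 2.821$ is larger than the corresponding normal quantile, the t interval is wider than a z interval built from the same data would be.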

Table 2: Critical values $t_{\alpha,\nu}$ for the t distribution with $\nu$ degrees of freedom

$\nu \backslash \alpha$ 0.40 0.30 0.20 0.15 0.10 0.05 0.025 0.02 0.015 0.01

1 0.325 0.727 1.376 1.963 3.078 6.314 12.706 15.895 21.205 31.821

2 0.289 0.617 1.061 1.386 1.886 2.920 4.303 4.849 5.643 6.965

3 0.277 0.584 0.978 1.250 1.638 2.353 3.182 3.482 3.896 4.541

4 0.271 0.569 0.941 1.190 1.533 2.132 2.776 2.999 3.298 3.747

5 0.267 0.559 0.920 1.156 1.476 2.015 2.571 2.757 3.003 3.365

6 0.265 0.553 0.906 1.134 1.440 1.943 2.447 2.612 2.829 3.143

7 0.263 0.549 0.896 1.119 1.415 1.895 2.365 2.517 2.715 2.998

8 0.262 0.546 0.889 1.108 1.397 1.860 2.306 2.449 2.634 2.897

9 0.261 0.543 0.883 1.100 1.383 1.833 2.262 2.398 2.574 2.821

10 0.260 0.542 0.879 1.093 1.372 1.813 2.228 2.359 2.528 2.764

11 0.260 0.540 0.876 1.088 1.363 1.796 2.201 2.328 2.491 2.718

12 0.259 0.539 0.873 1.083 1.356 1.782 2.179 2.303 2.461 2.681

13 0.259 0.538 0.870 1.080 1.350 1.771 2.160 2.282 2.436 2.650

14 0.258 0.537 0.868 1.076 1.345 1.761 2.145 2.264 2.415 2.625

15 0.258 0.536 0.866 1.074 1.341 1.753 2.131 2.249 2.397 2.603

16 0.258 0.535 0.865 1.071 1.337 1.746 2.120 2.235 2.382 2.584

17 0.257 0.534 0.863 1.069 1.333 1.740 2.110 2.224 2.368 2.567

18 0.257 0.534 0.862 1.067 1.330 1.734 2.101 2.214 2.356 2.552

19 0.257 0.533 0.861 1.066 1.328 1.729 2.093 2.205 2.346 2.540

20 0.257 0.533 0.860 1.064 1.325 1.725 2.086 2.197 2.336 2.528

21 0.257 0.532 0.859 1.063 1.323 1.721 2.080 2.189 2.328 2.518

22 0.256 0.532 0.858 1.061 1.321 1.717 2.074 2.183 2.320 2.508

23 0.256 0.532 0.858 1.060 1.320 1.714 2.069 2.177 2.313 2.500

24 0.256 0.531 0.857 1.059 1.318 1.711 2.064 2.172 2.307 2.492

25 0.256 0.531 0.856 1.058 1.316 1.708 2.060 2.167 2.301 2.485

26 0.256 0.531 0.856 1.058 1.315 1.706 2.056 2.162 2.296 2.479

27 0.256 0.531 0.855 1.057 1.314 1.703 2.052 2.158 2.291 2.473

28 0.256 0.530 0.855 1.056 1.313 1.701 2.048 2.154 2.286 2.467

29 0.256 0.530 0.854 1.055 1.311 1.699 2.045 2.150 2.282 2.462

30 0.256 0.530 0.854 1.055 1.310 1.697 2.042 2.147 2.278 2.457

40 0.255 0.529 0.851 1.050 1.303 1.684 2.021 2.123 2.250 2.423

60 0.254 0.527 0.848 1.046 1.296 1.671 2.000 2.099 2.223 2.390

120 0.254 0.526 0.845 1.041 1.289 1.658 1.980 2.076 2.196 2.358

$\infty$ 0.253 0.524 0.842 1.036 1.282 1.645 1.960 2.054 2.170 2.326

5.13 Tests of hypotheses for the mean based on the normal distribution

(1)Population variance is known

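Hypotheses

One tail tests: $H_0: \mu = d_0$ against $H_1: \mu > d_0$ (or $H_1: \mu < d_0$)

Two tail tests: $H_0: \mu = d_0$ against $H_1: \mu \ne d_0$

Test statistic

$$Z = \frac{\bar{X} - d_0}{\sqrt{\sigma^2 / n}}$$

Rejection region

One tail tests: reject $H_0$ if $Z > z_{\alpha}$ (or $Z < -z_{\alpha}$)

Two tail tests: reject $H_0$ if $|Z| > z_{\alpha/2}$

Notes:

(a) $d_0$ is a constant.

(b) $\bar{X}$ is the sample mean.

(c) $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th quantile of the standard normal distribution, which is given in Table 1.

Assumptions:

(a) $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from a population which has a normal distribution with mean $\mu$ and variance $\sigma^2$.

(b) The sample size $n$ can be either small or large.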

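(2) Population variance is unknown

Hypotheses

One tail tests: $H_0: \mu = d_0$ against $H_1: \mu > d_0$ (or $H_1: \mu < d_0$)

Two tail tests: $H_0: \mu = d_0$ against $H_1: \mu \ne d_0$

Test statistic

$$Z = \frac{\bar{X} - d_0}{\sqrt{S^2 / n}}$$

Rejection region

One tail tests: reject $H_0$ if $Z > z_{\alpha}$ (or $Z < -z_{\alpha}$)

Two tail tests: reject $H_0$ if $|Z| > z_{\alpha/2}$

Notes:

(a) $d_0$ is a constant.

(b) $\bar{X}$ is the sample mean and $S$ is the sample standard deviation.

(c) $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th quantile of the standard normal distribution, which is given in Table 1.

Assumptions:

(a) $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from a population which has a normal distribution with mean $\mu$ and variance $\sigma^2$.

(b) The sample size $n$ is large.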

Example 5.10

A study was carried out to determine the wind speed distribution in Penang. The following monthly wind speed data (measured in m/s) were obtained.

15.42 12.85 10.28 13.36 15.42 20.56 16.28 25.70 15.42 9.25

10.28 9.25 8.22 11.31 14.91 16.45 13.36 15.42 13.36 12.85

11.31 11.31 12.85 11.82 14.39 15.42 16.96 21.59 15.42 15.42

12.85 12.85 11.82 14.39 12.34 24.67 12.85 20.05 27.24 22.62

Can you conclude that the mean wind speed in Penang is less than 12 m/s? Use $\alpha = 0.10$.

Solution

We will follow the six step procedure to solve this problem.

Step 1: Define the population parameter of interest

Let $\mu$ be the true mean wind speed (in m/s) in Penang.

Since the sample size is large ($n = 40$), the following hypothesis test is used.

Step 2: Define the null and alternative hypotheses

$H_0: \mu = 12$

$H_1: \mu < 12$

Step 3: Calculate the test statistic

$$Z = \frac{\bar{X} - d_0}{\sqrt{S^2 / n}} = \frac{14.953 - 12}{\sqrt{20.149 / 40}} = \frac{2.953}{0.710} = 4.159$$

Calculations

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_{40}}{40} = \frac{15.42 + 10.28 + \cdots + 15.42 + 22.62}{40} = 14.953$$

$$S^2 = \frac{1}{39} \sum_{i=1}^{40} (X_i - \bar{X})^2 = \frac{(15.42 - 14.953)^2 + (10.28 - 14.953)^2 + \cdots + (22.62 - 14.953)^2}{39} = 20.149, \quad S = 4.489$$

Step 4: Determine the rejection region

Reject $H_0$ if $Z < -z_{\alpha} = -z_{0.10} = -1.28$ (From Table 1).

Step 5: Result

Since $Z = 4.159$ is not less than $-1.28$, the null hypothesis cannot be rejected.

Step 6: Conclusion

At $\alpha = 0.10$, there is insufficient evidence to show that the true mean wind speed in Penang is less than 12 m/s.
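The large-sample test statistic can be wrapped in a small helper and applied to summary statistics. The sketch below uses the figures quoted in this section for the wind speed data ($\bar{X} = 14.953$, $S = 4.489$, $n = 40$) and for the Sungai Kerian flow data ($\bar{X} = 3.512$, $S = 0.5$, $n = 50$), with critical values 1.28 and 1.96 taken from Table 1. The function name `z_statistic` is an illustrative choice, and because no intermediate value is rounded, the results can differ from hand calculations in the second or third decimal place.

```python
from math import sqrt

def z_statistic(xbar, d0, s, n):
    """Large-sample test statistic Z = (xbar - d0) / sqrt(s^2 / n)."""
    return (xbar - d0) / sqrt(s**2 / n)

# Wind speed data: H0: mu = 12 against H1: mu < 12, alpha = 0.10
z_wind = z_statistic(14.953, 12, 4.489, 40)
reject_wind = z_wind < -1.28          # lower-tail rejection region, z_0.10 = 1.28

# Flow discharge data: H0: mu = 4 against H1: mu != 4, alpha = 0.05
z_flow = z_statistic(3.512, 4, 0.5, 50)
reject_flow = abs(z_flow) > 1.96      # two-tail rejection region, z_0.025 = 1.96

print(round(z_wind, 2), reject_wind, round(z_flow, 2), reject_flow)
```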

Example 5.11

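The flow discharge of Sungai Kerian (measured in m³/s) was obtained at random. Fifty readings were collected and the mean flow discharge was found to be 3.512 m³/s with a standard deviation of 0.5 m³/s. Show that the true mean flow discharge at Sungai Kerian is not equal to 4 m³/s. Use $\alpha = 0.05$.

Solution

Step 1: Define the population parameter of interest

Let $\mu$ be the true mean flow discharge of Sungai Kerian.

Since the sample size is large ($n = 50$), the following hypothesis test is used.

Step 2: Define the null and alternative hypotheses

$H_0: \mu = 4$

$H_1: \mu \ne 4$

Step 3: Calculate the test statistic

$$Z = \frac{\bar{X} - d_0}{\sqrt{S^2 / n}} \quad \text{where } \bar{X} = 3.512,\ S^2 = 0.25,\ n = 50$$

$$Z = \frac{3.512 - 4}{\sqrt{0.25 / 50}} = \frac{-0.488}{0.071} = -6.873$$

Step 4: Determine the rejection region

Reject $H_0$ if $Z > z_{\alpha/2} = z_{0.025} = 1.96$ or $Z < -z_{\alpha/2} = -z_{0.025} = -1.96$ (From Table 1)

Step 5: Result

Since $Z = -6.873 < -1.96$, the null hypothesis is rejected.

Step 6: Conclusion

At $\alpha = 0.05$, there is sufficient evidence to show that the true mean flow discharge of Sungai Kerian is not equal to 4 m³/s.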

5.14 Test of hypothesis for the mean based on the t distribution


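Hypotheses

One tail tests: $H_0: \mu = d_0$ against $H_1: \mu > d_0$ (or $H_1: \mu < d_0$)

Two tail tests: $H_0: \mu = d_0$ against $H_1: \mu \ne d_0$

Test statistic

$$T = \frac{\bar{X} - d_0}{\sqrt{S^2 / n}}$$

Rejection region

One tail tests: reject $H_0$ if $T > t_{\alpha,\,n-1}$ (or $T < -t_{\alpha,\,n-1}$)

Two tail tests: reject $H_0$ if $|T| > t_{\alpha/2,\,n-1}$

Notes:

(a) $d_0$ is a constant.

(b) $\bar{X}$ is the sample mean.

(c) $S$ is the sample standard deviation.

(d) $t_{\alpha/2,\,n-1}$ is the $100(1 - \alpha/2)$th quantile of the t distribution with $n - 1$ degrees of freedom. The critical values of the t distribution are given in Table 2.

Assumptions:

(a) $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from a population which has a normal distribution with mean $\mu$ and variance $\sigma^2$.

(b) The sample size $n$ is small.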

Example 5.12

The moisture content (measured in percentage) of clay in Batu Ferringhi was investigated. The following data was

obtained from a random sample.

1.81 2.00 2.74 3.56 2.13

4.64 3.64 4.62 4.47 3.12

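Is the mean moisture content greater than 3.0%? Use $\alpha = 0.05$.

Solution

Step 1: Define the population parameter of interest

Let $\mu$ be the true mean moisture content (in percentage) for clay.

Since the sample size is small ($n = 10$), the following hypothesis test is used.

Step 2: Define the null and alternative hypotheses

$H_0: \mu = 3.0$

$H_1: \mu > 3.0$

Step 3: Calculate the test statistic

$$T = \frac{\bar{X} - d_0}{\sqrt{S^2 / n}} = \frac{3.273 - 3.0}{\sqrt{1.190 / 10}} = \frac{0.273}{0.345} = 0.791$$

Calculations

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_{10}}{10} = \frac{1.81 + 4.64 + \cdots + 2.13 + 3.12}{10} = 3.273$$

$$S^2 = \frac{1}{9} \sum_{i=1}^{10} (X_i - \bar{X})^2 = \frac{(1.81 - 3.273)^2 + (4.64 - 3.273)^2 + \cdots + (3.12 - 3.273)^2}{9} = 1.190, \quad S = 1.091$$

Step 4: Determine the rejection region

Reject $H_0$ if $T > t_{\alpha,\,n-1} = t_{0.05,\,9} = 1.833$ (From Table 2).

Step 5: Result

Since $T = 0.791 < 1.833$, the null hypothesis cannot be rejected.

Step 6: Conclusion

At $\alpha = 0.05$, there is insufficient evidence to show that the true mean moisture content (in percentage) for clay is greater than 3%.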
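The t test of Example 5.12 can be recomputed from the raw data with a few lines of Python. This sketch divides $S^2$ by the sample size $n = 10$, as the test statistic in this section requires, and takes the critical value 1.833 from Table 2 because the standard library provides no t quantile.

```python
from math import sqrt
from statistics import mean, stdev

# Moisture content data (%) from Example 5.12
moisture = [1.81, 2.00, 2.74, 3.56, 2.13, 4.64, 3.64, 4.62, 4.47, 3.12]

d0 = 3.0                  # hypothesised mean under H0
t_crit = 1.833            # t_{0.05, 9}, read from Table 2
n = len(moisture)
t_stat = (mean(moisture) - d0) / (stdev(moisture) / sqrt(n))
reject = t_stat > t_crit  # upper-tail rejection region for H1: mu > 3.0
print(round(t_stat, 3), reject)
```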

5.15 Sample correlation

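Correlation measures the linear relationship between two variables, $X$ and $Y$.

The sample correlation coefficient of $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, denoted by $r$, is given by

$$\hat{\rho} = r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{\sqrt{\left[ \sum_{i=1}^{n} X_i^2 - \dfrac{\left( \sum_{i=1}^{n} X_i \right)^2}{n} \right] \left[ \sum_{i=1}^{n} Y_i^2 - \dfrac{\left( \sum_{i=1}^{n} Y_i \right)^2}{n} \right]}}$$

The strength of the linear relationship is determined by the following:

If $0.80 \le |r| \le 1.00$ then the relationship is very strong.

If $0.60 \le |r| \le 0.79$ then the relationship is strong.

If $0.40 \le |r| \le 0.59$ then the relationship is moderate.

If $0.20 \le |r| \le 0.39$ then the relationship is weak.

If $0.00 \le |r| \le 0.19$ then the relationship is very weak.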

Example 5.13

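The cost $Y$ of a manufactured product usually depends on the lot size $X$. The following data give the cost of the product and its lot size:

Y 30 70 140 270 530 1000 2000 3000

X 1 5 10 25 50 100 250 500

Find the value of the correlation coefficient for the above data.

Solution

The correlation coefficient between $Y$ and $X$ is given by

$$r = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{\sqrt{\left[ \sum_{i=1}^{n} X_i^2 - \dfrac{\left( \sum_{i=1}^{n} X_i \right)^2}{n} \right] \left[ \sum_{i=1}^{n} Y_i^2 - \dfrac{\left( \sum_{i=1}^{n} Y_i \right)^2}{n} \right]}}$$

$$= \frac{2135030 - \dfrac{(941)(7040)}{8}}{\sqrt{\left[ 325751 - \dfrac{941^2}{8} \right] \left[ 14379200 - \dfrac{7040^2}{8} \right]}}$$

$$= \frac{1306950}{(463.75)(2860.8)} = \frac{1306950}{1326696} = 0.985$$

Therefore, there is a very strong linear relationship between cost and lot size.

Calculations

$n = 8$, $\sum_{i=1}^{8} X_i = 941$, $\sum_{i=1}^{8} Y_i = 7040$, $\sum_{i=1}^{8} X_i Y_i = 2135030$, $\sum_{i=1}^{8} X_i^2 = 325751.00$, $\sum_{i=1}^{8} Y_i^2 = 14379200$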
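The computational form of the correlation coefficient translates directly into code. The sketch below applies it to the lot size and cost data of Example 5.13; the function name `correlation` is an illustrative choice.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation coefficient r from the sums used in Section 5.15."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = sxy - sx * sy / n
    denominator = sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
    return numerator / denominator

lot_size = [1, 5, 10, 25, 50, 100, 250, 500]       # X
cost = [30, 70, 140, 270, 530, 1000, 2000, 3000]   # Y
r = correlation(lot_size, cost)
print(round(r, 3))
```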

5.16 Simple linear regression

Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be $n$ pairs of random variables. Then the simple linear regression model is given by

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where

$Y_i$ is the dependent or response variable

$X_i$ is the independent or regressor or explanatory or predictor variable

$\beta_0$ is the intercept of the regression model

$\beta_1$ is the slope of the regression model

$\varepsilon_i$ is the random error term

Assumptions

The assumptions on the random error term are:

(a) $E(\varepsilon_i) = 0$

(b) $V(\varepsilon_i) = \sigma^2$ (a constant)

(c) The probability distribution is normal

(d) The random error terms are independent

Method of least squares

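The method of least squares can be used to estimate the values of the intercept ($\beta_0$) and slope ($\beta_1$) parameters. This method minimizes the sum of squares of the random error term, that is

$$\min L = \min \sum_{i=1}^{n} \varepsilon_i^2 = \min \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$

Hence,

$$\frac{\partial L}{\partial \beta_0} = -2 \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$

$$\frac{\partial L}{\partial \beta_1} = -2 \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) X_i = 0$$

Simplifying yields

$$n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i$$

$$\hat{\beta}_0 \sum_{i=1}^{n} X_i + \hat{\beta}_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i$$

Solving the two equations yields

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \quad \text{and} \quad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{\sum_{i=1}^{n} X_i^2 - \dfrac{\left( \sum_{i=1}^{n} X_i \right)^2}{n}}$$

where $\bar{Y} = \dfrac{\sum_{i=1}^{n} Y_i}{n}$ and $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$.

Thus the fitted or estimated regression model is

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i, \quad i = 1, 2, \ldots, n$$

$e_i = Y_i - \hat{Y}_i$ is called the residual.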

Example 5.14

The yield of a chemical process (in percentage) is hypothesized to be linearly related to the amount of catalyst (in grams). Let $Y$ denote the yield of the chemical process and $X$ the amount of catalyst. The data are given below.

X 0.9 1.4 1.6 1.7 1.8 2.0 2.1

Y 60.54 63.86 63.76 60.15 66.66 71.66 70.81

Fit a simple linear regression model.

Solution

The following simple linear regression model is fitted

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, 7$$

where

$Y_i$ is the yield of the chemical process

$X_i$ is the amount of catalyst

By using the least squares method, the estimates for $\beta_0$ and $\beta_1$ are

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{\sum_{i=1}^{n} X_i^2 - \dfrac{\left( \sum_{i=1}^{n} X_i \right)^2}{n}} = \frac{760.17 - 751.5086}{19.87 - 18.8929} = \frac{8.6614}{0.9771} = 8.8644$$

and

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 65.3486 - 8.8644(1.643) = 65.3486 - 14.5642 = 50.7844$$

Therefore the fitted simple linear regression model is $\hat{Y}_i = 50.784 + 8.864 X_i$ for $i = 1, 2, \ldots, 7$
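The least squares formulas of Section 5.16 can be sketched as a small Python function and applied to the catalyst data of Example 5.14. The function name `fit_line` is an illustrative choice. Because the code carries full precision for $\bar{X}$, the intercept differs from the hand calculation in the third decimal place.

```python
def fit_line(x, y):
    """Least squares estimates (intercept, slope) for Y = b0 + b1 X."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    slope = (sxy - sx * sy / n) / (sxx - sx**2 / n)
    intercept = sy / n - slope * sx / n     # b0 = Ybar - b1 * Xbar
    return intercept, slope

catalyst = [0.9, 1.4, 1.6, 1.7, 1.8, 2.0, 2.1]                 # X, grams
yield_pct = [60.54, 63.86, 63.76, 60.15, 66.66, 71.66, 70.81]  # Y, percent

b0, b1 = fit_line(catalyst, yield_pct)
residuals = [y - (b0 + b1 * x) for x, y in zip(catalyst, yield_pct)]  # e_i = Y_i - fitted Y_i
print(round(b0, 3), round(b1, 3))
```

A useful sanity check is that the residuals of a least squares fit with an intercept sum to zero, up to floating point error.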

Example 5.15

A study was conducted to determine the relationship between bridge pier scour depth, $D$, and discharge intensity, $q$. A model of the form $D = \beta_0 q^{\beta_1}$ was proposed. The following data were obtained:

D q D q D q D q

35.67 52.51 12.62 11.99 20.73 25.56 11.48 13.22

31.71 52.04 9.76 10.33 11.24 7.39 8.71 11.21

17.84 22.58 8.54 8.36 8.80 6.71 4.94 2.61

14.63 8.51 13.87 8.24 12.44 13.28 10.07 13.21


12.71 11.15 11.60 6.29 9.20 6.49 5.50 1.62

13.72 13.75 19.51 22.03 9.76 6.42 7.13 7.72

12.88 14.31 11.89 11.15 11.42 7.78 6.85 4.68

19.35 9.20 13.72 18.59 11.22 11.85 4.00 3.40

11.92 8.60 11.89 13.66 10.47 9.78 4.07 4.00

14.98 11.43 12.80 15.99 9.48 7.48 4.08 3.18

Determine the simple linear regression model for this problem.

Solution

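The proposed model is given by

$$D = \beta_0 q^{\beta_1}$$

The above model can be transformed into a simple linear regression model by taking natural logarithms as follows:

$$\ln D = \ln\left( \beta_0 q^{\beta_1} \right)$$

$$\ln D = \ln \beta_0 + \ln q^{\beta_1}$$

$$\ln D = \ln \beta_0 + \beta_1 \ln q$$

Letting $Y_i = \ln D$, $\beta_0^* = \ln \beta_0$ and $X_i = \ln q$, we obtain the following linear regression model

$$Y_i = \beta_0^* + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, 40$$

The following data give the new values for $Y_i = \ln D$ and $X_i = \ln q$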

iY iX iY iX iY iX iY iX

3.57 3.96 2.54 2.48 3.03 3.24 2.44 2.58

3.46 3.95 2.28 2.34 2.42 2.00 2.16 2.42

2.88 3.12 2.14 2.12 2.17 1.90 1.60 0.96

2.68 2.14 2.63 2.11 2.52 2.59 2.31 2.58

2.54 2.41 2.45 1.84 2.22 1.87 1.70 0.48

2.62 2.62 2.97 3.09 2.28 1.86 1.96 2.04

2.56 2.66 2.48 2.41 2.44 2.05 1.92 1.54

2.96 2.22 2.62 2.92 2.42 2.47 1.39 1.22

2.48 2.15 2.48 2.61 2.35 2.28 1.40 1.39

2.71 2.44 2.55 2.77 2.25 2.01 1.41 1.16

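By using the least squares method, the estimates for $\beta_0^*$ and $\beta_1$ are

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{\sum_{i=1}^{n} X_i^2 - \dfrac{\left( \sum_{i=1}^{n} X_i \right)^2}{n}} = \frac{230.09 - 218.4492}{226.25 - 207.1615} = \frac{11.6408}{19.0885} = 0.6098$$

and

$$\hat{\beta}_0^* = \bar{Y} - \hat{\beta}_1 \bar{X} = 2.3997 - 0.6098(2.2757) = 2.3997 - 1.3877 = 1.012$$

Here $\beta_0^* = \ln \beta_0$, so $\hat{\beta}_0 = e^{\hat{\beta}_0^*} = e^{1.012} = 2.7511$

Therefore the fitted model is $\hat{D} = \hat{\beta}_0 q^{\hat{\beta}_1} = 2.7511\, q^{0.6098}$ for $i = 1, 2, \ldots, 40$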

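Calculations

$$\bar{Y} = \frac{\sum_{i=1}^{40} Y_i}{40} = \frac{3.57 + 3.46 + \cdots + 1.40 + 1.41}{40} = \frac{95.99}{40} = 2.3997$$

$$\bar{X} = \frac{\sum_{i=1}^{40} X_i}{40} = \frac{3.96 + 3.95 + \cdots + 1.39 + 1.16}{40} = \frac{91.03}{40} = 2.2757$$

$$\sum_{i=1}^{40} Y_i X_i = (3.57)(3.96) + (3.46)(3.95) + \cdots + (1.40)(1.39) + (1.41)(1.16) = 230.09$$

$$\frac{\sum_{i=1}^{40} Y_i \sum_{i=1}^{40} X_i}{n} = \frac{(95.99)(91.03)}{40} = \frac{8737.9697}{40} = 218.4492$$

$$\sum_{i=1}^{40} X_i^2 = 3.96^2 + 3.95^2 + 3.12^2 + \cdots + 1.22^2 + 1.39^2 + 1.16^2 = 15.69 + 15.62 + 9.72 + \cdots + 1.50 + 1.92 + 1.34 = 226.25$$

$$\frac{\left( \sum_{i=1}^{40} X_i \right)^2}{n} = \frac{91.03^2}{40} = \frac{8286.4609}{40} = 207.1615$$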
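The log transformation used in Example 5.15 can be illustrated with a short Python sketch. To keep the check exact, it uses hypothetical noise-free data generated from $D = 2 q^{0.5}$ and not the scour data above; the function name `fit_power_law` and the demonstration values are assumptions made purely for illustration.

```python
from math import exp, log

def fit_power_law(q, d):
    """Fit D = b0 * q**b1 by least squares on ln D = ln b0 + b1 ln q."""
    x = [log(v) for v in q]                 # X_i = ln q
    y = [log(v) for v in d]                 # Y_i = ln D
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    slope = (sxy - sx * sy / n) / (sxx - sx**2 / n)
    intercept = sy / n - slope * sx / n     # estimate of ln b0
    return exp(intercept), slope            # back-transform the intercept

# Hypothetical demonstration data following D = 2 q^0.5 exactly
q_demo = [1.0, 2.0, 4.0, 8.0, 16.0]
d_demo = [2.0 * v**0.5 for v in q_demo]

b0, b1 = fit_power_law(q_demo, d_demo)
print(round(b0, 4), round(b1, 4))
```

Only the intercept needs back-transforming with the exponential; the slope of the log-log line is the exponent $\beta_1$ itself.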
