
CHEE 6330 Foundation of Mathematical Methods in Chemical Engineering

Learning from Data

Lecture Notes

© Michael Nikolaou Chemical & Biomolecular Engineering Department


How To Use These Notes:

As the cover page suggests, this text is a set of Lecture Notes, NOT a textbook! A number of sources, in addition to the textbook, as well as the author's personal experience, have served as its basis.

While certain topics covered in detail in the required textbook of the course are presented rather telegraphically, others are elaborated on, particularly when they refer to material not covered in the textbook. In many places throughout the notes some space has been intentionally left blank, for the student to understand a certain topic by being forced to fill in the missing material. That is frequently done during lecture time. That is why lectures are important!

In many other places assignments are given for Homework Not To Hand In (HWNTHI). Please make sure that you complete all of them!

The examples have been carefully selected to correspond to a variety of problems of interest to the evolving nature of chemical engineering. While the emphasis in these examples is on mathematical methods, the physical picture in the background is also important and should be understood.

There are three basic software tools used throughout: Matlab, Mathematica, and Excel. Each does certain tasks particularly well, while being adequate for others. While the particular software or programming language a student learns is not important, it is important to be familiar with at least one basic computational tool, along with the mathematical and programming principles of computation. The code included with some examples is intentionally kept simple, to illustrate concepts. Professional code is a lot more complicated, although the numerical recipe involved is usually not very different.


Table of Contents

1. IN LIEU OF PREFACE
2. WHAT MAY BE LEARNED FROM DATA?
3. AD HOC METHODS FOR PARAMETER ESTIMATION
4. QUICK OVERVIEW OF REGRESSION (LEAST-SQUARES) BASICS
   4.1 LEAST-SQUARES METHODS
   4.2 LINEAR REGRESSION
      4.2.1 Straight line fit
      4.2.2 Some questions raised in EXAMPLE 20
      4.2.3 Basis for answers to preceding questions (in the sequel)
      4.2.4 Multiple linear regression
   4.3 NONLINEAR REGRESSION
      4.3.1 The Gauss-Newton method
5. BACKGROUND ON PROBABILITY DISTRIBUTIONS
   5.1 UNIVARIATE DISTRIBUTIONS
      5.1.1 The normal distribution (EXAMPLE 27)
      5.1.2 The binomial distribution (EXAMPLE 26)
   5.2 MULTIVARIATE DISTRIBUTIONS
      5.2.1 The multivariate normal distribution
   5.3 IMPORTANCE OF NORMAL DISTRIBUTION
6. SAMPLE STATISTICS
   6.1 POINT ESTIMATION
      6.1.1 Population average (μ) estimation
      6.1.2 Population variance (σ²) estimation
   6.2 INTERVAL ESTIMATION
      6.2.1 Confidence interval for estimate of population average (μ): The easy way (good for large samples)
      6.2.2 Confidence interval for estimate of population average (μ): The right way (good for both small and large samples)
      6.2.3 Selecting the number of measurements
      6.2.4 Detecting measurement outliers
      6.2.5 Confidence interval estimation for population variance (σ²)
7. PROPAGATION OF ERRORS
   7.1 LINEAR MODEL
   7.2 NONLINEAR MODEL
8. REGRESSION & CORRELATION
   8.1 LINEAR REGRESSION = LINEAR LEAST SQUARES
   8.2 PROPERTIES OF LEAST-SQUARES ESTIMATORS
   8.3 CONFIDENCE INTERVALS IN LEAST SQUARES FOR STRAIGHT LINE
   8.4 REPEATED MEASUREMENTS AND LACK OF FIT
   8.5 CORRELATION
   8.6 MULTIPLE LINEAR REGRESSION
      8.6.1 General least squares
      8.6.2 Polynomial least squares
      8.6.3 Multiple linear least squares
   8.7 CONFIDENCE INTERVALS AND HYPOTHESIS TESTING IN MULTIPLE LINEAR REGRESSION
   8.8 NONLINEAR REGRESSION
   8.9 BUILDING MODELS: FITTING DATA VS. MAKING PREDICTIONS
9. DESIGN OF EXPERIMENTS FOR EMPIRICAL MODELING
   9.1 BASICS
   9.2 EXPERIMENT DESIGN
   9.3 COMPREHENSIVE VS. SEQUENTIAL APPROACH TO EXPERIMENTAL INVESTIGATIONS
   9.4 CRITIQUE ON MATHEMATICAL THEORIES OF OPTIMAL EXPERIMENT DESIGN [BHH, CHAPTER 9]
   9.5 TWO-LEVEL FACTORIAL DESIGNS
      9.5.1 What is a factorial experiment?
      9.5.2 Why factorial designs at two levels?
      9.5.3 What are deviation variables and why use them?
      9.5.4 Why factorial and not one-at-a-time experiments?
      9.5.5 Effects of factors and interactions among factors
      9.5.6 Analysis of factorials through visual inspection

Notation:
- Uppercase, boldface: matrices, e.g. M
- Lowercase, boldface: vectors, e.g. v
- Lowercase, italics: scalars, e.g. f
- Uppercase, italics: random variables, e.g. X


1. IN LIEU OF PREFACE

In November 1964 Richard P. Feynman (1918-1988), one of the most brilliant theoretical physicists, Nobel laureate, and distinguished educator, was invited to deliver the Messenger Lectures at Cornell University. This is an excerpt from his lectures, taken from "The Character of Physical Law" by Richard P. Feynman, MIT Press, 1967:¹

To summarize, I would use the words of Jeans, who said that "the Great Architect seems to be a mathematician". To those who do not know mathematics it is difficult to get across a real feeling as to the beauty, the deepest beauty, of nature. C.P. Snow talked about two cultures. I really think that those two cultures separate people who have and people who have not had this experience of understanding mathematics well enough to appreciate nature once.

It is too bad that it has to be mathematics, and that mathematics is hard for some people. It is reputed - I do not know if it is true - that when one of the kings was trying to learn geometry from Euclid he complained that it was difficult. And Euclid said, "There is no royal road to geometry". And there is no royal road. Physicists cannot make a conversion to any other language. If you want to learn about nature, to appreciate nature, it is necessary to understand the language that she speaks in. She offers her information only in one form; we are not so unhumble as to demand that she change before we pay any attention.

All the intellectual arguments that you can make will not communicate to deaf ears what the experience of music really is. In the same way all the intellectual arguments in the world will not convey an understanding of nature to those of "the other culture". Philosophers may try to teach you by telling you qualitatively about nature. I am trying to describe her. But it is not getting across because it is impossible. Perhaps it is because their horizons are limited in this way that some people are able to imagine that the center of the universe is man.

Also of interest on the subject: Eugene Wigner, "The Unreasonable Effectiveness of Mathematics in the Natural Sciences," Communications in Pure and Applied Mathematics, vol. 13, No. I (February 1960).²

¹ http://www.physicsteachers.com/pdf/The_Character_of_Physical_Law.pdf
² http://www.dartmouth.edu/~matc/MathDrama/reading/Wigner.html


2. WHAT MAY BE LEARNED FROM DATA?

In increasing difficulty:

- Parameter values
- Relationships among variables
- Governing "laws"

- Ad hoc calculations may be used to estimate parameter values based on heuristics, often with the help of a plot.
- Statistical methods may be used to provide more accurate estimates (i.e. with minimized effect of measurement noise), based on well-prescribed techniques.
- Heuristics along with statistics can be used to design experiments (when feasible) that allow parameter estimation with small error.
- Developing the form of a relationship among variables is a much more challenging problem than mere parameter estimation.
- For some relationships among variables, the precision is so high, the agreement with other facts so deep, and the validity so universal, that we can call such relationships scientific laws.
- Developing scientific laws is highly unpredictable and rather infrequent.
- Developing mathematical models based on established scientific laws and available data is now done routinely.


EXAMPLE 1 – MEASURE FLUID VISCOSITY USING A U-TUBE VISCOMETER

Infer viscosity by measuring the time for the liquid level to drop from A to B as a liquid flows through the capillary to the collection bulb.

Figure 1. U-tube viscometer (image source: Wikipedia)

EXAMPLE 2 – EFFECT OF HUMIDITY ON SOLVENT EVAPORATION

The effect of humidity ($x$) on the extent of solvent evaporation ($y$) in water-reducible paints during sprayout is assumed to be $y = ax + b$. The following data have been collected.³

- What are the best estimates of $a, b$?
- Does humidity have an effect on the extent of solvent evaporation?

Figure 2. Collected data for solvent evaporation at different humidity levels.

³ Data from Journal of Coating Technology, 65, 1983, via Milton and Arnold.



EXAMPLE 3 – PARAMETERS FOR GROWTH AND SATURATION OF BACTERIAL COLONY

Estimate $\{K_g, N_{\max}\}$ for the dynamics of bacterial population growth according to

$$\frac{dN}{dt} = K_g N(t)\left(1 - \frac{N(t)}{N_{\max}}\right) \qquad (1)$$

Figure 3. Logistic growth of bacterial population corresponding to the solution of eqn. (1)

EXAMPLE 4 – PARAMETERS OF MICHAELIS-MENTEN KINETICS

Estimate $\{r_{\max}, s_{50}\}$ of the Michaelis-Menten expression

$$r = \frac{r_{\max}\, s}{s_{50} + s} \qquad (2)$$

for the kinetics of the enzyme-catalyzed biological reaction Substrate → Products.

Figure 4. Michaelis-Menten kinetics for biological reactions according to eqn. (2)



EXAMPLE 5 – STEADY-STATE GAIN MATRIX OF DISTILLATION COLUMN

Estimate the parameters $\{K_{11}, K_{12}, K_{21}, K_{22}\}$ for the effect of the reflux and boil-up rates, $L, V$, on the top and bottom concentrations, $y_D, x_B$, of a high-purity binary distillation column modeled as

$$\underbrace{\begin{bmatrix} y_D \\ x_B \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}}_{\mathbf{G}} \underbrace{\begin{bmatrix} L \\ V \end{bmatrix}}_{\mathbf{m}} \qquad (3)$$

Figure 5. Binary distillation column



EXAMPLE 6 – GAIN AND TIME CONSTANT OF DYNAMIC MODEL FOR TANK HEATER

Estimate $\{K, \tau\}$, with

$$K \triangleq \frac{UA}{F \rho c_p + UA}, \qquad \tau \triangleq \frac{V/F}{1 + \frac{UA}{F \rho c_p}}$$

for the heater model (in Laplace domain)

$$T'(s) = \underbrace{\frac{K}{\tau s + 1}}_{\text{transfer function}} \underbrace{T_c'(s)}_{\substack{\text{effect of forcing function} \\ \text{(manipulated input)}}} + \underbrace{\frac{\tau}{\tau s + 1} T'(0)}_{\text{effect of initial condition}} \qquad (4)$$

Figure 6. Tank heater

EXAMPLE 7 – TRANSFER MATRIX OF DISTILLATION COLUMN

Find the transfer functions in the distillation column model (Figure 5)

$$\underbrace{\begin{bmatrix} y_D(s) \\ x_B(s) \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix} G_{11}(s) & G_{12}(s) \\ G_{21}(s) & G_{22}(s) \end{bmatrix}}_{\mathbf{G}} \underbrace{\begin{bmatrix} L(s) \\ V(s) \end{bmatrix}}_{\mathbf{m}} \qquad (5)$$



EXAMPLE 8 – BLACK BODY RADIATION

Given the data in Figure 7, what equation captures the relationship between black-body radiation intensity $B$ and wavelength $\lambda$ at different temperatures $T$?

Figure 7. Experimental data on the intensity of black-body radiation as a function of wavelength at 3000, 4000, and 5000 K

EXAMPLE 9 – THE ORBITAL PRECESSION OF MERCURY⁴

Newton's laws of gravity suggest that a planet's orbit would not be a fixed ellipse (with the sun at one focus) but rather a gradually rotating ellipse, due to the gravitational influence of other planets (Figure 8). The rate of this rotation (called orbital precession) can be measured very accurately. However, in 1859, Urbain Le Verrier⁵ discovered that the orbital precession of Mercury was slightly faster than predicted by Newton's laws of gravity, even after all the effects of the other planets had been accounted for. The effect is small (roughly 43 arc seconds of rotation per century), but well above the measurement error (roughly 0.1 arc seconds per century). What equations can be used to account for these measurements?

Figure 8. Fixed elliptical orbit (red) and orbital precession (blue) for a planet orbiting the sun. (Image not drawn to scale, to make orbital precession easily visible.)

⁴ Excerpted from http://en.wikipedia.org/wiki/Two-body_problem_in_general_relativity.
⁵ French mathematician who specialized in celestial mechanics; best known for his part in the discovery of the planet Neptune.



3. AD HOC METHODS FOR PARAMETER ESTIMATION

- Capitalize on model structure
- Simple visual interpretation

EXAMPLE 10 – U-TUBE VISCOMETER (EXAMPLE 1)

Plot collected measurements; calculate average.

EXAMPLE 11 – EFFECT OF HUMIDITY ON SOLVENT EVAPORATION (EXAMPLE 2)

Straight line fit using a calculator

[Figure: histogram with frequency and cumulative % of repeated efflux-time measurements for EXAMPLE 10]

[Figure: solvent evaporation (% wt) vs. relative humidity (%) with fitted straight line for EXAMPLE 11]


EXAMPLE 12 – GROWTH AND SATURATION OF BACTERIAL COLONY (EXAMPLE 3)

Eqn. (1) ⇒ $N(t) \to N_{\max}$ for $t \gg 0$ ⇒ estimate $N_{\max}$ from long-term asymptote:

$\hat N_{\max} \approx$ _______

Eqn. (1) ⇒ $N(t) \approx N(0)\exp[K_g t]$ ⇒ $\log[N(t)] \approx \log[N(0)] + K_g t$ for small $t$ ⇒ estimate $K_g$ from plot of $\log[N(t)]$ vs. $t$:

$\hat K_g \approx$ _______

[Plots: bacterial population N vs. t (hours) on linear and logarithmic scales]
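The ad hoc recipe above is easy to script. A minimal Matlab sketch, assuming column vectors t (hours) and N (population counts such as those plotted above) are already in the workspace; the 3-point asymptote window and 5-point early-time window are arbitrary choices for illustration:

Nmax_hat = mean(N(end-2:end));              % long-term asymptote of N(t)
early = 1:5;                                % crude "small t" window
c = polyfit(t(early), log(N(early)), 1);    % straight-line fit to log N(t) vs. t
Kg_hat = c(1);                              % slope = estimated growth constant Kg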


EXAMPLE 13 – PARAMETERS OF MICHAELIS-MENTEN KINETICS (EXAMPLE 4)

Eqn. (2) ⇒ $r(s) \to r_{\max}$ for $s \gg 0$ ⇒ estimate $r_{\max}$ from long-term asymptote:

$\hat r_{\max} \approx$ ____

Eqn. (2) ⇒ $r(s_{50}) = r_{\max}/2$ ⇒ estimate $s_{50}$ as the value of $s$ at which $r = r_{\max}/2$:

$\hat s_{50} \approx$ ____

Figure 9. Collected experimental data for enzymatic reaction rate as function of substrate.

- Convergence of $r$ to $r_{\max}$ is slow.
- Estimates of $r_{\max}, s_{50}$ are not very accurate.
- Alternatives?



Eqn. (2) ⇒ alternatives to plotting $r$ vs. $s$:

Lineweaver-Burk plot: $\dfrac{1}{r} = \dfrac{1}{r_{\max}} + \dfrac{s_{50}}{r_{\max}}\dfrac{1}{s}$. Plot $\dfrac{1}{r}$ vs. $\dfrac{1}{s}$: $y \approx 2.0 + 4.3x$ ⇒ $\hat r_{\max} \approx$ ____, $\hat s_{50} \approx$ ____

Hanes-Woolf plot: $\dfrac{s}{r} = \dfrac{s_{50}}{r_{\max}} + \dfrac{1}{r_{\max}}s$. Plot $\dfrac{s}{r}$ vs. $s$: $y \approx 3.8 + 2.0x$ ⇒ $\hat r_{\max} \approx$ ____, $\hat s_{50} \approx$ ____

Eadie-Hofstee plot: $r = r_{\max} - s_{50}\dfrac{r}{s}$. Plot $r$ vs. $\dfrac{r}{s}$: $y \approx 0.50 - 2.1x$ ⇒ $\hat r_{\max} \approx$ ____, $\hat s_{50} \approx$ ____

[Plots: 1/r vs. 1/s (Lineweaver-Burk); s/r vs. s (Hanes-Woolf); r vs. r/s (Eadie-Hofstee)]
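All three linearizations are straight-line fits, so each can be computed with polyfit. A minimal Matlab sketch, assuming column vectors s and r with the measurements of Figure 9 are already in the workspace:

cLB = polyfit(1./s, 1./r, 1);    % Lineweaver-Burk: slope = s50/rmax, intercept = 1/rmax
rmaxLB = 1/cLB(2); s50LB = cLB(1)*rmaxLB;
cHW = polyfit(s, s./r, 1);       % Hanes-Woolf: slope = 1/rmax, intercept = s50/rmax
rmaxHW = 1/cHW(1); s50HW = cHW(2)*rmaxHW;
cEH = polyfit(r./s, r, 1);       % Eadie-Hofstee: slope = -s50, intercept = rmax
rmaxEH = cEH(2); s50EH = -cEH(1);

The three plots weight the measurement noise differently, so the three estimate pairs will generally not coincide exactly.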


EXAMPLE 14 – GAIN MATRIX OF DISTILLATION COLUMN (EXAMPLE 5)

Response of $y_D, x_B$ to unit-step change in each of $L, V$ ⇒ estimates of $\{K_{11}, K_{12}, K_{21}, K_{22}\}$

[Plots: step responses L → yD, V → yD, L → xB, V → xB over t = 0 to 200]


EXAMPLE 15 – GAIN AND TIME CONSTANT OF TANK HEATER (EXAMPLE 6)

Eqn. (4) ⇒ response of $T'$ to a step change of magnitude $M$ in $T_c'$ is

$$\frac{T'(t)}{M} = K\left(1 - \exp\left[-\frac{t}{\tau}\right]\right)$$

⇒ $T'(t) \to KM$ for $t \gg 0$ ⇒ $\hat K \approx$ ________

and

$\dfrac{T'(t)}{M} = 0.632K$ for $t = \tau$ ⇒ $\hat\tau \approx$ ________

[Plot: response of T′ to a step change in T_c′ vs. t]
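The two graphical readings above are easy to automate. A minimal Matlab sketch, assuming column vectors t and dT (samples of T′(t)) and the step magnitude M are in the workspace; averaging the last few samples and interpolating for the 63.2% crossing are illustrative choices, and interp1 used this way assumes the response rises monotonically, as a noise-free first-order step response does:

Khat = mean(dT(end-2:end))/M;              % gain from long-term asymptote
tauhat = interp1(dT, t, 0.632*Khat*M);     % time at 63.2% of the final value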


EXAMPLE 16 – TRANSFER MATRIX OF DISTILLATION COLUMN (EXAMPLE 7)

Response of $y_D, x_B$ to unit-step change in each of $L, V$.

Approximate the transfer functions by

$$G_{ij}(s) = \frac{K_{ij}\, e^{-\theta_{ij} s}}{\tau_{ij} s + 1}, \qquad i, j \in \{1, 2\}$$

$\{\hat K_{11}, \hat K_{12}, \hat K_{21}, \hat K_{22}\} \approx$ _____________________

$\{\hat\tau_{11}, \hat\tau_{12}, \hat\tau_{21}, \hat\tau_{22}\} \approx$ _____________________

$\{\hat\theta_{11}, \hat\theta_{12}, \hat\theta_{21}, \hat\theta_{22}\} \approx$ _____________________

[Plots: step responses L → yD, V → yD, L → xB, V → xB over t = 0 to 100]


EXAMPLE 17 – BLACK BODY RADIATION (EXAMPLE 8)

Wien eqn. (1896): $\text{Intensity} = \dfrac{2hc^2}{\lambda^5}\exp\left[-\dfrac{hc}{\lambda k T}\right]$

Rayleigh-Jeans eqn. (ca. 1900): $\text{Intensity} = \dfrac{2ckT}{\lambda^4}$ (ultraviolet catastrophe)

Planck eqn. (1900): $\text{Intensity} = \dfrac{2hc^2}{\lambda^5}\dfrac{1}{\exp\left[\dfrac{hc}{\lambda k T}\right] - 1}$

Figure 10. Intensity of black-body radiation as a function of wavelength at 3000, 4000, and 5000 K. Solid line: Planck eqn. Dotted line: Rayleigh-Jeans eqn. Dashed line: Wien eqn.

HWNTHI: When are Wien’s and Rayleigh-Jeans’ equations good approximations of Planck’s equation?

EXAMPLE 18 – THE ORBITAL PRECESSION OF MERCURY (EXAMPLE 9)

Mercury's orbital precession can be fully explained by Einstein's equations of general relativity.

- Simple methods for estimation may not handle noise very well.
- The right experiments may reveal information much better than the wrong experiments.
- The right experiments or the right equations may be far from obvious, even if sophisticated methods are used.



4. QUICK OVERVIEW OF REGRESSION (LEAST-SQUARES) BASICS

4.1 Least-squares methods

- Rely on a parametric model whose parameters are estimated via regression (least squares).
- Provide good accuracy if used appropriately.
- May require specialized software, but today's off-the-shelf software is usually adequate.
- Regression is linear (hence easy) if model parameters appear linearly, e.g.

$$y = \theta_1 f_1(x) + \dots + \theta_p f_p(x) + \text{noise} \qquad (6)$$

- Regression is nonlinear (hence less easy) if model parameters appear nonlinearly, e.g.

$$y = g(x, \theta_1, \dots, \theta_p) + \text{noise} \qquad (7)$$

- Note: The input-output model relationship in linear regression (effect of $x$ on $y$) may be nonlinear.

EXAMPLE 19 – LINEAR AND NONLINEAR REGRESSION

Linear regression: $y = \theta_1 + \theta_2 u + \theta_3 u^2 + \theta_4 e^u + \text{noise}$ (Note: Effect of $u$ on $y$ is nonlinear.)

Nonlinear regression: $y = \theta_1 e^{\theta_2 u} + \text{noise}$

Least-squares methods select model parameter values that minimize the sum of the squared errors between model outputs (at given input values) and measured dependent variables.

Data from $n$ experiments: $\{x_1, \dots, x_n\}$, $\{y_1, \dots, y_n\}$

Error for experimental data point $i$:

$$e_i \triangleq y_i - g(x_i, \theta_1, \dots, \theta_p) \qquad (8)$$

Square(d) error for experimental data point $i$:

$$e_i^2 = [y_i - g(x_i, \theta_1, \dots, \theta_p)]^2 \qquad (9)$$

Sum of square(d) errors (SSE) for experimental data points $i = 1, \dots, n$:

$$\sum_{i=1}^n e_i^2 = \sum_{i=1}^n [y_i - g(x_i, \theta_1, \dots, \theta_p)]^2 \qquad (10)$$

Minimum of sum of square(d) errors (SSE):

$$\min_{\theta_1, \dots, \theta_p} \sum_{i=1}^n e_i^2 = \min_{\theta_1, \dots, \theta_p} \sum_{i=1}^n [y_i - g(x_i, \theta_1, \dots, \theta_p)]^2 \qquad (11)$$

Least-squares estimate of $\theta_1, \dots, \theta_p$:

$$\{\hat\theta_1, \dots, \hat\theta_p\} = \arg\min_{\theta_1, \dots, \theta_p} \sum_{i=1}^n e_i^2 = \arg\min_{\theta_1, \dots, \theta_p} \sum_{i=1}^n [y_i - g(x_i, \theta_1, \dots, \theta_p)]^2 \qquad (12)$$


4.2 Linear regression

4.2.1 Straight line fit

Model structure (straight line):

$$y = ax + b + \text{noise} = \underbrace{[x \;\; 1]}_{\boldsymbol\varphi^T} \underbrace{\begin{bmatrix} a \\ b \end{bmatrix}}_{\boldsymbol\theta} + \text{noise} \triangleq \boldsymbol\varphi^T\boldsymbol\theta + \text{noise} \qquad (13)$$

Collected data (e.g., Figure 11):

$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \qquad (14)$$

(Note: The values of $\mathbf{x}$ are assumed to be known exactly, whereas $\mathbf{y}$ values are noisy. Namely, the experiment can be performed again at the exact same values $\mathbf{x}$, but the outcome $\mathbf{y}$ will not necessarily be the same, due to random noise.)

Figure 11 – Example of regression on a straight line.

Estimate parameters via matrix algebra.

- Express the error vector $\mathbf{e}$ as

$$\mathbf{e} = \begin{bmatrix} y_1 - (ax_1 + b) \\ \vdots \\ y_n - (ax_n + b) \end{bmatrix} = \underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}}_{\mathbf{y}} - \underbrace{\begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} a \\ b \end{bmatrix}}_{\boldsymbol\theta} = \mathbf{y} - X\boldsymbol\theta \qquad (15)$$

- Objective: Minimize sum of squared errors (SSE) with respect to model parameters (Why?⁶):

$$\min_{a,b} J(a,b) = \min_{a,b} \sum_{k=1}^n [y_k - (ax_k + b)]^2 = \min_{a,b} \sum_{k=1}^n e_k^2 = \min_{\boldsymbol\theta} \|\mathbf{e}\|_2^2 = \min_{\boldsymbol\theta} \mathbf{e}^T\mathbf{e} = \min_{\boldsymbol\theta} (\mathbf{y} - X\boldsymbol\theta)^T(\mathbf{y} - X\boldsymbol\theta) \qquad (16)$$

⁶ Stephen M. Stigler, "Gauss and the Invention of Least Squares", The Annals of Statistics, Vol. 9, No. 3 (May, 1981), 465-474.



- Optimal parameter estimate (minimizing SSE) is $\hat{\boldsymbol\theta}$ satisfying the set of linear equations:

$$(X^T X)\hat{\boldsymbol\theta} = X^T\mathbf{y} \qquad (17)$$

(Note: Eqn. (17) can be solved by Gaussian elimination or special methods. $\hat{\boldsymbol\theta}$ can also be written as

$$\hat{\boldsymbol\theta} = \underbrace{(X^T X)^{-1} X^T}_{\text{pseudo-inverse of } X}\mathbf{y} \qquad (18)$$

but is hardly ever computed as such. Rather, it is computed as solution of eqn. (17), or by direct optimization methods used on eqn. (16).)

Explicitly: Eqn. (15) ⇒

$$X^T X = \begin{bmatrix} x_1 & \cdots & x_n \\ 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} = \begin{bmatrix} \sum_k x_k^2 & \sum_k x_k \\ \sum_k x_k & n \end{bmatrix} \qquad (19)$$

$$X^T\mathbf{y} = \begin{bmatrix} x_1 & \cdots & x_n \\ 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \sum_k x_k y_k \\ \sum_k y_k \end{bmatrix} \qquad (20)$$

$$\begin{bmatrix} \hat a \\ \hat b \end{bmatrix} = (X^T X)^{-1} X^T\mathbf{y} = \begin{bmatrix} \sum_k x_k^2 & \sum_k x_k \\ \sum_k x_k & n \end{bmatrix}^{-1} \begin{bmatrix} \sum_k x_k y_k \\ \sum_k y_k \end{bmatrix}$$

$$\Rightarrow \quad \hat a = \frac{n\sum_k x_k y_k - \sum_k x_k \sum_k y_k}{n\sum_k x_k^2 - \left(\sum_k x_k\right)^2} = \frac{\sum_k (x_k - \bar x)(y_k - \bar y)}{\sum_k (x_k - \bar x)^2}, \qquad \hat b = \bar y - \hat a \bar x \qquad (21)$$
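A minimal Matlab sketch of eqns. (17)-(21), with a small made-up data set purely for illustration; note that the backslash operator solves the least-squares problem directly (via an orthogonal factorization) without ever forming the pseudo-inverse:

x = [1 2 3 4 5]'; y = [2.1 3.9 6.2 8.1 9.8]';   % hypothetical data
X = [x ones(size(x))];                          % cf. eqn. (15)
theta = (X'*X)\(X'*y);                          % eqn. (17), solved by Gaussian elimination
theta_qr = X\y;                                 % same answer, computed more stably
a = theta(1); b = theta(2);                     % cf. eqn. (21)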

HWNTHI:

- Why is it called "Regression"? (Hint: Do a web search with keywords Francis Galton, regression.)
- Who discovered regression? (Hint: Do a web search with keywords Gauss, Legendre, Ceres, least squares.)
- Why isn't the pseudo-inverse of $X$ simplified as $(X^T X)^{-1} X^T = X^{-1}(X^T)^{-1}X^T = X^{-1}$?


EXAMPLE 20 – LINEAR EFFECT OF RELATIVE HUMIDITY ON SOLVENT EVAPORATION

Examine the relationship between humidity ($X$) and extent of solvent evaporation ($Y$) in water-reducible paints during sprayout. (Data from Journal of Coating Technology, 65, 1983, via Milton and Arnold.)

$n = 25$, $\sum_{i=1}^n x_i = 1314.90$, $\sum_{i=1}^n y_i = 235.70$,

$\sum_{i=1}^n x_i^2 = 76308.53$, $\sum_{i=1}^n y_i^2 = 2286.07$, $\sum_{i=1}^n x_i y_i = 11824.44$

$$\hat y = -0.08x + 13.64$$

Figure 12 – Straight-line fit of solvent evaporation data

Figure 13 – 3D and contour plots of the quadratic form (SSE)

$$J(a,b) = \sum_{i=1}^n [y_i - (ax_i + b)]^2 = (\mathbf{y} - X\boldsymbol\theta)^T(\mathbf{y} - X\boldsymbol\theta) = \boldsymbol\theta^T X^T X \boldsymbol\theta - 2\boldsymbol\theta^T X^T\mathbf{y} + \mathbf{y}^T\mathbf{y}$$


Observation i | Relative humidity x_i (%) | Solvent evaporation y_i (% wt)
 1    35.3    11
 2    29.7    11.1
 3    30.8    12.5
 4    58.8    8.4
 5    61.4    9.3
 6    71.3    8.7
 7    74.4    6.4
 8    76.7    8.5
 9    70.7    7.8
10    57.5    9.1
11    46.4    8.2
12    28.9    12.2
13    28.1    11.9
14    39.1    9.6
15    46.8    10.9
16    48.5    9.6
17    59.3    10.1
18    70      8.1
19    70      6.8
20    74.4    8.9
21    72.1    7.7
22    58.1    8.5
23    44.6    8.9
24    33.4    10.4
25    28.6    11.1
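A quick Matlab check of the fit, with the x and y vectors typed in from the table above:

x = [35.3 29.7 30.8 58.8 61.4 71.3 74.4 76.7 70.7 57.5 ...
     46.4 28.9 28.1 39.1 46.8 48.5 59.3 70 70 74.4 ...
     72.1 58.1 44.6 33.4 28.6]';
y = [11 11.1 12.5 8.4 9.3 8.7 6.4 8.5 7.8 9.1 ...
     8.2 12.2 11.9 9.6 10.9 9.6 10.1 8.1 6.8 8.9 ...
     7.7 8.5 8.9 10.4 11.1]';
theta = [x ones(25,1)]\y         % returns approximately [-0.0801; 13.64]

As a sanity check, sum(x), sum(y), sum(x.^2), sum(y.^2), and sum(x.*y) reproduce the sums quoted above.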


4.2.2 Some questions raised in EXAMPLE 20

- How many digits of the estimates $\hat b = 13.64$ or $\hat a = -0.08$ are known with confidence?
- How does the answer to the above question change depending on the confidence level?
- What bounds can be established for $b$ or $a$ (i.e. $b = 13.64 \pm ?$ or $a = -0.08 \pm ?$)?
- If I have found that $b = 13.64 \pm \Delta b$ and $a = -0.08 \pm \Delta a$, each with a certain confidence, can I conclude that $a, b$ lie in the rectangle $(13.64 - \Delta b,\, 13.64 + \Delta b) \times (-0.08 - \Delta a,\, -0.08 + \Delta a)$ with the same confidence?
- The estimate $\hat a = -0.08$ is suspiciously close to zero. Is $a = 0$, i.e. is there no effect of $X$ on $Y$? How confident can I be about the answer to the above question?
- How well does the above equation $\hat y = -0.08x + 13.64$ fit the data?
- Is the straight-line assumption reasonable?
- Today's relative humidity is 50% (a nice day in Houston). What solvent evaporation should I expect? ...plus/minus what?
- For days with relative humidity of 50%, what solvent evaporation should I expect on the average? ...plus/minus what?


4.2.3 Basis for answers to preceding questions (in the sequel)

For a particular experiment, the resulting estimate $\hat{\boldsymbol\theta}$ is a particular value of a random variable $\hat{\boldsymbol\Theta}$ ($\hat{\boldsymbol\Theta}$ is called an estimator of $\boldsymbol\theta$, where $\boldsymbol\theta$ is the "true" value). If the experiment is repeated at exactly the same points $\{x_1, \dots, x_n\}$, the values of $\{y_1, \dots, y_n\}$ will generally be different, due to noise, and, as a result, the value of $\hat{\boldsymbol\theta}$ will also be different. Therefore, the values of $\hat{\boldsymbol\theta}$ correspond to values taken by a random variable $\hat{\boldsymbol\Theta}$ that follows a probability distribution with a certain average and spread, quantified by the expected value ($E[\hat{\boldsymbol\Theta}]$) and covariance ($\text{Cov}(\hat{\boldsymbol\Theta})$) respectively. It can be shown that if the model structure is correct and noise is white with zero average (i.e. there are no systematic errors and measurement errors are similar and independent of each other), then

$$E[\hat{\boldsymbol\Theta}] = \boldsymbol\theta \qquad (22)$$

$$\underbrace{\text{Cov}[\hat{\boldsymbol\Theta}] \triangleq E[(\hat{\boldsymbol\Theta} - \boldsymbol\theta)(\hat{\boldsymbol\Theta} - \boldsymbol\theta)^T]}_{\text{uncertainty of parameter estimates}} = \underbrace{\sigma^2_{\text{noise}}}_{\text{effect of measurement noise}}\underbrace{(X^T X)^{-1}}_{\text{effect of experimental conditions}} \qquad (23)$$

While the precise meaning of eqns. (22) and (23) will be rigorously defined in the sequel, some intuition can already be developed:

- Eqn. (22) suggests that on the average, the estimate $\hat{\boldsymbol\theta}$ is at the true value (the estimator is unbiased).
- Eqn. (23) suggests that the uncertainty in the estimate of $\boldsymbol\theta$ (think of it as the margin above and below the estimate $\hat{\boldsymbol\theta}$) can be calculated. This uncertainty can be made small in two ways:
  o Making the magnitude of measurement noise, $\sigma^2_{\text{noise}}$, small, by selecting a good measurement method and instrument, and
  o Making $(X^T X)^{-1}$ "small".⁷ This can be accomplished by
    - Collecting more experimental data to make $X^T X$ "large", and
    - Designing experimental conditions such that $(X^T X)^{-1}$ is as "small" as possible.

This is the basis for the design of informative experiments.

7 Why the quotation marks on small?
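A numerical experiment makes eqns. (22) and (23) concrete. A minimal Matlab sketch, with made-up true parameters and noise level, that repeats the straight-line experiment many times at fixed x and compares the sample statistics of the estimates against eqns. (22)-(23):

x = (1:10)'; X = [x ones(10,1)];
theta_true = [2; 1]; sigma = 0.5;           % hypothetical truth and noise level
N = 10000; est = zeros(2,N);
for k = 1:N
    y = X*theta_true + sigma*randn(10,1);   % white noise with zero average
    est(:,k) = X\y;                         % least-squares estimate
end
mean(est,2)                                 % approximately theta_true, cf. eqn. (22)
cov(est')                                   % approximately the matrix below, cf. eqn. (23)
sigma^2*inv(X'*X)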


4.2.4 Multiple linear regression

Model structure:

$$y = \theta_0 + \theta_1 f_1(x_1, \dots, x_m) + \dots + \theta_k f_k(x_1, \dots, x_m) + \text{noise} \qquad (24)$$

Estimate parameters via matrix algebra.

- Express the error vector $\mathbf{e}$ as

$$\mathbf{e} = \underbrace{\begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}}_{\mathbf{y}} - \underbrace{\begin{bmatrix} 1 & f_1(x_1^{(1)}, \dots, x_m^{(1)}) & \cdots & f_k(x_1^{(1)}, \dots, x_m^{(1)}) \\ \vdots & \vdots & & \vdots \\ 1 & f_1(x_1^{(n)}, \dots, x_m^{(n)}) & \cdots & f_k(x_1^{(n)}, \dots, x_m^{(n)}) \end{bmatrix}}_{X}\underbrace{\begin{bmatrix} \theta_0 \\ \vdots \\ \theta_k \end{bmatrix}}_{\boldsymbol\theta} \qquad (25)$$

- Objective: Minimize sum of squared errors (SSE) with respect to model parameters (Why?⁸):

$$\min_{\theta_0, \dots, \theta_k} J(\theta_0, \dots, \theta_k) = \min_{\theta_0, \dots, \theta_k} \sum_{i=1}^n \left[y^{(i)} - \left(\theta_0 + \theta_1 f_1(x_1^{(i)}, \dots, x_m^{(i)}) + \dots + \theta_k f_k(x_1^{(i)}, \dots, x_m^{(i)})\right)\right]^2 = \min_{\boldsymbol\theta}\|\mathbf{e}\|_2^2 = \min_{\boldsymbol\theta}(\mathbf{y} - X\boldsymbol\theta)^T(\mathbf{y} - X\boldsymbol\theta) \qquad (26)$$

- Optimal parameter estimate (minimizing SSE) is $\hat{\boldsymbol\theta}$ satisfying the set of linear equations:

$$X^T X\hat{\boldsymbol\theta} = X^T\mathbf{y} \qquad (27)$$

⁸ Stephen M. Stigler, "Gauss and the Invention of Least Squares", The Annals of Statistics, Vol. 9, No. 3 (May, 1981), 465-474.


EXAMPLE 21 – THE VAPOR PRESSURE OF ETHANE

Experimental data are available⁹ for ethane vapor pressure, $P$, as a function of temperature, $T$.

i    T_i (K)    P_i (Pa)

1 92 1.7

2 94 2.8

3 96 4.6

4 98 7.2

5 100 11

6 102 17

7 104 25

8 106 37

9 108 53

10 110 75

11 112 100

12 114 140

13 116 200

14 118 270

15 120 350

16 122 470

17 124 610

18 126 790

19 128 1000

20 130 1300

21 132 1600

22 134 2000

23 136 2500

24 138 3100

25 140 3800

26 142 4700

27 144 5600

28 146 6800

29 148 8100

30 150 9700

31 152 11000

32 154 13000

33 156 16000

34 158 18000

35 160 21000

⁹ D. G. Friend, H. Ingham, J. F. Ely, (1991). "Thermophysical Properties of Ethane", Journal of Physical and Chemical Reference Data, 20(2), pp. 275-347.


36 162 25000

37 164 29000

38 166 33000

39 168 38000

40 170 43000

41 172 49000

42 174 55000

43 176 62000

44 178 70000

45 180 79000

46 182 88000

47 184 98000

48 186 109000

49 188 122000

50 190 135000

51 192 149000

52 194 164000

53 196 181000

54 198 198000

55 200 217000

56 202 238000

57 204 260000

58 206 283000

59 208 308000

60 210 334000

61 212 362000

62 214 392000

63 216 423000

64 218 457000

65 220 492000

66 222 530000

67 224 569000

68 226 611000

69 228 654000

70 230 700000

71 232 749000

72 234 800000

73 236 853000

74 238 909000

75 240 967000

76 242 1028000

77 244 1092000


78 246 1159000

79 248 1229000

80 250 1301000

81 252 1377000

82 254 1456000

83 256 1538000

84 258 1623000

85 260 1712000

86 262 1804000

87 264 1900000

88 266 1999000

89 268 2103000

90 270 2210000

91 272 2321000

92 274 2436000

93 276 2555000

94 278 2678000

95 280 2806000

96 282 2938000

97 284 3075000

98 286 3216000

99 288 3363000

100 290 3514000

101 292 3671000

102 294 3834000

103 296 4002000

104 298 4176000

105 300 4356000

106 302 4543000

107 304 4738000

It is appropriate to model these data using the Riedel equation¹⁰

$$\ln P = A + \frac{B}{T} + C\ln T + D T^2$$

¹⁰ Vetere, A., (1991). "The Riedel equation", Ind. Eng. Chem. Res., Vol. 30, No. 11, pp. 2487-2492.


Estimate parameters via matrix algebra.

- Express the error vector $\mathbf{e}$ as

$$\mathbf{e} = \underbrace{\begin{bmatrix} \ln P_1 \\ \vdots \\ \ln P_{107} \end{bmatrix}}_{\mathbf{y}} - \underbrace{\begin{bmatrix} 1 & \frac{1}{T_1} & \ln T_1 & T_1^2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \frac{1}{T_{107}} & \ln T_{107} & T_{107}^2 \end{bmatrix}}_{X}\underbrace{\begin{bmatrix} A \\ B \\ C \\ D \end{bmatrix}}_{\boldsymbol\theta}$$

- Objective: Minimize sum of squared errors (SSE) with respect to model parameters:

$$\min_{\boldsymbol\theta}\sum_{i=1}^{107} e_i^2 = \min_{\boldsymbol\theta}\|\mathbf{e}\|_2^2 = \min_{\boldsymbol\theta}(\mathbf{y} - X\boldsymbol\theta)^T(\mathbf{y} - X\boldsymbol\theta)$$

- Optimal parameter estimate (minimizing SSE) is $\hat{\boldsymbol\theta}$ satisfying the set of linear equations:

$$X^T X\hat{\boldsymbol\theta} = X^T\mathbf{y}$$

$$\Rightarrow \quad \ln\hat P = 51.86 - \frac{2599}{T} - 5.128\ln T + 0.00001490\,T^2$$

(i.e. $\hat A = 51.86$, $\hat B = -2599$, $\hat C = -5.128$, $\hat D = 1.490\times 10^{-5}$)
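A minimal Matlab sketch of this fit, assuming column vectors T (K) and P (Pa) holding the 107 tabulated values are already in the workspace:

X = [ones(size(T)) 1./T log(T) T.^2];   % regressors of the Riedel equation
theta = X\log(P);                       % [A; B; C; D], cf. the estimates above
resid = log(P) - X*theta;               % residuals, cf. Figure 14 (right)

The regression is linear even though ln P is a strongly nonlinear function of T, because the parameters A, B, C, D enter linearly.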

Figure 14 – Curve-fitting example through linear regression on a curve (left) and residuals (right).



4.3 Nonlinear regression

Model structure:

$$y = g(x, \theta_1, \dots, \theta_p) + \text{noise} \qquad (28)$$

Collected data (Figure 15):

$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \qquad (29)$$

(Note: The values of $\mathbf{x}$ are assumed to be known exactly, whereas $\mathbf{y}$ values are noisy.)

Figure 15 – Curve-fitting example through nonlinear regression on a curve

Estimate parameters via numerical optimization.

- Express the error vector $\mathbf{e}$ as

$$\mathbf{e} = \begin{bmatrix} y_1 - g(x_1, \theta_1, \dots, \theta_p) \\ \vdots \\ y_n - g(x_n, \theta_1, \dots, \theta_p) \end{bmatrix} \triangleq \mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta) \qquad (30)$$

- Objective: Minimize SSE with respect to model parameters:

$$\min_{\theta_1, \dots, \theta_p} J(\theta_1, \dots, \theta_p) = \min_{\theta_1, \dots, \theta_p}\sum_{k=1}^n [y_k - g(x_k, \theta_1, \dots, \theta_p)]^2 = \min_{\boldsymbol\theta}\|\mathbf{e}\|_2^2 = \min_{\boldsymbol\theta}(\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta))^T(\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta)) \qquad (31)$$

$$\hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta}\|\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta)\|_2^2 = \arg\min_{\boldsymbol\theta}(\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta))^T(\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta)) \qquad (32)$$

Note: Eqn. (31) poses a nonlinear regression problem because the parameters $\theta_1, \dots, \theta_p$ appear nonlinearly in $g(x_i, \theta_1, \dots, \theta_p)$, not because $y$ is a nonlinear function of $x$.

Solution of the minimization in eqn. (31) via one of various numerical optimization methods, e.g. Gauss-Newton, Newton-Raphson, Levenberg-Marquardt, conjugate-gradient, and others. Options are available in Matlab, Mathematica, Excel, ...



EXAMPLE 22 – CURVE FITTING USING NONLINEAR REGRESSION

Data for unit-step response of temperature $y \triangleq T'$ vs. time $x \triangleq t$ for the heater in EXAMPLE 6 and EXAMPLE 15:

x: 0.12  0.20  0.62  1.22  1.32  1.64  1.67  2.13  2.68  2.93
y: 0.24  0.29  0.96  1.38  1.46  1.69  1.61  1.79  1.92  1.84     (33)

Model structure (unit-step response of first-order system, cf. eqn. (4)):

$$y = K(1 - \exp[-x/\tau]) + \text{noise} \triangleq g(x, K, \tau) + \text{noise} \triangleq g(x, \boldsymbol\theta) + \text{noise} \qquad (34)$$

Objective: Minimize the sum of squared errors (SSE):

$$\min_{K,\tau} J(K, \tau) = \min_{K,\tau}\sum_{k=1}^n [y_k - g(x_k, K, \tau)]^2 = \min_{\boldsymbol\theta}\|\mathbf{e}\|_2^2 = \min_{\boldsymbol\theta}(\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta))^T(\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta)) \qquad (35)$$

where

$$\mathbf{g}(\mathbf{x}, \boldsymbol\theta) \triangleq \begin{bmatrix} K(1 - \exp[-x_1/\tau]) \\ \vdots \\ K(1 - \exp[-x_n/\tau]) \end{bmatrix}$$


Figure 16 – 3D (left) and contour (right) plots of $J(K, \tau) = \sum_{k=1}^n \left[y_k - K(1 - \exp[-x_k/\tau])\right]^2$

The objective function to minimize, i.e. the sum of squared errors $J(K, \tau)$, is approximately quadratic near its optimum.


Find $\hat K, \hat\tau$ numerically, e.g. using Excel Solver:

K =   2.01589699416999
tau = 1.00638304793785

x     y     g(x,K,tau)              y-g(x,K,tau)
0.12  0.24  =K*(1-EXP(-B7/tau))     =C7-D7
0.2   0.29  =K*(1-EXP(-B8/tau))     =C8-D8
0.62  0.96  =K*(1-EXP(-B9/tau))     =C9-D9
1.22  1.38  =K*(1-EXP(-B10/tau))    =C10-D10
1.32  1.46  =K*(1-EXP(-B11/tau))    =C11-D11
1.64  1.69  =K*(1-EXP(-B12/tau))    =C12-D12
1.67  1.61  =K*(1-EXP(-B13/tau))    =C13-D13
2.13  1.79  =K*(1-EXP(-B14/tau))    =C14-D14
2.68  1.92  =K*(1-EXP(-B15/tau))    =C15-D15
2.93  1.84  =K*(1-EXP(-B16/tau))    =C16-D16

SSE = =SUMSQ(E7:E16)
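The same fit in Matlab, as a minimal sketch using the derivative-free simplex search fminsearch (the initial guess [1; 1] is an arbitrary choice):

x = [0.12 0.20 0.62 1.22 1.32 1.64 1.67 2.13 2.68 2.93]';
y = [0.24 0.29 0.96 1.38 1.46 1.69 1.61 1.79 1.92 1.84]';
sse = @(th) sum((y - th(1)*(1 - exp(-x/th(2)))).^2);   % J(K,tau) of eqn. (35)
theta = fminsearch(sse, [1; 1])    % returns approximately [2.016; 1.006]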



Estimates $\hat K, \hat\tau$ depend on the initial guess. E.g.,

o For initial guess $K = 5$, $\tau = 7$, Excel returns
K =   -320647439.223494
tau = -419888391.447476

o For initial guess $K = 2$, $\tau = 2$, Excel returns
K =   -61933209.8241383
tau = -107374184.4

Which of the above estimates is preferred? What initial guess should one use? How does one know the estimates shown correspond to the global minimum of the SSE?



Nonlinear regression problems can range from very easy to very difficult to solve numerically, because the function

$$J(\theta_1, \dots, \theta_p) \triangleq \sum_{i=1}^n [y_i - g(x_i, \theta_1, \dots, \theta_p)]^2$$

may not be well behaved, e.g. may have several local minima or singularities.

But prior knowledge may help narrow the anticipated range of $\{\hat\theta_1, \dots, \hat\theta_p\}$.

Near the optimum $\{\hat\theta_1, \dots, \hat\theta_p\}$, the objective function

$$\sum_{i=1}^n [y_i - g(x_i, \theta_1, \dots, \theta_p)]^2 = (\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta))^T(\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta))$$

is fairly well behaved, since it is approximately convex quadratic (Figure 16), because

$$\mathbf{e} = \mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta) \approx \mathbf{y} - \mathbf{g}(\mathbf{x}, \hat{\boldsymbol\theta}) - \underbrace{\frac{\partial\mathbf{g}(\mathbf{x}, \boldsymbol\theta)}{\partial\boldsymbol\theta^T}\bigg|_{\hat{\boldsymbol\theta}}}_{X}(\boldsymbol\theta - \hat{\boldsymbol\theta}) \triangleq \mathbf{w} - X\boldsymbol\delta \qquad (36)$$

where

$$\frac{\partial\mathbf{g}(\mathbf{x}, \boldsymbol\theta)}{\partial\boldsymbol\theta^T} \triangleq \begin{bmatrix} \frac{\partial g(x_1, \boldsymbol\theta)}{\partial\theta_1} & \cdots & \frac{\partial g(x_1, \boldsymbol\theta)}{\partial\theta_p} \\ \vdots & & \vdots \\ \frac{\partial g(x_n, \boldsymbol\theta)}{\partial\theta_1} & \cdots & \frac{\partial g(x_n, \boldsymbol\theta)}{\partial\theta_p} \end{bmatrix} \qquad (37)$$

$$\Rightarrow (\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta))^T(\mathbf{y} - \mathbf{g}(\mathbf{x}, \boldsymbol\theta)) \approx (\mathbf{w} - X\boldsymbol\delta)^T(\mathbf{w} - X\boldsymbol\delta) \qquad (38)$$

Bonus: Eqns. (38), (35), and (23) ⇒ the uncertainty of $\hat{\boldsymbol\theta}$ is approximately given by

$$\underbrace{\text{Cov}[\hat{\boldsymbol\Theta}] \triangleq E[(\hat{\boldsymbol\Theta} - \boldsymbol\theta)(\hat{\boldsymbol\Theta} - \boldsymbol\theta)^T]}_{\text{uncertainty of parameter estimates}} \approx \underbrace{\sigma^2_{\text{noise}}}_{\text{effect of measurement noise}}\underbrace{(X^T X)^{-1}}_{\text{effect of experimental conditions}} \qquad (39)$$

HWNTHI: What is the difference between eqns. (39) and (23)?
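A minimal Matlab sketch of eqn. (39) for EXAMPLE 22, assuming x, y, and the converged estimates K and tau are in the workspace; estimating the noise variance from the residuals with n − 2 degrees of freedom is a choice treated rigorously later in these notes:

X = [1-exp(-x/tau), -K*x.*exp(-x/tau)/tau^2];   % Jacobian of eqn. (37) at the optimum
r = y - K*(1-exp(-x/tau));                      % residuals
s2 = (r'*r)/(length(y)-2);                      % rough noise-variance estimate
CovTheta = s2*inv(X'*X)                         % approximate Cov of [K; tau], eqn. (39)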


4.3.1 The Gauss-Newton method

Eqn. (36) suggests a simple iterative method for finding the minimum in eqn. (31) (Gauss-Newton algorithm):

1. Let $j = 1$ and guess $\hat{\boldsymbol\theta}_j$.

2. Linearize $\mathbf{g}(\mathbf{x}, \boldsymbol\theta)$ around the vector $\hat{\boldsymbol\theta}_j$:

$$\mathbf{g}(\mathbf{x}, \boldsymbol\theta) \approx \mathbf{g}(\mathbf{x}, \hat{\boldsymbol\theta}_j) + \underbrace{\frac{\partial\mathbf{g}(\mathbf{x}, \boldsymbol\theta)}{\partial\boldsymbol\theta^T}\bigg|_{\hat{\boldsymbol\theta}_j}}_{X_j}(\boldsymbol\theta - \hat{\boldsymbol\theta}_j)$$

3. Find $\hat{\boldsymbol\theta}_{j+1}$ recursively:

$$\hat{\boldsymbol\theta}_{j+1} = \arg\min_{\boldsymbol\theta}\|\mathbf{y} - \mathbf{g}(\mathbf{x}, \hat{\boldsymbol\theta}_j) - X_j(\boldsymbol\theta - \hat{\boldsymbol\theta}_j)\|_2^2 = \underbrace{\hat{\boldsymbol\theta}_j}_{\text{previous guess}} + \underbrace{(X_j^T X_j)^{-1} X_j^T(\mathbf{y} - \mathbf{g}(\mathbf{x}, \hat{\boldsymbol\theta}_j))}_{\text{correction at } j\text{-th iteration}}$$

4. If $\|\hat{\boldsymbol\theta}_{j+1} - \hat{\boldsymbol\theta}_j\| \leq \text{tolerance}\cdot\|\hat{\boldsymbol\theta}_{j+1}\|$, stop. Else set $j \leftarrow j + 1$ and go to step 2.

To avoid convergence to a locally rather than globally optimum estimate of $\boldsymbol\theta$, or non-convergence at all, apply the above procedure for several initial guesses.

Nonlinear regression problems can range from very easy to very difficult to solve numerically!


EXAMPLE 23 – CURVE FITTING USING NONLINEAR REGRESSION WITH GAUSS-NEWTON

Continuing EXAMPLE 22 – Curve Fitting Using Nonlinear Regression.

Apply Gauss-Newton with

$$\mathbf{g}(\mathbf{x}, \hat{\boldsymbol\theta}_j) = \begin{bmatrix} \hat K_j(1 - e^{-x_1/\hat\tau_j}) \\ \vdots \\ \hat K_j(1 - e^{-x_n/\hat\tau_j}) \end{bmatrix}, \qquad X_j = \begin{bmatrix} 1 - e^{-x_1/\hat\tau_j} & -\hat K_j\frac{x_1}{\hat\tau_j^2}e^{-x_1/\hat\tau_j} \\ \vdots & \vdots \\ 1 - e^{-x_n/\hat\tau_j} & -\hat K_j\frac{x_n}{\hat\tau_j^2}e^{-x_n/\hat\tau_j} \end{bmatrix}, \qquad \hat{\boldsymbol\theta}_j = \begin{bmatrix} \hat K_j \\ \hat\tau_j \end{bmatrix}$$

Results vary depending on initial guess:

Figure 17 – Gauss-Newton on EXAMPLE 23. Note convergence (left) and divergence (right) depending on initial guess.


Matlab code for EXAMPLE 23.

clear x y theta
% Data (same pairs as eqn. (33), unrounded and in shuffled order)
y = [0.956808, 1.69298, 0.244152, 1.79394, 1.84449, ...
     1.4625, 1.92179, 1.3773, 0.285332, 1.61321]';
x = [0.617823, 1.64468, 0.124142, 2.12837, 2.92554, ...
     1.3211, 2.67768, 1.21541, 0.199516, 1.66978]';
p = size(x,1);
tol = 0.001;
relerr = 2*tol;
% Initialization
theta(:,1) = rand(2,1);    % random initial guess [K; tau]
k = 1;
% Gauss-Newton iterations
while (relerr > tol) && (k < 100)
    % Model prediction g(x,theta_k) and Jacobian X_k = [dg/dK, dg/dtau]
    fxtheta_k = theta(1,k)*(ones(p,1) - exp(-x/theta(2,k)));
    X_k = [ones(p,1)-exp(-x/theta(2,k)), ...
           -theta(1,k)*x.*exp(-x/theta(2,k))/theta(2,k)^2];
    % Gauss-Newton update (step 3 above)
    theta(:,k+1) = theta(:,k) + (X_k'*X_k)\(X_k'*(y - fxtheta_k));
    relerr = norm(theta(:,k+1)-theta(:,k))/norm(theta(:,k+1));
    k = k+1;
end
% Plot results
figure(501)
plot(theta','-o')
legend('K','tau')
xlabel('iteration, k')


5. BACKGROUND ON PROBABILITY DISTRIBUTIONS¹¹

Definition 1 – Random variable

A variable that takes values by chance.

What is chance?

A random variable is usually denoted by a capital letter, e.g. $X, Y$. A value taken by a random variable is usually denoted by a lowercase letter, e.g. $x, y$.

EXAMPLE 24 – RANDOM VARIABLES

- Time taken by students to complete an exam.
- Number of defective wafers in a semiconductor manufacturing line.
- Rainy days in August.
- Measurement error in a chemical analyzer.

Definition 2 – Discrete random variable

A random variable that can assume values from a countable set, e.g. integers, rationals…

HWNTHI: Who was Georg Cantor?

Definition 3 – Continuous random variable

A random variable that can take values in a not necessarily bounded interval of the real numbers.

EXAMPLE 25 – DISCRETE AND CONTINUOUS RANDOM VARIABLES

Classify the variables in EXAMPLE 24.

So, how can a random variable take values by chance?

¹¹ Covered in any introductory textbook, e.g.
- Montgomery, Runger: Applied Statistics and Probability for Engineers, Wiley.
- Walpole, Myers, Myers, and Ye: Probability & Statistics for Engineers & Scientists, Prentice-Hall.
- Milton, Arnold: Introduction to Probability and Statistics, McGraw-Hill.


5.1 Univariate distributions

How single random variables take values by chance.

Definition 4 – Probability distribution function for discrete random variable

$f: \mathbb{R} \to [0,1]: x \mapsto f(x)$ is a distribution function for the discrete random variable $X$ if the probability of randomly landing at $x$ is $f(x)$:

$$P[X = x] = f(x) \qquad (40)$$

Figure 18 – Example of probability distribution function for discrete random variable.

Definition 5 – Probability distribution function for continuous random variable

$f: \mathbb{R} \to [0,1]: x \mapsto f(x)$ is a distribution function for the continuous random variable $X$ if the probability of randomly landing in the interval $[x, x + dx]$ is $f(x)\,dx$:

$$P[x \leq X \leq x + dx] = f(x)\,dx \qquad (41)$$

Figure 19 – Example of probability distribution function for continuous random variable.

$X$ continuous random variable ⇒

$$\int_{-\infty}^{\infty} f(x)\,dx = 1, \qquad P[a \leq X \leq b] = \int_a^b f(x)\,dx, \qquad P[X = c] = 0 \qquad (42)$$

$X$ discrete random variable ⇒

$$\sum_x f(x) = 1, \qquad P[a \leq X \leq b] = \sum_{a \leq x \leq b} f(x), \qquad P[X = c] = f(c) \qquad (43)$$



Definition 6 – Cumulative probability distribution function for random variable

The cumulative distribution function of the random variable $X$ is

$$F(x) \triangleq P[X \leq x] \qquad (44)$$

$X$ discrete random variable ⇒

$$F(x) = \sum_{r \leq x} f(r) \qquad (45)$$

Figure 20 – Example of cumulative probability distribution function for discrete random variable.

$X$ continuous random variable ⇒

$$F(x) = \int_{-\infty}^x f(r)\,dr \qquad (46)$$

Figure 21 – Example of cumulative probability distribution function for continuous random variable.



EXAMPLE 26 – DISCRETE PROBABILITY DISTRIBUTION FUNCTIONS

Binomial distribution: (Excel function: BINOMDIST)

$p$ = probability one of two possible outcomes will occur in a single random trial ⇒ probability one of two possible outcomes will occur exactly $x$ times in $n$ random trials is

$$f(x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}, \qquad x = 0, 1, 2, 3, \dots, n \qquad (47)$$

(Why?)

Poisson distribution: (Excel function: POISSON)

Binomial approximation for $0 < p \ll 1$ (rare event) with $k = np$:

$$f(x) = \frac{e^{-k} k^x}{x!}, \qquad x = 0, 1, 2, 3, \dots; \quad k > 0$$

Figure 22. Binomial and Poisson distributions (left: $n = 20$, $p = 0.4$, $k = 8$; right: $n = 20$, $p = 0.1$, $k = 2$)

Excel code: (n = 20, p, and k = n*p are named cells in the sheet)

x   n-x     f(x)                  F(x)           fPoisson(x)               FPoisson(x)
0   =n-A2   =1*p^A2*(1-p)^B2      =SUM(C$2:C2)   =EXP(-k)*k^A2/FACT(A2)    =SUM(E$2:E2)
1   =n-A3   =C2*B2/A3*p/(1-p)     =SUM(C$2:C3)   =EXP(-k)*k^A3/FACT(A3)    =SUM(E$2:E3)
2   =n-A4   =C3*B3/A4*p/(1-p)     =SUM(C$2:C4)   =EXP(-k)*k^A4/FACT(A4)    =SUM(E$2:E4)
…   …       …                     …              …                         …
20  =n-A22  =C21*B21/A22*p/(1-p)  =SUM(C$2:C22)  =EXP(-k)*k^A22/FACT(A22)  =SUM(E$2:E22)

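The same comparison in Matlab, as a minimal sketch using base-Matlab functions only (the factorial-based formulas are adequate for small n):

n = 20; p = 0.4; k = n*p;
x = 0:n;
fbin = arrayfun(@(j) nchoosek(n,j)*p^j*(1-p)^(n-j), x);   % eqn. (47)
fpoi = exp(-k)*k.^x./factorial(x);                        % Poisson approximation
bar(x, [fbin; fpoi]')
legend('Binomial','Poisson')

Rerunning with p = 0.1 shows a visibly better approximation, as the rare-event condition p << 1 suggests.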


EXAMPLE 27 – CONTINUOUS PROBABILITY DISTRIBUTION FUNCTIONS

Normal distribution: (Excel function: NORMDIST)

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \qquad (48)$$

Figure 23. The normal (Gaussian) distribution

HWNTHI: Verify:

$$\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} dx = 1$$

Chi-square distribution with $\nu$ degrees of freedom: (Excel function: CHIDIST)

$$f(x) = \begin{cases} \dfrac{1}{2^{\nu/2}\Gamma(\nu/2)} x^{\nu/2 - 1} e^{-x/2} & x > 0 \\ 0 & \text{elsewhere} \end{cases} \qquad (49)$$

where $\Gamma(z) \triangleq \int_0^\infty t^{z-1} e^{-t}\,dt$

Figure 24. The chi-square distribution ($\nu = 1, 2, 3, 6$)


Page 45: CHEE6330 LectureNotes Learning From Data(2)

CHEE 6397 Lecture Notes Learning from Data Michael Nikolaou

- 45 -

How to capture location and spread of distributions?

Definition 7 – Expected value (mean, average) of discrete random variable

The expected value of the discrete random variable $X$ with distribution function $f$ is

$$\mu_X \triangleq E[X] = \sum_x x f(x) \qquad (50)$$

Definition 8 – Expected value (mean, average) of continuous random variable

The expected value of the continuous random variable $X$ with distribution function $f$ is

$$\mu_X \triangleq E[X] = \int_{-\infty}^{\infty} x f(x)\,dx \qquad (51)$$

Figure 25 – Mean of random variable is point of balance (because $\int (x - \mu) f(x)\,dx = 0$)

HWNTHI: Why?

Definition 9 – Variance of continuous or discrete random variable

$$\sigma_X^2 \triangleq \text{Var}(X) = E[(X - \mu)^2] \qquad (52)$$

Definition 10 – Standard deviation of continuous or discrete random variable

$$\sigma \triangleq \sqrt{\sigma^2} = \sqrt{\text{Var}(X)} \qquad (53)$$

Figure 26 – For most distributions $P[\mu - 3\sigma \leq X \leq \mu + 3\sigma] \approx 1$.
(Chebyshev theorem ⇒ for all distributions $P[\mu - 3\sigma \leq X \leq \mu + 3\sigma] \geq 1 - 1/3^2 \approx 0.9$)



Some useful properties of average and variance

Theorem 1 – Average properties

$$c \text{ constant} \Rightarrow E[c] = c \qquad (54)$$

$$E[cX] = cE[X] \qquad (55)$$

Theorem 2 – Variance properties

$$c \text{ constant} \Rightarrow \text{Var}(c) = 0 \qquad (56)$$

$$\text{Var}(cX) = c^2\,\text{Var}(X) \qquad (57)$$

What about $E[X + Y] = E[X] + E[Y]$ or $\text{Var}(X + Y)$?
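Such properties are easy to probe numerically before proving them. A minimal Matlab sketch (uniformly distributed samples are an arbitrary choice) that checks eqns. (55) and (57) and hints at the answer for the sum:

X = rand(1e6,1); Y = rand(1e6,1); c = 3;
[mean(c*X), c*mean(X)]            % eqn. (55): the two entries agree
[var(c*X), c^2*var(X)]            % eqn. (57): the two entries agree
[mean(X+Y), mean(X)+mean(Y)]      % what about the average of a sum?

Whether Var(X + Y) equals Var(X) + Var(Y) depends on how X and Y depend on each other; see the multivariate distributions of section 5.2.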


Other useful parameters for distributions

Median $\tilde\mu$: $P[X \leq \tilde\mu] = 0.5$, i.e. the point where $F(\tilde\mu) = 0.5$

Mode: $\dfrac{df}{dx} = 0$ (peak point of $f$; inflection point of $F$)

[Plots: median read off the cumulative distribution at F(x) = 0.5; mode at the peak of f(x), corresponding to an inflection point of F(x)]

Page 48: CHEE6330 LectureNotes Learning From Data(2)

CHEE 6397 Lecture Notes Learning from Data Michael Nikolaou

- 48 -

EXAMPLE 28 – FROM "THE DILBERT PRINCIPLE", BY SCOTT ADAMS

[Cartoon not reproduced here]


5.1.1 The normal distribution (EXAMPLE 27)

Theorem 3 – Average and variance of normal distribution

$X$ is normally distributed, i.e.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)$$

⇒

$$E[X] = \mu, \qquad \text{Var}[X] = \sigma^2 \qquad (58)$$

Notation: $X \sim N(\mu; \sigma)$ or $X \sim N(\mu; \sigma^2)$ ⇔ the random variable $X$ follows the normal distribution with mean $\mu$ and standard deviation $\sigma$ or variance $\sigma^2$.

Figure 27. Quantiles as a function of mean and standard deviation for the normal distribution

Important quantiles for normal distribution (Figure 27):

$$P[\mu - \sigma \leq X \leq \mu + \sigma] \approx 0.68 \qquad (59)$$

$$P[\mu - 2\sigma \leq X \leq \mu + 2\sigma] \approx 0.95 \qquad (60)$$

$$P[\mu - 3\sigma \leq X \leq \mu + 3\sigma] \approx 0.997 \qquad (61)$$



Rule of Thumb¹²: For most distributions the probability that a corresponding random variable will take a value between $\mu - 3\sigma$ and $\mu + 3\sigma$ is close to 100%.

Theorem 4 – Chebyshev inequality

For any distribution the probability that a corresponding random variable will take a value between $\mu - k\sigma$ and $\mu + k\sigma$ is no less than $1 - \frac{1}{k^2}$. In short,

$$P[\mu - k\sigma \leq X \leq \mu + k\sigma] \geq 1 - \frac{1}{k^2} \qquad (62)$$

Many improvements of the lower bound $1 - \frac{1}{k^2}$ are possible if additional assumptions are made about the probability distribution. For example:

Theorem 5 – Justification for Three-Sigma Rule of Thumb

For any unimodal distribution the probability that a corresponding random variable will take a value between $\mu - k\sigma$ and $\mu + k\sigma$, with $k \geq \sqrt{8/3} \approx 1.633$, is no less than $1 - \frac{4}{9k^2}$. In short,

$$k \geq \sqrt{\frac{8}{3}} \approx 1.633 \;\Rightarrow\; P[\mu - k\sigma \leq X \leq \mu + k\sigma] \geq 1 - \frac{4}{9k^2} \qquad (63)$$

Figure 28. Comparison of expected range of values for random variables that follow the normal distribution, a unimodal distribution, or any distribution (Chebyshev bound).

HWNTHI: Verify the above rule of thumb for the exponential, chi-square, T, and uniform distributions.
HWNTHI: What is "six sigma"?
HWNTHI: Who were Walter Shewhart and W. Edwards Deming?
HWNTHI: What is a control chart?

12 Pukelsheim F. 1994. “The three sigma rule”, American Statistician, 48(2): 88–91.
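A minimal Matlab sketch reproducing the comparison of Figure 28: the tail probability P[|X − μ| > kσ] is computed exactly for the normal distribution (via erf, to stay within base Matlab) and compared against the two bounds:

k = 1.8:0.05:3;
pNormal = 2*(1 - 0.5*(1 + erf(k/sqrt(2))));   % exact normal tail probability
pChebyshev = 1./k.^2;                         % tail bound implied by eqn. (62)
pUnimodal = 4./(9*k.^2);                      % tail bound implied by eqn. (63)
semilogy(k, [pNormal; pChebyshev; pUnimodal])
legend('Normal distribution','Chebyshev','Unimodal')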



Definition 11 – Standard normal distribution

Let $X \sim N(\mu; \sigma)$. Then the random variable $Z \triangleq \dfrac{X - \mu}{\sigma}$ is called standard normal.

Theorem 6 – Average and variance of standard normal distribution

$$Z \sim N(0; 1)$$

Proof: Straightforward.

Figure 29 – Standard normal distribution. Key points: $P[-1 \leq Z \leq 1] \approx 0.68$, $P[-2 \leq Z \leq 2] \approx 0.95$, $P[-3 \leq Z \leq 3] \approx 0.997$

EXAMPLE 29 – CALCULATIONS WITH NORMAL DISTRIBUTION

Let $X \sim N(1; 0.25)$. Find $P[0.9 \leq X \leq 1.5]$ using tables (Table 1, below) or software (e.g., Excel function NORMDIST).

Figure 30 – Notation for calculations with standard normal distribution. Each of the shaded areas is $\alpha/2$.

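A minimal Matlab sketch of this calculation, using erf so that only base Matlab is needed (normcdf from the Statistics Toolbox would do the same), and reading N(1; 0.25) as mu = 1, sigma = 0.25:

mu = 1; sigma = 0.25;
Phi = @(z) 0.5*(1 + erf(z/sqrt(2)));               % standard normal cdf
Pr = Phi((1.5-mu)/sigma) - Phi((0.9-mu)/sigma)     % approximately 0.633

The standardization (X − mu)/sigma is exactly the use of Definition 11.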


Table 1 – Cumulative standard normal distribution

Table values: $F(z) = P[Z \le z] = \int_{-\infty}^{z} f(x)\,dx$. Excel command: NORMDIST(z, 0, 1, TRUE)

z F(z)   z F(z)   z F(z)   z F(z)   z F(z)   z F(z)   z F(z)

0.00 0.5000 0.50 0.6915 1.00 0.8413 1.50 0.9332 2.00 0.9772 2.50 0.9938 3.00 0.9987

0.02 0.5080 0.52 0.6985 1.02 0.8461 1.52 0.9357 2.02 0.9783 2.52 0.9941 3.02 0.9987

0.04 0.5160 0.54 0.7054 1.04 0.8508 1.54 0.9382 2.04 0.9793 2.54 0.9945 3.04 0.9988

0.06 0.5239 0.56 0.7123 1.06 0.8554 1.56 0.9406 2.06 0.9803 2.56 0.9948 3.06 0.9989

0.08 0.5319 0.58 0.7190 1.08 0.8599 1.58 0.9429 2.08 0.9812 2.58 0.9951 3.08 0.9990

0.10 0.5398 0.60 0.7257 1.10 0.8643 1.60 0.9452 2.10 0.9821 2.60 0.9953 3.10 0.9990

0.12 0.5478 0.62 0.7324 1.12 0.8686 1.62 0.9474 2.12 0.9830 2.62 0.9956 3.12 0.9991

0.14 0.5557 0.64 0.7389 1.14 0.8729 1.64 0.9495 2.14 0.9838 2.64 0.9959 3.14 0.9992

0.16 0.5636 0.66 0.7454 1.16 0.8770 1.66 0.9515 2.16 0.9846 2.66 0.9961 3.16 0.9992

0.18 0.5714 0.68 0.7517 1.18 0.8810 1.68 0.9535 2.18 0.9854 2.68 0.9963 3.18 0.9993

0.20 0.5793 0.70 0.7580 1.20 0.8849 1.70 0.9554 2.20 0.9861 2.70 0.9965 3.20 0.9993

0.22 0.5871 0.72 0.7642 1.22 0.8888 1.72 0.9573 2.22 0.9868 2.72 0.9967 3.22 0.9994

0.24 0.5948 0.74 0.7704 1.24 0.8925 1.74 0.9591 2.24 0.9875 2.74 0.9969 3.24 0.9994

0.26 0.6026 0.76 0.7764 1.26 0.8962 1.76 0.9608 2.26 0.9881 2.76 0.9971 3.26 0.9994

0.28 0.6103 0.78 0.7823 1.28 0.8997 1.78 0.9625 2.28 0.9887 2.78 0.9973 3.28 0.9995

0.30 0.6179 0.80 0.7881 1.30 0.9032 1.80 0.9641 2.30 0.9893 2.80 0.9974 3.30 0.9995

0.32 0.6255 0.82 0.7939 1.32 0.9066 1.82 0.9656 2.32 0.9898 2.82 0.9976 3.32 0.9995

0.34 0.6331 0.84 0.7995 1.34 0.9099 1.84 0.9671 2.34 0.9904 2.84 0.9977 3.34 0.9996

0.36 0.6406 0.86 0.8051 1.36 0.9131 1.86 0.9686 2.36 0.9909 2.86 0.9979 3.36 0.9996

0.38 0.6480 0.88 0.8106 1.38 0.9162 1.88 0.9699 2.38 0.9913 2.88 0.9980 3.38 0.9996

0.40 0.6554 0.90 0.8159 1.40 0.9192 1.90 0.9713 2.40 0.9918 2.90 0.9981 3.40 0.9997

0.42 0.6628 0.92 0.8212 1.42 0.9222 1.92 0.9726 2.42 0.9922 2.92 0.9982 3.42 0.9997

0.44 0.6700 0.94 0.8264 1.44 0.9251 1.94 0.9738 2.44 0.9927 2.94 0.9984 3.44 0.9997

0.46 0.6772 0.96 0.8315 1.46 0.9279 1.96 0.9750 2.46 0.9931 2.96 0.9985 3.46 0.9997

0.48 0.6844 0.98 0.8365 1.48 0.9306 1.98 0.9761 2.48 0.9934 2.98 0.9986 3.48 0.9997


5.1.2 The binomial distribution (EXAMPLE 26)

Theorem 7 – The normal distribution approximates the binomial distribution

Normal distribution = limit of binomial distribution as $n \to \infty$, with $\mu = np$, $\sigma = \sqrt{np(1-p)}$

Figure 31. Binomial and normal (Gaussian) distributions

Excel code: Binomial, n = 20, p = 0.1; Normal, mu = =n*p, sigma = =SQRT(n*p*(1-p))

x | n-x | f(x) | F(x) | fNormal(x) | FNormal(x)
0 | =n-A2 | =1*p^A2*(1-p)^B2 | =SUM(C$2:C2) | =NORMDIST(A2,n*p,SQRT(n*p*(1-p)),FALSE) | =NORMDIST(A2,n*p,SQRT(n*p*(1-p)),TRUE)
1 | =n-A3 | =C2*B2/A3*p/(1-p) | =SUM(C$2:C3) | =NORMDIST(A3,n*p,SQRT(n*p*(1-p)),FALSE) | =NORMDIST(A3,n*p,SQRT(n*p*(1-p)),TRUE)
2 | =n-A4 | =C3*B3/A4*p/(1-p) | =SUM(C$2:C4) | =NORMDIST(A4,n*p,SQRT(n*p*(1-p)),FALSE) | =NORMDIST(A4,n*p,SQRT(n*p*(1-p)),TRUE)
… | … | … | … | … | …
20 | =n-A22 | =C21*B21/A22*p/(1-p) | =SUM(C$2:C22) | =NORMDIST(A22,n*p,SQRT(n*p*(1-p)),FALSE) | =NORMDIST(A22,n*p,SQRT(n*p*(1-p)),TRUE)
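An equivalent check in Matlab, as a sketch (not from the notes; nchoosek is base Matlab, normpdf assumes the Statistics Toolbox):

Matlab code:
n = 20; p = 0.1;
x = 0:n;
f = arrayfun(@(k) nchoosek(n,k) * p^k * (1-p)^(n-k), x);  % binomial pmf
mu = n*p; sigma = sqrt(n*p*(1-p));
fN = normpdf(x, mu, sigma);                               % normal approximation
bar(x, f); hold on; plot(x, fN); hold off                 % compare, as in Figure 31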

[Figure 31 panels: Binomial, n = 20, p = 0.4 vs. Normal, mu = 8, sigma = 2.19 (left); Binomial, n = 20, p = 0.1 vs. Normal, mu = 2, sigma = 1.34 (right)]


5.2 Multivariate distributions

Why multivariate distributions?

- Multiple random variables frequently of interest (e.g., statistical mechanics)
- Confidence intervals for multiple parameters of a model identified from experimental data

Definition 12 – Multivariate random variable

A collection of (discrete or continuous) random variables, i.e. a collection of variables that take values by chance.

Random variables usually denoted by capital letters, e.g. $X, Y, Z$.
Values taken by random variables usually denoted by lowercase letters, e.g. $x, y, z$.

So, how can a multivariate random variable take values by chance?

Definition 13 – Probability distribution function for pair of discrete random variables

$f:\mathbb{R}^2 \to [0,1]: (x,y) \mapsto f(x,y)$ is a distribution function for the bivariate discrete random variable $(X,Y)$ if the probability of randomly landing at $(x,y)$ is $f(x,y)$:

$P[X = x, Y = y] = f(x,y)$ (64)

Definition 14 – Probability distribution function for pair of continuous random variables

$f:\mathbb{R}^2 \to [0,\infty): (x,y) \mapsto f(x,y)$ is a distribution function for the bivariate continuous random variable $(X,Y)$ if the probability of randomly landing in the rectangle $[x, x+dx] \times [y, y+dy]$ is $f(x,y)\,dx\,dy$:

$P[x \le X \le x + dx,\; y \le Y \le y + dy] = f(x,y)\,dx\,dy$ (65)

Figure 32 – Example of probability distribution function $f(x,y)$ for a pair of continuous random variables $(X,Y)$. The probability that $(X,Y)$ will take a value in the infinitesimally small rectangle of area $dx\,dy$ is equal to the volume $f(x,y)\,dx\,dy$.


$(X,Y)$ continuous random variable $\Rightarrow$

$P[x_{\min} \le X \le x_{\max},\; y_{\min} \le Y \le y_{\max}] = \int_{y_{\min}}^{y_{\max}}\int_{x_{\min}}^{x_{\max}} f(x,y)\,dx\,dy$
$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1$
$P[X = x, Y = y] = 0$ (66)

$(X,Y)$ discrete random variable $\Rightarrow$

$P[x_{\min} \le X \le x_{\max},\; y_{\min} \le Y \le y_{\max}] = \sum_{y_{\min} \le y \le y_{\max}}\; \sum_{x_{\min} \le x \le x_{\max}} f(x,y)$
$\sum_{y}\sum_{x} f(x,y) = 1$
$P[X = x, Y = y] = f(x,y)$ (67)

Figure 33 – Visualization of eqn. (66).


How to capture location and spread of functions of multivariate distributions?

Definition 15 – Expected value (mean, average) of function of discrete multivariate random variable

The expected value of the function $\phi(X_1,\ldots,X_n)$ of the discrete multivariate random variable $(X_1,\ldots,X_n)$ with distribution function $f$ is

$E[\phi(X_1,\ldots,X_n)] := \sum_{\text{all } x_1}\cdots\sum_{\text{all } x_n} \phi(x_1,\ldots,x_n)\, f(x_1,\ldots,x_n)$ (68)

Definition 16 – Expected value (mean, average) of function of continuous multivariate random variable

The expected value of the function $\phi(X_1,\ldots,X_n)$ of the continuous multivariate random variable $(X_1,\ldots,X_n)$ with distribution function $f$ is

$E[\phi(X_1,\ldots,X_n)] := \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \phi(x_1,\ldots,x_n)\, f(x_1,\ldots,x_n)\,dx_1\cdots dx_n$ (69)

EXAMPLE 30 – MOLECULAR DYNAMICS

If the total energy of $n$ molecules is $\phi(x_1,\ldots,x_n, y_1,\ldots,y_n, z_1,\ldots,z_n)$, where $\{x_1,\ldots,x_n, y_1,\ldots,y_n, z_1,\ldots,z_n\}$ are the $(x,y,z)$ coordinates of each molecule, then the expected total energy of the $n$ molecules is

$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \phi(x_1,\ldots,z_n)\, f(x_1,\ldots,z_n)\,dx_1\cdots dz_n$

where $f(x_1,\ldots,x_n, y_1,\ldots,y_n, z_1,\ldots,z_n)$ is the joint probability distribution function.


How are two random variables related?

Theorem 8 – Average of sum of random variables

$E[X + Y] = E[X] + E[Y]$ (70)

Definition 17 – Covariance of two random variables

$\mathrm{Cov}(X,Y) := E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X\mu_Y$ (71)

where, in analogy to Definition 7 and Definition 8 and by eqns. (68) and (69): For $X, Y$ discrete

$E[XY] := \sum_x \sum_y x\,y\,f(x,y)$ (72)

and for $X, Y$ continuous

$E[XY] := \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,y\,f(x,y)\,dx\,dy$ (73)

Theorem 9 – Variance is generally not additive

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$ (74)

Definition 18 – Covariance matrix of bivariate random variable

$\Sigma := E\left[\begin{bmatrix} X - \mu_X \\ Y - \mu_Y \end{bmatrix}\begin{bmatrix} X - \mu_X & Y - \mu_Y \end{bmatrix}\right] = \begin{bmatrix} \mathrm{Var}(X) & \mathrm{Cov}(X,Y) \\ \mathrm{Cov}(X,Y) & \mathrm{Var}(Y) \end{bmatrix}$ (75)


Given $\mathrm{Cov}(X,Y)$, how can one tell whether it is large or small?

Definition 19 – Correlation coefficient of two random variables

$\rho := \dfrac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$ (76)

Theorem 10 – Cauchy-Schwarz inequality and linear dependence of perfectly correlated variables

$-1 \le \rho \le 1$ (77)

$|\rho| = 1 \;\Rightarrow\; Y - \mu_Y = c\,(X - \mu_X)$ (78)

Definition 20 – Independent random variables

$X, Y$ independent random variables iff

$f(x,y) = f_X(x)\,f_Y(y)$ (79)

namely

$P[X = x, Y = y] = P[X = x]\,P[Y = y]$ (80)

for discrete, or

$P[x \le X \le x+dx,\; y \le Y \le y+dy] = P[x \le X \le x+dx]\,P[y \le Y \le y+dy]$

for continuous variables.

Note: Eqn. (71) $\Rightarrow$ $(X,Y)$ independent $\Rightarrow$ $\mathrm{Cov}(X,Y) = 0$

Theorem 11 – Variance properties

$X, Y$ independent $\Rightarrow$

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ (81)

$\mathrm{Cov}(X,Y) = 0$ does not imply $(X,Y)$ independent, because $X, Y$ may be related nonlinearly.

However, if it is known that $X, Y$ are related linearly, then

$\mathrm{Cov}(X,Y) = 0 \;\Rightarrow\; (X,Y)$ independent
$\rho = 0 \;\Rightarrow\; (X,Y)$ independent

HWNTHI: Is $\sigma_{X+Y} = \sigma_X + \sigma_Y$?


EXAMPLE 31 – $\mathrm{Cov}(X,Y) = 0$ DOES NOT IMPLY $(X,Y)$ INDEPENDENT

Discrete random variables $X, Y$. Values of $f(x,y) := P[X = x, Y = y]$ shown below.

Y \ X | -2 | -1 | 1 | 2
1 | 0 | 1/4 | 1/4 | 0
4 | 1/4 | 0 | 0 | 1/4

$E[Y] = 1\left(0 + \tfrac{1}{4} + \tfrac{1}{4} + 0\right) + 4\left(\tfrac{1}{4} + 0 + 0 + \tfrac{1}{4}\right) = \tfrac{5}{2}$

$E[X] = -2\left(0 + \tfrac{1}{4}\right) + (-1)\left(\tfrac{1}{4} + 0\right) + 1\left(\tfrac{1}{4} + 0\right) + 2\left(0 + \tfrac{1}{4}\right) = 0$

$E[XY] = 0$

$\Rightarrow \mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0$

But $X, Y$ not independent, i.e.

$f_{XY}(x,y) \ne f_X(x)\,f_Y(y)$

In fact, $Y = X^2$.

[3-D plot of $f(x,y)$ over the points $(x,y)$ above]
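The bookkeeping above is easy to verify numerically; the following is a sketch (base Matlab only, not part of the notes):

Matlab code:
xv = [-2 -1 1 2]; yv = [1 4];
f = [0 1/4 1/4 0; 1/4 0 0 1/4];     % rows: y = 1, 4; columns: x = -2, -1, 1, 2
[X, Y] = meshgrid(xv, yv);
EX  = sum(sum(X.*f));               % 0
EY  = sum(sum(Y.*f));               % 2.5
EXY = sum(sum(X.*Y.*f));            % 0
CovXY = EXY - EX*EY                 % 0, and yet Y = X.^2 exactly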


5.2.1 The multivariate normal distribution

Definition 21 – Multivariate normal distribution for random variable $(X_1,\ldots,X_n)$:

$f(x_1,\ldots,x_n) = \dfrac{1}{(2\pi)^{n/2}\det(\Sigma)^{1/2}} \exp\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]$ (82)

where $\Sigma$ is a symmetric and positive semi-definite matrix and $\mathbf{x} := [x_1 \cdots x_n]^T$, $\boldsymbol{\mu} := [\mu_1 \cdots \mu_n]^T$.

Notation: $(X_1,\ldots,X_n) \sim N(\boldsymbol{\mu}; \Sigma)$ or $(X_1,\ldots,X_n) \sim N(\boldsymbol{\mu}; \Sigma^{1/2})$.

Theorem 12 – Average vector of multivariate normal distribution

$(X_1,\ldots,X_n)$ multivariate normal $\Rightarrow$

$E[(X_1,\ldots,X_n)] = \boldsymbol{\mu}$ (83)

Theorem 13 – Covariance matrix for multivariate normal distribution

$(X_1,\ldots,X_n)$ multivariate normal $\Rightarrow$

$\mathrm{Cov}[(X_1,\ldots,X_n)] := \begin{bmatrix} \mathrm{Cov}(X_1,X_1) & \cdots & \mathrm{Cov}(X_1,X_n) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n,X_1) & \cdots & \mathrm{Cov}(X_n,X_n) \end{bmatrix} = \Sigma$ (84)


EXAMPLE 32 – BIVARIATE NORMAL DISTRIBUTION

Eqn. (82) with $n = 2$ $\Rightarrow$

$f(x,y) = \dfrac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left\{-\dfrac{1}{2(1-\rho^2)}\left[\left(\dfrac{x-\mu_X}{\sigma_X}\right)^2 + \left(\dfrac{y-\mu_Y}{\sigma_Y}\right)^2 - 2\rho\left(\dfrac{x-\mu_X}{\sigma_X}\right)\left(\dfrac{y-\mu_Y}{\sigma_Y}\right)\right]\right\}$ (85)

where

$\boldsymbol{\mu} = \begin{bmatrix}\mu_X \\ \mu_Y\end{bmatrix}, \quad \Sigma = \begin{bmatrix}\sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2\end{bmatrix}$ (86)

Figure 34. Bivariate normal distribution with $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1$, $\rho = 0.3$. (Left: 3D. Right: Contours)

HWNTHI: What if $\rho = 0$ in eqns. (85) and (86)?
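A minimal Matlab sketch that reproduces plots like Figure 34 directly from eqn. (85), using the parameter values of the figure caption (base Matlab only; not part of the notes):

Matlab code:
muX = 0; muY = 0; sX = 1; sY = 1; rho = 0.3;
[x, y] = meshgrid(linspace(-3, 3, 80));
u = (x - muX)/sX; v = (y - muY)/sY;
f = exp(-(u.^2 + v.^2 - 2*rho*u.*v)/(2*(1-rho^2))) ...
    / (2*pi*sX*sY*sqrt(1-rho^2));        % eqn. (85)
subplot(1,2,1); surf(x, y, f)             % 3D
subplot(1,2,2); contour(x, y, f)          % contours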


EXAMPLE 33 – INDEPENDENT BIVARIATE NORMAL VARIABLES

Eqn. (82) with $n = 2$, $\rho = 0$ $\Rightarrow$

$\boldsymbol{\mu} = \begin{bmatrix}\mu_X \\ \mu_Y\end{bmatrix}, \quad \Sigma = \begin{bmatrix}\sigma_X^2 & 0 \\ 0 & \sigma_Y^2\end{bmatrix}$ (87)

and

$f(x,y) = \dfrac{1}{2\pi\sigma_X\sigma_Y} \exp\left[-\dfrac{(x-\mu_X)^2}{2\sigma_X^2} - \dfrac{(y-\mu_Y)^2}{2\sigma_Y^2}\right] = \left[\dfrac{1}{\sigma_X\sqrt{2\pi}}\exp\left(-\dfrac{(x-\mu_X)^2}{2\sigma_X^2}\right)\right]\left[\dfrac{1}{\sigma_Y\sqrt{2\pi}}\exp\left(-\dfrac{(y-\mu_Y)^2}{2\sigma_Y^2}\right)\right]$ (88)

Figure 35. Bivariate normal distribution with $\mu_X = \mu_Y = 0$, $\sigma_X = 0.75$, $\sigma_Y = 1$, $\rho = 0$. (Left: 3D. Right: Contours)


An important random variable that is a function of a multivariate normal random variable

- Important for confidence area in parameter estimation through linear least squares

Theorem 14 – Ellipsoid of the Mahalanobis distance of $\mathbf{X} = [X_1 \cdots X_n]^T$ from $\boldsymbol{\mu} = [\mu_1 \cdots \mu_n]^T$

$\mathbf{X} := [X_1 \cdots X_n]^T \sim N(\boldsymbol{\mu}, \Sigma)$ multivariate normal $\Rightarrow$

$(\mathbf{X}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{X}-\boldsymbol{\mu})$ (89)

is $\chi^2$ distributed with $n$ DOF, i.e.

$P[(\mathbf{X}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{X}-\boldsymbol{\mu}) \le c] = \int_0^c \dfrac{x^{n/2-1}\,e^{-x/2}}{2^{n/2}\,(n/2-1)!}\,dx$ (90)


EXAMPLE 34 – WHERE TO EXPECT VALUES OF BIVARIATE NORMAL DISTRIBUTION

Eqn. (82) with $n = 2$ $\Rightarrow$

$P[(\mathbf{X}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{X}-\boldsymbol{\mu}) \le c] = \int_0^c \dfrac{x^{2/2-1}\,e^{-x/2}}{2^{2/2}\,(2/2-1)!}\,dx = \int_0^c \tfrac{1}{2} e^{-x/2}\,dx = 1 - e^{-c/2}$

Figure 36. Ellipsoids of probability $1 - e^{-c/2}$ for $(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \le c$, with $\mu_1 = \mu_2 = 0$, $\sigma_1 = \sigma_2 = 1$, $\rho = 0.3$. (Left: Contours. Right: 3D)

HWNTHI: Verify the above plots.
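One way to check the result numerically is Monte Carlo sampling, as in the sketch below (mvnrnd and chi2cdf assume the Statistics Toolbox; not part of the notes):

Matlab code:
mu = [0 0]; rho = 0.3;
Sigma = [1 rho; rho 1];
X = mvnrnd(mu, Sigma, 1e5);                   % 10^5 bivariate normal draws
d2 = sum((X/chol(Sigma)).^2, 2);              % squared Mahalanobis distances
c = 2;
[mean(d2 <= c), 1 - exp(-c/2), chi2cdf(c,2)]  % all three should agree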


5.3 Importance of normal distribution

1. Central limit theorem (Theorem 15 below) $\Rightarrow$ real error distributions are normal-like.
2. Many common statistical procedures insensitive to deviations from normality.
3. Many distributions approximately equal to normal distribution.
4. Large number of statistical procedures easy to develop assuming normality.

Theorem 15 – The Central Limit Theorem: Random sample averages get closer to normal as sample size increases

Let $\{X_1,\ldots,X_n\}$ be a collection of independent variables, identically distributed, with

$E[X_1] = \cdots = E[X_n] =: \mu$
$\mathrm{Var}[X_1] = \cdots = \mathrm{Var}[X_n] =: \sigma^2$

Then the distribution of the random variable

$\bar{X} := \dfrac{X_1 + \cdots + X_n}{n}$ (91)

approaches the normal distribution with mean $\mu$ and variance $\sigma^2/n$ as $n \to \infty$, i.e.

$\lim_{n\to\infty} \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0;1)$ (92)

Proof:

$E[\bar{X}] = E\left[\dfrac{X_1 + \cdots + X_n}{n}\right] = \dfrac{1}{n}\left(E[X_1] + \cdots + E[X_n]\right) = \mu$

$\{X_1,\ldots,X_n\}$ independent $\Rightarrow$

$\mathrm{Var}[\bar{X}] = \mathrm{Var}\left[\dfrac{X_1 + \cdots + X_n}{n}\right] = \dfrac{1}{n^2}\left(\mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n]\right) = \dfrac{\sigma^2}{n}$

Proof that $\bar{X}$ is approximately normal is more involved and omitted.

Remark: The variables $\{X_1,\ldots,X_n\}$ need not be normally distributed or even continuous.


EXAMPLE 35 – EXPERIMENTAL ILLUSTRATION OF CLT VIA A PLAYING CARD GAME

Draw cards from a stack of 52, with reinsertion. Record results as shown below. Create histograms.

Data from actual experiment (held in class, Oct. 2006)

x1  x2  x3  x4 | xbar
6  10  9  2 | 6.75
4  7  10  9 | 7.5
5  10  9  5 | 7.25
10  10  7  8 | 8.75
7  2  10  10 | 7.25
3  1  1  4 | 2.25
10  6  5  1 | 5.5
10  2  5  5 | 5.5
10  10  3  10 | 8.25
8  5  6  8 | 6.75
7  10  10  10 | 9.25
4  2  9  6 | 5.25
9  3  1  6 | 4.75
10  3  2  10 | 6.25
10  6  9  5 | 7.5
8  6  3  5 | 5.5
10  3  1  1 | 3.75
1  10  4  10 | 6.25
6  4  8  3 | 5.25
9  10  4  3 | 6.5

[Histograms: raw data $x$ (left) and quadruple averages $\bar{x}$ (right), each with frequency bars and cumulative % curve]


EXAMPLE 36 – EXPERIMENTAL ILLUSTRATION OF CLT VIA COMPUTER SIMULATION

Exponential distribution

$f(x) = \begin{cases} 5e^{-5x} & x \ge 0 \\ 0 & \text{else} \end{cases}$

$\mu_X = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^{\infty} 5x e^{-5x}\,dx = 0.2$

$\sigma_X^2 = \int_{-\infty}^{\infty} (x - \mu_X)^2 f(x)\,dx = \int_0^{\infty} (x - 0.2)^2\, 5e^{-5x}\,dx = \dfrac{1}{25} \;\Rightarrow\; \sigma_X = 0.2$

Computer simulation of 10,000 random drawings from the above distribution. Samples of size 1, 2, 5, 20 considered.

$n = 2$: $\sigma_{\bar{X}} = \dfrac{\sigma_X}{\sqrt{2}}$ [histogram]

$n = 5$: $\sigma_{\bar{X}} = \dfrac{\sigma_X}{\sqrt{5}}$ [histogram]

$n = 20$: $\sigma_{\bar{X}} = \dfrac{\sigma_X}{\sqrt{20}}$ [histogram]
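A sketch of such a simulation in Matlab (base functions only; exponential draws generated by inverse-transform sampling; not part of the notes):

Matlab code:
N = 10000; lambda = 5;
sizes = [1 2 5 20];
for j = 1:numel(sizes)
    n = sizes(j);
    draws = -log(rand(N, n))/lambda;   % N samples of size n from Exp(lambda)
    xbar = mean(draws, 2);             % sample averages
    subplot(2, 2, j); histogram(xbar); title(sprintf('n = %d', n))
end
% As n grows the histograms of xbar become increasingly bell-shaped,
% with spread sigma_X/sqrt(n), as the Central Limit Theorem predicts.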


Remark: The Central Limit Theorem justifies averaging multiple measurements of the same quantity.

Unknown distribution $f(x)$ of experimental measurement error for the variable $X$, with average $\mu$ and standard deviation $\sigma_X$.

Random sample: $\bar{X} = \dfrac{X_1 + \cdots + X_n}{n}$

Central Limit Theorem (Theorem 15) $\Rightarrow$ $\mu_{\bar{X}} = \mu_X$, $\sigma_{\bar{X}} = \dfrac{\sigma_X}{\sqrt{n}}$

Figure 37 – How multiple measurements improve precision of parameter estimates.


6. SAMPLE STATISTICS

Definition 22 – Population of observations

Total aggregate of observations that might occur from a particular operation is a population of observations.

- Population of observations theoretically infinite
- Convenient to think of population as having very large finite size

Definition 23 – Sample from population of observations

Observations (usually few) that have actually occurred are a sample from a population.

Each measurement can be thought of as the value taken by a random variable.

Definition 24 – Random sample

In a random sample each observation follows exactly the same distribution (parent distribution) as the entire population.

- Random sampling assumption cannot be relied upon for real data; BUT
- Special precautions and engineering/scientific sense can make assumption relevant

Definition 25 – Sample statistic

A random variable that does not depend explicitly on any parameters associated with the entire population


6.1 Point estimation

Goal: Estimate value of population parameters from sample statistics.

Definition 26 – Estimator

Statistic used to generate an estimate (real number) of population parameter.

Notation: $\hat{\theta}$ = estimator of parameter $\theta$.

Requirements for $\hat{\theta}$:

1. $\hat{\theta}$ unbiased estimator of $\theta$: $E[\hat{\theta}] = \theta$
2. $\mathrm{Var}[\hat{\theta}] \to 0$ as sample size $n \to \infty$
3. Others


6.1.1 Population average ($\mu$) estimation

Given experimental data, how can the average $\mu$ of a population be estimated?

Definition 27 – Sample average (a real number)

$\bar{x} := \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum_i x_i}{n}$ (93)

Definition 28 – Sample average statistic (a random variable)

$\bar{X} := \dfrac{X_1 + X_2 + \cdots + X_n}{n} = \dfrac{\sum_i X_i}{n}$ (94)

Theorem 16 (Gauss-Markov) – Estimating population average from sample average

If sample is random, $\bar{X}$ is the best linear unbiased estimator (BLUE) of $\mu$: $E[\bar{X}] = \mu$

In what sense "best"?


6.1.2 Population variance ($\sigma^2$) estimation

Given experimental data, how can the variance $\sigma^2$ of a population be estimated?

Definition 29 – Sample variance (a real number)

$s^2 := \dfrac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \dfrac{\sum_i (x_i - \bar{x})^2}{n-1}$ (95)

Definition 30 – Sample variance statistic (a random variable)

$S^2 := \dfrac{(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n-1} = \dfrac{\sum_i (X_i - \bar{X})^2}{n-1}$ (96)

Theorem 17 – Estimating population variance from sample variance

If sample is random, $S^2$ is an unbiased estimator of $\sigma^2$: $E[S^2] = \sigma^2$

Definition 31 – Sample standard deviation (a real number)

$s := \sqrt{\dfrac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}}$ (97)

Definition 32 – Sample standard deviation statistic (a random variable)

$S := \sqrt{\dfrac{(X_1 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n-1}}$ (98)

Theorem 18 – Estimating population standard deviation from sample standard deviation

If sample is random, $S$ is not an unbiased estimator of $\sigma$: $E[S] \ne \sigma$


EXAMPLE 37 – U-TUBE VISCOMETER

Measure time for liquid level to drop from A to B as liquid flows through capillary to collection bulb.

Sample average (Excel function: AVERAGE):

$\bar{x} = \dfrac{x_1 + \cdots + x_{50}}{50} = 63.5$ (seconds)

Sample standard deviation (Excel function: STDEV):

$s = \sqrt{\dfrac{(x_1 - \bar{x})^2 + \cdots + (x_{50} - \bar{x})^2}{49}} = 2.0$ (seconds)

Bin Frequency Cumulative %

59 1 2%

60 2 6%

61 4 14%

62 6 26%

63 14 54%

64 9 72%

65 8 88%

66 3 94%

67 1 96%

68 1 98%

69 0 98%

70 1 100%

[Histogram of the 50 measurements with cumulative % curve]

Chance of randomly landing in any bin $\le 62$ appears to be $\dfrac{1 + 2 + 4 + 6}{50} = 26\%$


EXAMPLE 38 – VARIANCE ESTIMATE FOR SMALL SAMPLE

$n = 2$ $\Rightarrow$

$s^2 = \dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2}{2 - 1} = \left(x_1 - \dfrac{x_1 + x_2}{2}\right)^2 + \left(x_2 - \dfrac{x_1 + x_2}{2}\right)^2 = \left(\dfrac{x_1 - x_2}{2}\right)^2 + \left(\dfrac{x_2 - x_1}{2}\right)^2 = \dfrac{(x_1 - x_2)^2}{2}$


EXAMPLE 39 – REVISIT EXAMPLE 37 – U-TUBE VISCOMETER

Excel code:

k | xk | Bin | Frequency | Normal distribution | Cumulative %
1 | 61 | 59 | 1 | =NORMDIST(F3,xbar,s,FALSE)*50 | 0.02
2 | 59 | 60 | 2 | =NORMDIST(F4,xbar,s,FALSE)*50 | 0.06
… | … | … | … | … | …
11 | 65 | 68 | 1 | =NORMDIST(F12,xbar,s,FALSE)*50 | 0.98
12 | 61 | 69 | 0 | =NORMDIST(F13,xbar,s,FALSE)*50 | 0.98
… | … | 70 | 1 | =NORMDIST(F14,xbar,s,FALSE)*50 | 1
50 | 62 | | | |

xbar = =AVERAGE(B2:B51)
s = =STDEV(B2:B51)

[Histogram of frequencies with the fitted normal distribution overlaid]

So, what is our best guess and spread for the average time through the viscometer?

o $63.5 \pm 2.0$?
o $63.5 \pm (70 - 63.5)$?
o $63.5 \pm (63.5 - 59)$?
o $63.5 + (70 - 63.5)$ or $-(63.5 - 59)$?
o Something else?
o Is the question posed correctly?

Are all these measurements necessary?

How many measurements are needed for desired confidence interval?


6.2 Interval estimation

6.2.1 Confidence interval for estimate of population average ($\mu$): The easy way (good for large samples)

Given experimental data for $X$, within what bounds does the average $\mu$ lie, and with what confidence?

Central Limit Theorem (Theorem 15) for random sample $\Rightarrow$

$\bar{X} = \dfrac{X_1 + \cdots + X_n}{n}$ approximately13 normally distributed with $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$

$\Rightarrow P\left[-z_{\alpha/2} \le \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right] = 1 - \alpha$

$\Rightarrow P\left[\bar{X} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right] = 1 - \alpha$ (99)

Assume $\sigma \approx s$ (sample standard deviation, Definition 31) to get final result

$\bar{x} - z_{\alpha/2}\dfrac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\dfrac{s}{\sqrt{n}}$ with confidence $1 - \alpha$ (100)

Interpretation: Roughly a fraction $(1 - \alpha)$ of all random samples of size $n$ will contain $\mu$ within $\left[\bar{x} - z_{\alpha/2}\dfrac{s}{\sqrt{n}},\; \bar{x} + z_{\alpha/2}\dfrac{s}{\sqrt{n}}\right]$

13 exactly, if $X$ is normal


EXAMPLE 40 – CONFIDENCE INTERVAL FOR EXAMPLE 37 – U-TUBE VISCOMETER

$\bar{x} = 63.5$, $s = 2.0$, $n = 50$

$1 - \alpha = 0.95$ $\Rightarrow$ $z_{\alpha/2} = 1.960$ (Excel function: NORMINV(0.975,0,1))

$z_{\alpha/2}\dfrac{s}{\sqrt{n}} = 1.960\,\dfrac{2.0}{\sqrt{50}} \approx 0.6$ $\Rightarrow$ $\mu = 63.5 \pm 0.6$ with 95% confidence.

$1 - \alpha = 0.90$ $\Rightarrow$ $z_{\alpha/2} = 1.645$ (Excel function: NORMINV(0.95,0,1))

$z_{\alpha/2}\dfrac{s}{\sqrt{n}} = 1.645\,\dfrac{2.0}{\sqrt{50}} \approx 0.5$ $\Rightarrow$ $\mu = 63.5 \pm 0.5$ with 90% confidence.
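The same two intervals in Matlab, as a sketch (norminv assumes the Statistics Toolbox):

Matlab code:
xbar = 63.5; s = 2.0; n = 50;
for conf = [0.95 0.90]
    alpha = 1 - conf;
    z = norminv(1 - alpha/2);            % 1.960 and 1.645
    h = z*s/sqrt(n);                     % half-width of the interval
    fprintf('%.0f%%: mu = %.1f +/- %.1f\n', 100*conf, xbar, h)
end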

EXAMPLE 41 – HOW MANY MEASUREMENTS ARE ENOUGH?

Want confidence interval for EXAMPLE 37 – U-Tube Viscometer to be $\pm 0.1$

$1 - \alpha = 0.95$ $\Rightarrow$ $z_{\alpha/2} = 1.960$ $\Rightarrow$ $z_{\alpha/2}\dfrac{s}{\sqrt{n}} = 0.1$ $\Rightarrow$ $n = \left(1.960\,\dfrac{2.0}{0.1}\right)^2 \approx 1500$

$1 - \alpha = 0.90$ $\Rightarrow$ $z_{\alpha/2} = 1.645$ $\Rightarrow$ $z_{\alpha/2}\dfrac{s}{\sqrt{n}} = 0.1$ $\Rightarrow$ $n = \left(1.645\,\dfrac{2.0}{0.1}\right)^2 \approx 1100$


Can the previous approach be applied to interval estimation for other parameters?

Table 2 – How to estimate a confidence interval for a parameter $\theta$ of a population

1. Construct random variable $Y(\theta)$ which
- Has $\theta$ as only unknown parameter;
- Has known distribution if $\theta$ is fixed.
2. Given confidence level $1 - \alpha$, find numbers $r_1, r_2$ such that $P[r_1 \le Y(\theta) \le r_2] = 1 - \alpha$.
3. Solve for $\theta$ to find random variables $L_1, L_2$ such that $P[L_1 \le \theta \le L_2] = 1 - \alpha$.
4. Do experiment and find values $\ell_1, \ell_2$ of $L_1, L_2$.
5. Confidence interval for $\theta$ is $[\ell_1, \ell_2]$ with confidence $1 - \alpha$.

Interpretation: A fraction $1 - \alpha$ of experiments will result in estimate of $\theta$ within $[\ell_1, \ell_2]$.

Note: The above confidence interval is interpreted differently in the Bayesian (level of belief) probability setting.


EXAMPLE 42 – RECAPITULATE CALCULATION OF CONFIDENCE INTERVAL FOR $\mu$

Follow Table 2 – How to estimate a confidence interval for a parameter of a population

Construct random variable $Y(\mu)$ which
- Has $\mu$ as only unknown parameter;
- Has known distribution if $\mu$ is fixed.

$Y(\mu) := \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$

Unknown parameter is $\mu$ ($\sigma$ is assumed to be known and equal to $s$). If $\mu$ is known, then $\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is standard normal.

Given confidence level $1 - \alpha$, find numbers $r_1, r_2$ such that $P[r_1 \le Y(\mu) \le r_2] = 1 - \alpha$:

$P\left[-z_{\alpha/2} \le \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right] = 1 - \alpha$

E.g., $z_{\alpha/2} \approx 2$ for $\alpha = 0.05$.

Solve for $\mu$ to find random variables $L_1, L_2$ such that $P[L_1 \le \mu \le L_2] = 1 - \alpha$:

$P\left[\bar{X} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right] = 1 - \alpha$

Do experiment and find values $\ell_1, \ell_2$ of $L_1, L_2$:

$\ell_1 = \bar{x} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}, \quad \ell_2 = \bar{x} + z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$

Confidence interval for $\mu$ is $[\ell_1, \ell_2]$ with confidence $1 - \alpha$:

$\mu \in \left[\bar{x} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}},\; \bar{x} + z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right]$


6.2.2 Confidence interval for estimate of population average ($\mu$): The right way (good for both small and large samples)

Follow Table 2 – How to estimate a confidence interval for a parameter of a population

Definition 33 – Student's T-distribution with $\nu$ degrees of freedom

$f(t) = \dfrac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}}\left(1 + \dfrac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$

where the gamma function is defined as $\Gamma(z) := \int_0^\infty t^{z-1}e^{-t}\,dt$, or $\Gamma(z) = (z-1)!$ if $z = 1, 2, 3, \ldots$

Figure 38 – T-distribution for various degrees of freedom (DOF = 1 through 10)

HWNTHI: Who was Student? (Hint: Search Gosset, Guinness, Student)



Theorem 19 – Average and variance of T-distribution

$\mu = 0, \quad \sigma^2 = \dfrac{\nu}{\nu - 2}, \quad \nu > 2$

Theorem 20 – T-distribution and the normal distribution

The T-distribution approaches the normal distribution for large number of DOF.

Theorem 21 – Statistic involving $\mu$ that follows T-distribution

$\{X_1,\ldots,X_n\}$ random sample of $X \sim N(\mu, \sigma)$ $\Rightarrow$

$\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ follows the T-distribution with $n - 1$ degrees of freedom.

$\Rightarrow P\left[-t_{\alpha/2} \le \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{\alpha/2}\right] = 1 - \alpha$

$\Rightarrow P\left[\bar{X} - t_{\alpha/2}\dfrac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2}\dfrac{S}{\sqrt{n}}\right] = 1 - \alpha$

$\Rightarrow$ $100(1-\alpha)\%$ confidence bounds on $\mu$:

$\bar{x} - t_{\alpha/2}\dfrac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2}\dfrac{s}{\sqrt{n}}$ (101)


Table 3 – Cumulative T-distribution

Table values: $P[T \le t] = 1 - \alpha$. Excel function: 1-TDIST(t,dof,1) (Note: TDIST calculates tail area)

Degrees of Freedom \ t: 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75

1 0.5000 0.5780 0.6476 0.7048 0.7500 0.7852 0.8128 0.8348 0.8524 0.8669 0.8789 0.8890 0.8976 0.9050 0.9114 0.9170

2 0.5000 0.5870 0.6667 0.7343 0.7887 0.8311 0.8638 0.8889 0.9082 0.9233 0.9352 0.9446 0.9523 0.9585 0.9636 0.9678

3 0.5000 0.5906 0.6743 0.7461 0.8045 0.8500 0.8847 0.9108 0.9303 0.9450 0.9561 0.9646 0.9712 0.9763 0.9803 0.9834

4 0.5000 0.5925 0.6783 0.7525 0.8130 0.8603 0.8960 0.9225 0.9419 0.9562 0.9666 0.9743 0.9800 0.9843 0.9876 0.9900

5 0.5000 0.5937 0.6809 0.7565 0.8184 0.8667 0.9030 0.9297 0.9490 0.9629 0.9728 0.9798 0.9850 0.9887 0.9914 0.9934

6 0.5000 0.5945 0.6826 0.7592 0.8220 0.8711 0.9079 0.9347 0.9538 0.9673 0.9767 0.9834 0.9880 0.9913 0.9936 0.9952

7 0.5000 0.5951 0.6838 0.7612 0.8247 0.8743 0.9114 0.9382 0.9572 0.9704 0.9795 0.9857 0.9900 0.9930 0.9950 0.9964

8 0.5000 0.5956 0.6847 0.7626 0.8267 0.8767 0.9140 0.9409 0.9597 0.9727 0.9815 0.9875 0.9915 0.9941 0.9960 0.9972

9 0.5000 0.5959 0.6855 0.7638 0.8283 0.8786 0.9161 0.9430 0.9617 0.9745 0.9831 0.9888 0.9925 0.9950 0.9966 0.9977

10 0.5000 0.5962 0.6861 0.7647 0.8296 0.8801 0.9177 0.9447 0.9633 0.9759 0.9843 0.9898 0.9933 0.9956 0.9971 0.9981

11 0.5000 0.5964 0.6865 0.7655 0.8306 0.8814 0.9191 0.9460 0.9646 0.9771 0.9852 0.9906 0.9940 0.9961 0.9975 0.9984

12 0.5000 0.5966 0.6869 0.7661 0.8315 0.8824 0.9203 0.9472 0.9657 0.9780 0.9860 0.9912 0.9945 0.9965 0.9978 0.9986

13 0.5000 0.5968 0.6873 0.7667 0.8322 0.8833 0.9212 0.9482 0.9666 0.9788 0.9867 0.9917 0.9949 0.9968 0.9980 0.9988

14 0.5000 0.5969 0.6876 0.7672 0.8329 0.8841 0.9221 0.9490 0.9674 0.9795 0.9873 0.9922 0.9952 0.9971 0.9982 0.9989

15 0.5000 0.5970 0.6878 0.7676 0.8334 0.8848 0.9228 0.9497 0.9680 0.9801 0.9877 0.9926 0.9955 0.9973 0.9984 0.9990

16 0.5000 0.5971 0.6881 0.7679 0.8339 0.8854 0.9235 0.9504 0.9686 0.9806 0.9882 0.9929 0.9958 0.9975 0.9985 0.9991

17 0.5000 0.5972 0.6883 0.7682 0.8343 0.8859 0.9240 0.9509 0.9691 0.9810 0.9885 0.9932 0.9960 0.9976 0.9986 0.9992

18 0.5000 0.5973 0.6884 0.7685 0.8347 0.8863 0.9245 0.9514 0.9696 0.9814 0.9888 0.9934 0.9962 0.9978 0.9987 0.9993

19 0.5000 0.5974 0.6886 0.7688 0.8351 0.8868 0.9250 0.9519 0.9700 0.9818 0.9891 0.9936 0.9963 0.9979 0.9988 0.9993

20 0.5000 0.5974 0.6887 0.7690 0.8354 0.8871 0.9254 0.9523 0.9704 0.9821 0.9894 0.9938 0.9965 0.9980 0.9989 0.9994

21 0.5000 0.5975 0.6889 0.7692 0.8357 0.8875 0.9258 0.9526 0.9707 0.9824 0.9896 0.9940 0.9966 0.9981 0.9989 0.9994

22 0.5000 0.5975 0.6890 0.7694 0.8359 0.8878 0.9261 0.9530 0.9710 0.9826 0.9898 0.9942 0.9967 0.9982 0.9990 0.9994

23 0.5000 0.5976 0.6891 0.7696 0.8361 0.8881 0.9264 0.9533 0.9713 0.9828 0.9900 0.9943 0.9968 0.9982 0.9990 0.9995

24 0.5000 0.5976 0.6892 0.7697 0.8364 0.8883 0.9267 0.9536 0.9715 0.9831 0.9902 0.9944 0.9969 0.9983 0.9991 0.9995

25 0.5000 0.5977 0.6893 0.7699 0.8366 0.8886 0.9269 0.9538 0.9718 0.9833 0.9903 0.9945 0.9970 0.9984 0.9991 0.9995

26 0.5000 0.5977 0.6894 0.7700 0.8367 0.8888 0.9272 0.9540 0.9720 0.9834 0.9905 0.9946 0.9971 0.9984 0.9992 0.9996

27 0.5000 0.5978 0.6894 0.7701 0.8369 0.8890 0.9274 0.9543 0.9722 0.9836 0.9906 0.9947 0.9971 0.9985 0.9992 0.9996

28 0.5000 0.5978 0.6895 0.7702 0.8371 0.8892 0.9276 0.9545 0.9724 0.9838 0.9907 0.9948 0.9972 0.9985 0.9992 0.9996

29 0.5000 0.5978 0.6896 0.7704 0.8372 0.8894 0.9278 0.9547 0.9725 0.9839 0.9908 0.9949 0.9973 0.9985 0.9992 0.9996

30 0.5000 0.5979 0.6896 0.7705 0.8373 0.8895 0.9280 0.9548 0.9727 0.9840 0.9909 0.9950 0.9973 0.9986 0.9993 0.9996


EXAMPLE 43 – CONFIDENCE INTERVAL FOR EXAMPLE 37 – U-TUBE VISCOMETER

Two measurements only: $x_1 = 61$, $x_2 = 63$ $\Rightarrow$ $\bar{x} = 62$

Eqn. (97) $\Rightarrow$ $s = 1.414$

$1 - \alpha = 0.95$ $\Rightarrow$ $t_{\alpha/2} = 12.71$ (Excel function: TINV(0.05,1)) $\Rightarrow$ $t_{\alpha/2}\dfrac{s}{\sqrt{n}} = 12.71\,\dfrac{1.414}{\sqrt{2}} = 12.71$ $\Rightarrow$ $\mu = 62 \pm 12.7$ with 95% confidence

$1 - \alpha = 0.90$ $\Rightarrow$ $t_{\alpha/2} = 6.314$ (Excel function: TINV(0.10,1)) $\Rightarrow$ $t_{\alpha/2}\dfrac{s}{\sqrt{n}} = 6.314\,\dfrac{1.414}{\sqrt{2}} = 6.314$ $\Rightarrow$ $\mu = 62 \pm 6.3$ with 90% confidence

Compare with normal distribution calculation in EXAMPLE 40:

$1 - \alpha = 0.95$ $\Rightarrow$ $z_{\alpha/2} = 1.960$

$1 - \alpha = 0.90$ $\Rightarrow$ $z_{\alpha/2} = 1.645$

Figure 39 – $t_{\alpha/2}$ for a sample of size $n$ with $1 - \alpha = 95\%$ and 68% (corresponding to T-distribution with $n - 1$ degrees of freedom). Note that as $n$ increases, $t_{\alpha/2}$ approaches $z_{\alpha/2}$, namely $t_{\alpha/2} \to 2$ for $1 - \alpha = 95\%$ and $t_{\alpha/2} \to 1$ for $1 - \alpha = 68\%$ (cf. discussion after Theorem 3, p. 49, and Theorem 6, p. 51)
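A sketch of the small-sample calculation in Matlab (tinv assumes the Statistics Toolbox):

Matlab code:
x = [61 63]; n = numel(x);
xbar = mean(x); s = std(x);               % 62 and 1.414
for conf = [0.95 0.90]
    t = tinv(1 - (1-conf)/2, n - 1);      % 12.71 and 6.314 for 1 DOF
    h = t*s/sqrt(n);
    fprintf('%.0f%%: mu = %.1f +/- %.1f\n', 100*conf, xbar, h)
end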



6.2.3 Selecting the number of measurements

Eqn. (100) $\Rightarrow$ Confidence bounds on $\mu$ become tighter as sample size (number of measurements) $n$ increases, but they cannot be reduced indefinitely by more experiments, because

o Time and experimental resources increase
o Difficult to eliminate all systematic errors
o Outliers may be present and hard to detect

6.2.4 Detecting measurement outliers

Definition 34 – Outlier

Experimental observation that appears to deviate markedly from other members of a sample.

What do outliers tell us?

Must outliers be removed, and why?

Figure 40 – Effect of outlier on estimate of $\mu$. Left plot: Measurement #10 appears (with ~95% confidence) not to follow the trend of the remaining 9 measurements $\{x_1,\ldots,x_9\}$, because it is outside the band $\bar{x} \pm 2s$. As a result, the estimate of $\mu$ is elevated. Right plot: Removing measurement #10 and recalculating $\bar{x} \pm 2s$ based on $\{x_1,\ldots,x_9\}$ confirms that $\{x_1,\ldots,x_9\}$ appear to follow the same distribution. The estimate of $\mu$ is reasonable, and its confidence interval much narrower than calculated with the outlier in the data.

Detection of outliers is difficult (sensitive to the normal distribution assumption (vs. power-law distributions) for very rare events).

After an outlier is removed from data, sample average and variance should be recalculated.

Chauvenet's Criterion for detecting an outlier:

If a value is in a range where fewer than 0.5 measurements were expected under the normality assumption, then the value is probably an outlier.

Many techniques are available14.

14 http://itl.nist.gov/div898/handbook/eda/section3/eda35h.htm


EXAMPLE 44 – ARE THERE OUTLIERS IN EXAMPLE 37 – U-TUBE VISCOMETER?

Is the measurement 70 an outlier? Assuming normality, $P[x \ge 70] = 0.0006786$

Expected number of measurements $\ge 70$ = $0.0006786 \times 50 \approx 0.03 < 0.5$

$\Rightarrow$ 70 probably an outlier

Figure 41 – Data for EXAMPLE 37 – U-Tube Viscometer. Horizontal gridlines above and below $\bar{x}$ are one $s$ apart from each other.



EXAMPLE 45 – OUTLIERS AND THE OZONE HOLE15

Figure 42 – Ozone in the atmosphere16

All outliers should be taken seriously and should be investigated thoroughly for explanations. Automatic outlier-rejection schemes (such as throw out all data beyond 4 sample standard deviations from the sample mean) are particularly dangerous.

The classic case of automatic outlier rejection becoming automatic information rejection was the South Pole ozone depletion problem. Ozone depletion over the South Pole would have been detected years earlier except for the fact that the satellite data recording the low ozone readings had outlier-rejection code that automatically screened out the "outliers" (that is, the low ozone readings) before the analysis was conducted. Such inadvertent (and incorrect) purging went on for years. It was not until ground-based South Pole readings started detecting low ozone readings that someone17 decided to double-check as to why the satellite had not picked up this fact--it had, but it had gotten thrown out! The best attitude is that outliers are our "friends", outliers are trying to tell us something, and we should not stop until we are comfortable in the explanation for each outlier.

HWNTHI: Who were Mario Molina, Paul J. Crutzen and F. Sherwood Rowland?

15 http://itl.nist.gov/div898/handbook/eda/section3/histogr8.htm
16 Copied from: http://www.theozonehole.com/
17 Nobel Prizes were awarded for that!


6.2.5 Confidence interval estimation for population variance ($\sigma^2$)

Follow Table 2 – How to estimate a confidence interval for a parameter of a population

Definition 35 – Chi-square ($\chi^2$) distribution with $\nu$ degrees of freedom (cf. EXAMPLE 27)

$f(x) = \begin{cases} \dfrac{x^{\nu/2-1}\,e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)} & x > 0 \\ 0 & \text{elsewhere} \end{cases}$

where the gamma function is defined as $\Gamma(z) := \int_0^\infty t^{z-1}e^{-t}\,dt$, or $\Gamma(z) = (z-1)!$ if $z = 1, 2, 3, \ldots$

Figure 43 – Chi-square-distribution for various degrees of freedom (DOF = 1, 3, 5, ..., 19)



Theorem 22 – Average and variance of Chi-square-distribution

$\mu = \nu, \quad \sigma^2 = 2\nu$

Theorem 23 – Chi-square-distribution and the normal distribution

The $\chi^2$-distribution approaches the normal distribution for large number of DOF (see Figure 43 above).

Theorem 24 – Statistic involving $\sigma^2$ that follows Chi-square-distribution

$\{X_1,\ldots,X_n\}$ random sample of $X \sim N(\mu,\sigma)$ $\Rightarrow$

$\chi^2 := \sum_{i=1}^n \dfrac{(X_i - \bar{X})^2}{\sigma^2} = \dfrac{(n-1)S^2}{\sigma^2}$ follows the $\chi^2$-distribution with $n - 1$ DOF.

Note: If the normality assumption $X \sim N(\mu,\sigma)$ is not satisfied, the distribution of $\dfrac{(n-1)S^2}{\sigma^2}$ may be far from $\chi^2$. Compare with $\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, which is approximately normally distributed even if $X$ is not.


Table 4 – Cumulative Chi-square-distribution

Table values: $P[X \le \chi^2]$. Excel function: 1-CHIDIST($\chi^2$,dof) (Note: CHIDIST calculates tail area)

Degrees of Freedom \ $\chi^2$: 0 0.75 1.5 2.25 3 3.75 4.5 5.25 6 6.75 7.5 8.25 9 9.75 10.5 11.25 12 12.75 13.5

1 0.0000 0.6135 0.7793 0.8664 0.9167 0.9472 0.9661 0.9781 0.9857 0.9906 0.9938 0.9959 0.9973 0.9982 0.9988 0.9992 0.9995 0.9996 0.9998

2 0.0000 0.3127 0.5276 0.6753 0.7769 0.8466 0.8946 0.9276 0.9502 0.9658 0.9765 0.9838 0.9889 0.9924 0.9948 0.9964 0.9975 0.9983 0.9988

3 0.0000 0.1386 0.3177 0.4778 0.6084 0.7102 0.7877 0.8456 0.8884 0.9197 0.9424 0.9589 0.9707 0.9792 0.9852 0.9896 0.9926 0.9948 0.9963

4 0.0000 0.0550 0.1734 0.3101 0.4422 0.5591 0.6575 0.7374 0.8009 0.8503 0.8883 0.9172 0.9389 0.9551 0.9672 0.9761 0.9826 0.9874 0.9909

5 0.0000 0.0199 0.0869 0.1864 0.3000 0.4141 0.5201 0.6139 0.6938 0.7601 0.8140 0.8570 0.8909 0.9174 0.9378 0.9534 0.9652 0.9742 0.9809

6 0.0000 0.0067 0.0405 0.1047 0.1912 0.2895 0.3907 0.4878 0.5768 0.6554 0.7229 0.7796 0.8264 0.8644 0.8949 0.9190 0.9380 0.9528 0.9643

7 0.0000 0.0021 0.0177 0.0553 0.1150 0.1919 0.2793 0.3705 0.4603 0.5446 0.6213 0.6889 0.7473 0.7968 0.8380 0.8719 0.8994 0.9216 0.9392

8 0.0000 0.0006 0.0073 0.0276 0.0656 0.1211 0.1906 0.2694 0.3528 0.4362 0.5162 0.5906 0.6577 0.7170 0.7683 0.8121 0.8488 0.8793 0.9042

9 0.0000 0.0002 0.0029 0.0131 0.0357 0.0729 0.1245 0.1880 0.2601 0.3369 0.4148 0.4908 0.5627 0.6289 0.6885 0.7410 0.7867 0.8258 0.8587

10 0.0000 0.0000 0.0011 0.0060 0.0186 0.0421 0.0780 0.1261 0.1847 0.2512 0.3225 0.3956 0.4679 0.5373 0.6022 0.6616 0.7149 0.7620 0.8030

11 0.0000 0.0000 0.0004 0.0026 0.0093 0.0233 0.0471 0.0815 0.1266 0.1810 0.2427 0.3093 0.3781 0.4470 0.5140 0.5774 0.6364 0.6900 0.7381

12 0.0000 0.0000 0.0001 0.0011 0.0045 0.0125 0.0274 0.0509 0.0839 0.1263 0.1771 0.2347 0.2971 0.3621 0.4278 0.4924 0.5543 0.6125 0.6662

13 0.0000 0.0000 0.0000 0.0004 0.0021 0.0064 0.0154 0.0307 0.0538 0.0854 0.1254 0.1731 0.2271 0.2858 0.3474 0.4101 0.4724 0.5327 0.5900

14 0.0000 0.0000 0.0000 0.0002 0.0009 0.0032 0.0084 0.0180 0.0335 0.0561 0.0863 0.1241 0.1689 0.2198 0.2752 0.3337 0.3937 0.4537 0.5124

15 0.0000 0.0000 0.0000 0.0001 0.0004 0.0016 0.0044 0.0102 0.0203 0.0358 0.0577 0.0866 0.1225 0.1648 0.2128 0.2653 0.3210 0.3784 0.4363

16 0.0000 0.0000 0.0000 0.0000 0.0002 0.0007 0.0023 0.0056 0.0119 0.0222 0.0376 0.0589 0.0866 0.1206 0.1608 0.2062 0.2560 0.3091 0.3641

17 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0011 0.0030 0.0068 0.0134 0.0239 0.0391 0.0597 0.0862 0.1187 0.1567 0.1999 0.2472 0.2979

18 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0006 0.0016 0.0038 0.0079 0.0148 0.0253 0.0403 0.0602 0.0856 0.1166 0.1528 0.1938 0.2389

19 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0008 0.0021 0.0046 0.0090 0.0160 0.0265 0.0412 0.0605 0.0849 0.1144 0.1489 0.1880

20 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0004 0.0011 0.0026 0.0053 0.0099 0.0171 0.0275 0.0418 0.0605 0.0839 0.1121 0.1451

21 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0006 0.0014 0.0031 0.0060 0.0108 0.0180 0.0283 0.0423 0.0604 0.0829 0.1099

22 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0008 0.0017 0.0036 0.0067 0.0116 0.0188 0.0290 0.0426 0.0601 0.0817

23 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0004 0.0010 0.0021 0.0040 0.0073 0.0123 0.0195 0.0295 0.0428 0.0597

24 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0005 0.0012 0.0024 0.0045 0.0078 0.0129 0.0201 0.0299 0.0429

25 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0007 0.0014 0.0027 0.0049 0.0084 0.0134 0.0206 0.0302

26 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0004 0.0008 0.0016 0.0030 0.0053 0.0088 0.0139 0.0210

27 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0005 0.0010 0.0018 0.0033 0.0057 0.0092 0.0143

28 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0005 0.0011 0.0021 0.0036 0.0061 0.0096

29 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0003 0.0006 0.0013 0.0023 0.0039 0.0064

30 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0004 0.0007 0.0014 0.0025 0.0042


Follow Table 2 – How to estimate a confidence interval for a parameter of a population

o Given $\alpha$:

$P\left[\chi^2_{1-\alpha/2} \le \dfrac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2}\right] = 1 - \alpha$

o Solve for $\sigma^2$:

$P\left[\dfrac{(n-1)S^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \dfrac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right] = 1 - \alpha$

$\Rightarrow$ $100(1-\alpha)\%$ confidence bounds on $\sigma^2$ are

$\left[\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}},\; \dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right]$ (102)


EXAMPLE 46 – ESTIMATING THE SPREAD OF A PARTICLE SIZE DISTRIBUTION

Observations of random variable: 3.4 3.6 4.0 0.4 2.0 3.0 3.1 4.1 1.4 2.5 1.4 2.0 3.1 1.8 1.6 3.5 2.5 1.7 5.1 0.7 4.2 1.5 3.0 3.9 3.0

$s^2 = \cdots = 1.407$

Let $\alpha = 0.05$ $\Rightarrow$ $\dfrac{\alpha}{2} = 0.025$, $1 - \dfrac{\alpha}{2} = 0.975$.

$n - 1 = 24$ degrees of freedom

$\dfrac{(n-1)s^2}{\chi^2_{0.025}} = \dfrac{(24)(1.407)}{39.4} = 0.857$ and $\dfrac{(n-1)s^2}{\chi^2_{0.975}} = \dfrac{(24)(1.407)}{12.4} = 2.723$

$\Rightarrow$ $0.857 \le \sigma^2 \le 2.723$ and $0.926 \le \sigma \le 1.650$ with 95% likelihood (confidence).
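The same interval in Matlab, as a sketch (chi2inv assumes the Statistics Toolbox; note that the notes' $\chi^2_{0.025}$ denotes the upper-tail point, i.e. chi2inv(0.975, ...)):

Matlab code:
x = [3.4 3.6 4.0 0.4 2.0 3.0 3.1 4.1 1.4 2.5 1.4 2.0 3.1 1.8 ...
     1.6 3.5 2.5 1.7 5.1 0.7 4.2 1.5 3.0 3.9 3.0];
n = numel(x); s2 = var(x);                      % 1.407
lo = (n-1)*s2/chi2inv(0.975, n-1);              % 0.857
hi = (n-1)*s2/chi2inv(0.025, n-1);              % 2.723
fprintf('%.3f <= sigma^2 <= %.3f\n', lo, hi)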


7. PROPAGATION OF ERRORS

If $x = f(u, v, w, \ldots)$, how do errors in measurement of $u, v, w, \ldots$ affect $x$?

EXAMPLE 47 – MEASUREMENT OF PERMEABILITY OF A POROUS MEDIUM

Figure 44 – Experimental setting for measurement of permeability of porous rock. [Schematic: pump, pressure gauge, permeable rock of length $L$, measuring cylinder; pressures $p_{\text{in}}$, $p_{\text{out}}$]

Assumptions: Newtonian fluid, incompressible fluid, no slip at pore wall, $\mathrm{Re} \ll 1$

Darcy's law: Fluid velocity $v = -\dfrac{k}{\mu}\dfrac{dP}{dx}$, $k$ = permeability

$\Rightarrow k = \dfrac{q\mu L}{A(p_{\text{in}} - p_{\text{out}})}$ (103)

Sources of error in $k$:

o Errors in measurement of $q$ (volumetric flow rate), $\mu$ (viscosity), $L$, $A$ (cross-sectional area), $p_{\text{in}}, p_{\text{out}}$ (pressure in and out)
o Violation of assumptions
o Pressure leaks


7.1 Linear model

Measurement $X = x_0 + a_u U + a_v V + a_w W + \cdots$ $\Rightarrow$

$\mu_X = x_0 + a_u \mu_U + a_v \mu_V + a_w \mu_W + \cdots$ (104)

$\sigma_X^2 = a_u^2\sigma_U^2 + a_v^2\sigma_V^2 + a_w^2\sigma_W^2 + \cdots + 2a_u a_v\,\mathrm{Cov}[U,V] + 2a_v a_w\,\mathrm{Cov}[V,W] + 2a_w a_u\,\mathrm{Cov}[W,U] + \cdots$ (105)

$U, V, W, \ldots$ independent $\Rightarrow$

$\sigma_X^2 = a_u^2\sigma_U^2 + a_v^2\sigma_V^2 + a_w^2\sigma_W^2 + \cdots$ (106)

(Why?) Note:

$\sigma_X = \sqrt{a_u^2\sigma_U^2 + a_v^2\sigma_V^2 + a_w^2\sigma_W^2 + \cdots} \ne a_u\sigma_U + a_v\sigma_V + a_w\sigma_W + \cdots$ (107)

7.2 Nonlinear model

$X = f(U, V, W, \ldots)$ and $U, V, W, \ldots$ independent $\Rightarrow$

$X = f(U,V,W,\ldots) \approx f(\mu_U,\mu_V,\mu_W,\ldots) + a_u(U - \mu_U) + a_v(V - \mu_V) + a_w(W - \mu_W) + \cdots$

where $a_u := \left.\dfrac{\partial f}{\partial U}\right|_{(\mu_U,\mu_V,\mu_W,\ldots)}$, $a_v := \left.\dfrac{\partial f}{\partial V}\right|_{(\mu_U,\mu_V,\mu_W,\ldots)}$, $a_w := \left.\dfrac{\partial f}{\partial W}\right|_{(\mu_U,\mu_V,\mu_W,\ldots)}$

$\Rightarrow$ collect constant terms into $x_0$ and apply linear model eqns.


EXAMPLE 48 – MEASUREMENT OF PERMEABILITY OF A POROUS MEDIUM

Eqn. (103): $k = \dfrac{q\mu L}{A\,\Delta p}$, with $\Delta p := p_{\text{in}} - p_{\text{out}}$ $\Rightarrow$

$\dfrac{\partial k}{\partial q} = \dfrac{\mu L}{A\,\Delta p}, \quad \dfrac{\partial k}{\partial \mu} = \dfrac{q L}{A\,\Delta p}, \quad \dfrac{\partial k}{\partial L} = \dfrac{q\mu}{A\,\Delta p}, \quad \dfrac{\partial k}{\partial A} = -\dfrac{q\mu L}{A^2\,\Delta p}, \quad \dfrac{\partial k}{\partial(\Delta p)} = -\dfrac{q\mu L}{A\,\Delta p^2}$

$\Rightarrow \Delta k \approx k\left(\dfrac{\Delta q}{q} + \dfrac{\Delta\mu}{\mu} + \dfrac{\Delta L}{L} - \dfrac{\Delta A}{A} - \dfrac{\Delta(\Delta p)}{\Delta p}\right)$ (108)

$\Rightarrow \left(\dfrac{\sigma_k}{k}\right)^2 = \left(\dfrac{\sigma_q}{q}\right)^2 + \left(\dfrac{\sigma_\mu}{\mu}\right)^2 + \left(\dfrac{\sigma_L}{L}\right)^2 + \left(\dfrac{\sigma_A}{A}\right)^2 + \left(\dfrac{\sigma_{\Delta p}}{\Delta p}\right)^2$ (109)

Let $\dfrac{\sigma_q}{q} = 4\%$, $\dfrac{\sigma_\mu}{\mu} = 5\%$, $\dfrac{\sigma_L}{L} = 0.1\%$, $\dfrac{\sigma_A}{A} = 0.5\%$, $\dfrac{\sigma_{\Delta p}}{\Delta p} = 3\%$.

Then $\dfrac{\sigma_k}{k} \approx 7.1\%$

HWNTHI: What estimate for $\sigma_k/k$ would the weighted sum of standard deviations rather than variances yield?
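The arithmetic behind eqn. (109) is a one-liner (base Matlab; a sketch):

Matlab code:
rel = [0.04 0.05 0.001 0.005 0.03];   % sigma/value for q, mu, L, A, dp
relK = sqrt(sum(rel.^2))              % 0.0709, i.e. ~7.1%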


8. REGRESSION & CORRELATION

Definition 36 – Dependent and independent variables

$X$: Independent or predictor variable
$Y$: Dependent or response variable

$Y$ is random due to external disturbances unaccounted for, or measurement noise. The values of $X$ are known either without or with some random error.

EXAMPLE 49 – INDEPENDENT AND DEPENDENT VARIABLES

$X$: Reactor temperature, $Y$: Extent of conversion of reactants to products
$X$: Concentration of CO2 in atmosphere, $Y$: Atmospheric temperature

Definition 37 – Regression

Use data to build a mathematical model that describes how $X$ affects $Y$.

Ideally want the probability distribution of the random variable $Y$ given $x$: $f_{Y|x}(y)$

Settle for $\mu_{Y|x}$ as a function of $x$.

Figure 45 – Curve of regression. Distribution of $Y$ for a collection of $x_i$ (i.e. $f_{Y|x_i}(y)$) is shown.

Definition 38 – Curve of regression

Curve of regression is plot of $\mu_{Y|x}$ versus $x$.

Experimental determination of curve of regression: Use random sample

$\{(x_1, Y(x_1)), \ldots, (x_n, Y(x_n))\}$

$Y_1 = g(x_1) + E_1, \quad \ldots, \quad Y_n = g(x_n) + E_n$

where $x_1,\ldots,x_n$ are known values and $Y_1,\ldots,Y_n$ are random variables with errors $E_1,\ldots,E_n$

$\Rightarrow$ $Y_1,\ldots,Y_n$ take different values by chance every time they are measured.


Figure 46 – Collection of data for regression. Four measurements are collected at each $x_i$.

Definition 39 – Controlled and uncontrolled study

Controlled Study: $x_i$ selected by experimenter
Observational Study: $x_i$ observed at random

EXAMPLE 50 – CONTROLLED AND UNCONTROLLED STUDIES

Controlled study: Effect of reactor temperature $X$ on conversion $Y$ of reactants to products.
Uncontrolled study: Effect of concentration of CO2 in atmosphere $X$ on atmospheric temperature $Y$.


Definition 40 – Linear and nonlinear regression

Assuming

$Y = g(x, \theta_1, \theta_2, \ldots) + E$

with curve of regression

$\mu_{Y|x} = g(x, \theta_1, \theta_2, \ldots)$

where $\{\theta_1, \theta_2, \ldots\}$ are unknown parameters, we have

- Linear regression if $g(x, \theta_1, \theta_2, \ldots)$ is linear in $\theta_1, \theta_2, \ldots$ (Relatively easy!)
- Nonlinear regression if $g(x, \theta_1, \theta_2, \ldots)$ is not linear in at least one of $\theta_1, \theta_2, \ldots$ (More difficult!)

EXAMPLE 51 – LINEAR AND NONLINEAR REGRESSION

Linear regression: $\mu_{Y|x} = \theta_1 + \theta_2 x + \theta_3 x^2 + \theta_4 e^{x}$

Nonlinear regression: $\mu_{Y|x} = \theta_1 e^{\theta_2 x}$

Note: Nonlinearity of the mapping $x \mapsto \mu_{Y|x}$ is not the same as nonlinearity of regression.

HWNTHI:

Why is it called "Regression"? (Hint: Do a web search with keywords Francis Galton, regression.)

Who discovered regression? (Hint: Do a web search with keywords Gauss, Legendre, Ceres, regression.)


8.1 Linear regression = linear least squares

System structure:

$Y_i = \beta_0 + \beta_1 x_i + E_i$, where $Y_i$ and $E_i$ are random variables and $x_i$ is a number; $\mu_{Y|x_i} = \beta_0 + \beta_1 x_i$; $E[E_i] = 0$ (Why?)

True relationship between measurements $x_i, y_i$:

$\underbrace{y_i}_{\text{measured}} = \beta_0 + \beta_1 \underbrace{x_i}_{\text{measured}} + \underbrace{\epsilon_i}_{\text{impossible to measure}}$ (110)

Task: Estimate $\beta_0, \beta_1$. (Point estimates and confidence intervals)

Model structure:

$\hat{y}_i = b_0 + b_1 x_i$ (111)

To estimate $\beta_0, \beta_1$, minimize sum of squared errors (SSE):

$\min_{b_0, b_1} \text{SSE} := \min_{b_0, b_1} \sum_{i=1}^n e_i^2 = \min_{b_0, b_1} \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2$ (112)

The result of the minimization is (How?)

$\hat{\beta}_1 = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ (113)

[Plots: curve of regression $\mu_{Y|x}$ vs. $x$; measured data $y_1,\ldots,y_6$ at $x_1,\ldots,x_6$]
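Eqn. (113) takes a few lines in Matlab (base functions only; a sketch with hypothetical data, not from the notes):

Matlab code:
x = [1 2 3 4 5 6]'; y = [2.1 3.9 6.2 7.8 10.1 12.2]';   % hypothetical data
n = numel(x);
b1 = (n*sum(x.*y) - sum(x)*sum(y)) / (n*sum(x.^2) - sum(x)^2);  % eqn. (113)
b0 = mean(y) - b1*mean(x);
% same result via the built-in polynomial fit:
p = polyfit(x, y, 1);   % p = [b1 b0]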


How can the above model be used?

Theorem 25 – Best guess of $y$ given $x$

$\hat{\mu}_{Y|x} = \hat{\beta}_0 + \hat{\beta}_1 x$ (114)

EXAMPLE 52 – BEST GUESS FOR GIVEN HUMIDITY LEVEL

$\hat{y} = \hat{\mu}_{Y|x=50} = 13.64 - (0.08)(50) = 9.64\%$

Is the above model reasonable?

Definition 41 – Residuals

$r_i := \underbrace{y_i}_{\text{measured}} - \underbrace{\hat{y}_i}_{\text{model fit}} = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$ (115)

What is the difference between error and residual?

For eqn. $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ to be valid, residuals must be "white" (What is white?)

EXAMPLE 53 – RESIDUALS IN EXAMPLE: ARE THEY WHITE?

Residuals appear white.

[Plot: measurements, model fit, and residuals (wt %) vs. measurement point $i$]


EXAMPLE 54 – NON-WHITE RESIDUALS

[Panels of residual plots vs. measurement point $i$, illustrating non-white patterns: drifting, patterning, growing noise, autocorrelation, and an outlier]


EXAMPLE 55 – THE DANGER OF A SEEMINGLY GOOD FIT18

- Experimental data are available19 for ethane vapor pressure, $P$, as a function of temperature, $T$.
- Is it appropriate to model these data using the Clausius-Clapeyron equation?

$\ln P = A + \dfrac{B}{T}$

Using Excel $\Rightarrow$ The fit $\ln P = 21.816 - 1919.5\,\dfrac{1}{T}$ with $R^2 = 0.999$ appears very good. However...

Residuals not white.

[Plots: $\ln P$ vs. $1/T$ with fitted line $y = -1919.5x + 21.816$, $R^2 = 0.999$; residuals $\ln P - \widehat{\ln P}$ vs. $1/T$]

18 Adapted from M. Shacham, M. B. Cutlip, Michael Elly (2009). "Beware of Errors in Numerical Problem-Solving", Chemical Engineering Progress, 105(11), pp. 21-25.
19 D. G. Friend, H. Ingham, J. F. Ely (1991). "Thermophysical Properties of Ethane", Journal of Physical and Chemical Reference Data, 20(2), pp. 275-347.


At 92 K, $\widehat{\ln P} = 0.53$, $\ln P = 0.95$ $\Rightarrow$ 80% error!

- Is it appropriate to use the Riedel equation?

$\ln P = A + \dfrac{B}{T} + C\ln T + DT^E$

Data fit (min SSE) using the Excel Solver yields an almost perfect fit with

$\hat{A} = 35.38485$
$\hat{B} = -2280.99$
$\hat{C} = -2.20194$
$\hat{D} = -38.2766$
$\hat{E} = -211.821$

and residuals better than before, albeit not white.

[Plots: $\ln P$ vs. $1/T$ with Riedel fit; residuals $\ln P - \widehat{\ln P}$ vs. $1/T$]


Excel code:

T (K) | P (Pa) | 1/T | lnP | lnPhat | Residual (CC) | lnPRiedel | Residual (Riedel)
92 | 1.7 | =1/A2 | =LN(B2) | =-1919.5*D2+21.816 | =E2-F2 | =A+B/A2+C_*LN(A2)+D*A2^E | =E2-I2
94 | 2.8 | =1/A3 | =LN(B3) | =-1919.5*D3+21.816 | =E3-F3 | =A+B/A3+C_*LN(A3)+D*A3^E | =E3-I3
96 | 4.6 | =1/A4 | =LN(B4) | =-1919.5*D4+21.816 | =E4-F4 | =A+B/A4+C_*LN(A4)+D*A4^E | =E4-I4
98 | 7.2 | =1/A5 | =LN(B5) | =-1919.5*D5+21.816 | =E5-F5 | =A+B/A5+C_*LN(A5)+D*A5^E | =E5-I5
100 | 11 | =1/A6 | =LN(B6) | =-1919.5*D6+21.816 | =E6-F6 | =A+B/A6+C_*LN(A6)+D*A6^E | =E6-I6
102 | 17 | =1/A7 | =LN(B7) | =-1919.5*D7+21.816 | =E7-F7 | =A+B/A7+C_*LN(A7)+D*A7^E | =E7-I7
… | … | … | … | … | … | … | …
300 | 4356000 | =1/A106 | =LN(B106) | =-1919.5*D106+21.816 | =E106-F106 | =A+B/A106+C_*LN(A106)+D*A106^E | =E106-I106
302 | 4543000 | =1/A107 | =LN(B107) | =-1919.5*D107+21.816 | =E107-F107 | =A+B/A107+C_*LN(A107)+D*A107^E | =E107-I107
304 | 4738000 | =1/A108 | =LN(B108) | =-1919.5*D108+21.816 | =E108-F108 | =A+B/A108+C_*LN(A108)+D*A108^E | =E108-I108

Named cells (adjusted by Solver): A = 35.3848536221525, B = -2280.98509778099, C_ = -2.20193803114202, D = -38.2766357628476, E = -211.82107144986

Objective minimized by Solver: =SUMSQ(J2:J108)
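The same fit can be done in Matlab with the base minimizer fminsearch. The following is a sketch only, assuming the tabulated data are already in workspace vectors T and lnP, with the Clausius-Clapeyron fit as starting point:

Matlab code:
% Riedel model and sum of squared errors
riedel = @(c,T) c(1) + c(2)./T + c(3)*log(T) + c(4)*T.^c(5);
sse = @(c) sum((lnP - riedel(c,T)).^2);
c0 = [21.816 -1919.5 0 0 1];       % start near the Clausius-Clapeyron fit
chat = fminsearch(sse, c0, optimset('MaxFunEvals',1e5,'MaxIter',1e5));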


8.2 Properties of least-squares estimators

Definition 42 – Parameters, estimators for parameters, and estimates of parameters in simple linear regression

$\beta_0, \beta_1$: Numbers (true values of parameters to be estimated)

$B_0, B_1$: Estimators for $\beta_0, \beta_1$

$\hat{\beta}_0, \hat{\beta}_1$: Estimates of $\beta_0, \beta_1$ = optimal values of $B_0, B_1$

Theorem 26 – Distribution of estimators in simple linear regression

$Y_i = \beta_0 + \beta_1 x_i + E_i, \quad E_i \sim N(0, \sigma^2)$

$B_1 = \dfrac{n\sum x_i Y_i - \sum x_i \sum Y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \quad B_0 = \bar{Y} - B_1\bar{x}$

$\Rightarrow$

$B_1 \sim N\left(\beta_1,\; \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)$ and $B_0 \sim N\left(\beta_0,\; \dfrac{\sigma^2 \sum_{i=1}^n x_i^2}{n\sum_{i=1}^n (x_i - \bar{x})^2}\right)$

Note: $B_1, B_0$ are linear combinations of $Y_i$

Theorem 27 – Estimator of variance of measurement noise

$S^2 := \dfrac{\text{SSE}}{n - 2}$ is an estimator of $\sigma^2$ (116)


Definition 43 – Widely used notation for linear regression

$s_{xx} := \sum_{i=1}^n (x_i - \bar{x})^2$

$s_{yy} := \sum_{i=1}^n (y_i - \bar{y})^2$ or $S_{yy} := \sum_{i=1}^n (Y_i - \bar{Y})^2$ (What is the difference?)

$s_{xy} := \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$ or $S_{xy} := \sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})$ (What is the difference?)

$\text{sse} := \sum_i (y_i - b_0 - b_1 x_i)^2$ or $\text{SSE} := \sum_i (Y_i - B_0 - B_1 x_i)^2$ (What is the difference?)

Theorem 28 – Variance of parameter estimators in simple linear regression

$\text{SSE} = S_{yy} - 2B_1 S_{xy} + B_1^2 S_{xx} = S_{yy} - B_1 S_{xy}$

$B_1 = \dfrac{S_{xy}}{S_{xx}}$

$\sigma_{B_1}^2 = \dfrac{\sigma^2}{S_{xx}}$

$\sigma_{B_0}^2 = \dfrac{\sigma^2 \sum x_i^2}{n\,S_{xx}}$


8.3 Confidence intervals in least squares for straight line

Theorem 29 – Confidence interval for $\beta_1$ (slope)

$P\left[B_1 - t_{\alpha/2}\dfrac{S}{\sqrt{S_{xx}}} \le \beta_1 \le B_1 + t_{\alpha/2}\dfrac{S}{\sqrt{S_{xx}}}\right] = 1 - \alpha$ (117)

Proof: $T_{n-2} = \dfrac{B_1 - \beta_1}{S/\sqrt{S_{xx}}}$ is T-distributed with $n - 2$ DOF ...

EXAMPLE 56 – DOES X HAVE AN EFFECT ON Y IN EXAMPLE 51?

|0?

13.64 0.08Y xsignificantly

x

0 1

1 1

: 0

: 0

H

H

1

2

7150.05

63.89

572.44

18.09

SSE0.79

2

xx

yy

xy

yy xy

S

S

S

SSE S b S

Sn

Therefore observed value of 23T is 1 7.62

xx

bts S

237.62 0.0005 0.001P T P


Theorem 30 – Confidence interval for $\beta_0$ (intercept)

$P\left[B_0 - t_{\alpha/2}\,S\sqrt{\dfrac{\sum x_i^2}{n\,S_{xx}}} \le \beta_0 \le B_0 + t_{\alpha/2}\,S\sqrt{\dfrac{\sum x_i^2}{n\,S_{xx}}}\right] = 1 - \alpha$ (118)

Proof: $\dfrac{B_0 - \beta_0}{S\sqrt{\sum x_i^2/(n\,S_{xx})}}$ is T-distributed with $n - 2$ DOF ...

EXAMPLE 57 – DOES THE STRAIGHT LINE IN EXAMPLE 51 CROSS (0,0)?

$P[-0.109 \le \beta_1 \le -0.051] = P\left[-0.08 - 2.807\dfrac{\sqrt{0.79}}{\sqrt{7150.05}} \le \beta_1 \le -0.08 + 2.807\dfrac{\sqrt{0.79}}{\sqrt{7150.05}}\right] = 0.99$


Theorem 31 – Confidence interval on $\mu_{Y|x}$ (predicted average value of $Y$ given $x$)

$P\left[\hat{\mu}_{Y|x} - t_{\alpha/2}\,S\sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{S_{xx}}} \le \mu_{Y|x} \le \hat{\mu}_{Y|x} + t_{\alpha/2}\,S\sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{S_{xx}}}\right] = 1 - \alpha$ (119)

Proof: $\dfrac{\hat{\mu}_{Y|x} - \mu_{Y|x}}{S\sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{S_{xx}}}}$ is T-distributed with $n - 2$ DOF ...

Theorem 32 – Confidence interval on $Y|x$ (predicted single value of $Y$ given $x$)

$P\left[\hat{Y}(x) - t_{\alpha/2}\,S\sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{S_{xx}}} \le Y(x) \le \hat{Y}(x) + t_{\alpha/2}\,S\sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{S_{xx}}}\right] = 1 - \alpha$ (120)

Proof: $\dfrac{Y(x) - \hat{Y}(x)}{S\sqrt{1 + \dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{S_{xx}}}}$ is T-distributed with $n - 2$ DOF ...


EXAMPLE 58 – 90% CONFIDENCE INTERVAL FOR PREDICTIONS IN EXAMPLE 51

Eqn. (119) $\Rightarrow$

$\mu_{Y|x} = 13.6 - 0.08x \pm 1.52\sqrt{\dfrac{1}{25} + 0.000140(x - 52.6)^2}$

with 90% confidence. Eqn. (120) $\Rightarrow$

$Y|x = 13.6 - 0.08x \pm 1.52\sqrt{\dfrac{26}{25} + 0.000140(x - 52.6)^2}$

with 90% confidence.

[Plots: solvent evaporation (wt %) vs. relative humidity (%), with 90% confidence interval on $\mu_{Y|x}$ (left) and on $Y|x$ (right)]


EXAMPLE 59 – SIMPLE LINEAR REGRESSION: EFFECT OF CAR WEIGHT ON MILEAGE

Car #, i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Weight, tons, $x_i$ | 1.35 | 1.9 | 1.7 | 1.8 | 1.3 | 2.05 | 1.6 | 1.8 | 1.85 | 1.4
Miles per gallon, $y_i$ | 17.9 | 16.5 | 16.4 | 16.8 | 18.8 | 15.5 | 17.5 | 16.4 | 15.9 | 18.3

- Eqn. (113) $\Rightarrow$ $\hat{\beta}_1 = -4.03$, $\hat{\beta}_0 = 23.76$ $\Rightarrow$ $\hat{y} = \hat{\mu}_{Y|x} = 23.76 - 4.03x$

- What is best guess for average mileage of cars weighing 1.7 tons?

Eqn. (114) $\Rightarrow$ $\hat{y} = \hat{\mu}_{Y|x=1.7} = 23.76 - (4.03)(1.7) = 16.9$ miles per gallon

- What is 90%-confidence interval for average mileage of cars weighing 1.7 tons?

Eqn. (119) $\Rightarrow$ $\mu_{Y|x} = 23.76 - 4.03x \pm 0.657\sqrt{\dfrac{1}{10} + 1.72(x - 1.675)^2}$ $\Rightarrow$ $\mu_{Y|x=1.7} = 16.9 \pm 0.2$

- What is 90%-confidence interval for mileage of my car, if it weighs 1.7 tons?

Eqn. (120) $\Rightarrow$ $Y|x = 23.76 - 4.03x \pm 0.657\sqrt{\dfrac{11}{10} + 1.72(x - 1.675)^2}$ $\Rightarrow$ $Y(x = 1.7) = 16.9 \pm 0.7$

[Plots: car mileage (mpg) vs. car weight (tons) with fitted line; residuals vs. measurement point $i$; 90% confidence bands on $\mu_{Y|x}$ and on $Y|x$]
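A sketch reproducing the point estimates and the 90% interval at $x = 1.7$ tons in Matlab (tinv assumes the Statistics Toolbox):

Matlab code:
x = [1.35 1.9 1.7 1.8 1.3 2.05 1.6 1.8 1.85 1.4]';
y = [17.9 16.5 16.4 16.8 18.8 15.5 17.5 16.4 15.9 18.3]';
n = numel(x);
b1 = (n*sum(x.*y) - sum(x)*sum(y)) / (n*sum(x.^2) - sum(x)^2);  % -4.03
b0 = mean(y) - b1*mean(x);                                      % 23.76
Sxx = sum((x - mean(x)).^2);
s = sqrt(sum((y - b0 - b1*x).^2) / (n - 2));
t = tinv(0.95, n - 2);                         % 90% two-sided
x0 = 1.7;
h = t*s*sqrt(1/n + (x0 - mean(x))^2/Sxx);      % eqn. (119) half-width, ~0.2
fprintf('mu_{Y|x=1.7} = %.1f +/- %.1f\n', b0 + b1*x0, h)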


8.4 Repeated measurements and lack of fit

- How do we know that $Y$ depends linearly on $X$?

a. Look at residuals! (cf. previous examples)
b. To detect lack of fit, one needs multiple measurements @ each $x_i$, and hypothesis testing with

$H_0$: linear regression model appropriate
$H_1$: linear regression model not appropriate

$x_1$        $x_2$        …   $x_k$
$Y_{11}$     $Y_{21}$     …   $Y_{k1}$
$Y_{12}$     $Y_{22}$     …   $Y_{k2}$
⋮            ⋮                ⋮
$Y_{1n_1}$   $Y_{2n_2}$   …   $Y_{kn_k}$

Definition 44 – Auxiliary sums of squares

pure error: $\mathrm{SSE_{pe}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar Y_i\right)^2$, lack of fit: $\mathrm{SSE_{lf}} = \mathrm{SSE} - \mathrm{SSE_{pe}}$

Theorem 33 – Variable with known distribution if $H_0$ is valid

$$F = \frac{\mathrm{SSE_{lf}}/(k-2)}{\mathrm{SSE_{pe}}/(n-k)}$$

is F-distributed with $k-2$ and $n-k$ DOF.

- Reject $H_0$ if the F-ratio is too large to have occurred by chance.


EXAMPLE 60 – DOES REACTION YIELD DEPEND LINEARLY ON TEMPERATURE?

$$F = \frac{\mathrm{SSE_{lf}}/(5-2)}{\mathrm{SSE_{pe}}/(15-5)} = \frac{52.13/3}{6.453/10} = 26.9,$$

F-distributed with (3,10) DOF ⇒ reject $H_0$ with $P < 0.01$.

           $x_1$   $x_2$   $x_3$   $x_4$   $x_5$
           30      40      50      60      70
$y_{i1}$   13.7    15.5    18.5    17.7    15
$y_{i2}$   14      16      20      18.1    15.6
$y_{i3}$   14.6    17      21.1    18.5    16.5
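A minimal Matlab sketch of the lack-of-fit calculation (base Matlab; the variable names are mine):

    x = [30 40 50 60 70]';                          % k = 5 temperature levels
    Y = [13.7 14.0 14.6; 15.5 16.0 17.0; 18.5 20.0 21.1; ...
         17.7 18.1 18.5; 15.0 15.6 16.5];           % Y(i,j) = j-th yield at x(i)
    [k, m] = size(Y);  n = k*m;                     % n = 15 runs in total
    xa = repmat(x, m, 1);  ya = Y(:);               % stacked data
    b  = [ones(n,1) xa] \ ya;                       % straight-line fit
    SSE   = sum((ya - [ones(n,1) xa]*b).^2);
    SSEpe = sum(sum((Y - repmat(mean(Y,2),1,m)).^2));  % pure error = 6.453
    SSElf = SSE - SSEpe;                               % lack of fit = 52.13
    F = (SSElf/(k-2))/(SSEpe/(n-k))                    % about 26.9, F(3,10)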


8.5 Correlation

Why look at correlation?

- Find connections between random variables
- Assess mathematical model quality by finding the correlation between model predictions and measurements

Theorem 34 – Estimator of the (Pearson) correlation coefficient

$$\hat R = \hat\rho = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \tag{121}$$

is an estimator of $\rho = \dfrac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$.

EXAMPLE 61 – CORRELATION BETWEEN METHODS FOR NITRATES MEASUREMENT

Measurement #, i   1   2   3    4   5    6    7    8    9    10
Method 1: x_i      25  40  120  75  150  300  270  400  450  575
Method 2: y_i      30  80  150  80  200  350  240  320  470  583

Both $x_i$ and $y_i$ include measurement error.

Eqn. (121) ⇒ $\hat\rho = \hat r = \dfrac{s_{xy}}{\sqrt{s_{xx}s_{yy}}} = 0.978$

What does this mean?
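The number can be checked with a couple of lines of Matlab (a sketch; corrcoef is base Matlab):

    x = [25 40 120 75 150 300 270 400 450 575]';
    y = [30 80 150 80 200 350 240 320 470 583]';
    r = sum((x-mean(x)).*(y-mean(y))) / ...
        sqrt(sum((x-mean(x)).^2)*sum((y-mean(y)).^2))  % 0.978; same as corrcoef(x,y)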

[Scatter plot of $y$ vs. $x$ for the two methods.]


Definition 45 – Coefficient of determination for the regression model $Y = f(x) + \text{noise}$

$$R^2 = 1 - \frac{\mathrm{SSE}}{S_{yy}} \tag{122}$$

The coefficient of determination is the fraction of the variability of $Y$ explained by the linear regression model.

Theorem 35 – The coefficient of determination is the squared correlation coefficient between measured and model-generated $Y$

$$R^2 = \frac{S_{y\hat y}^2}{S_{yy}\,S_{\hat y\hat y}} \tag{123}$$

EXAMPLE 62 – COEFFICIENT OF DETERMINATION FOR EXAMPLE 20, EXAMPLE 59

$y_i$ = measurement, $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ = model-generated (Definition 43 – Widely used notation for linear regression, p. 106).

For EXAMPLE 20, p. 23: $R^2 = 1 - \dfrac{18.061}{63.89} = 0.72$

For EXAMPLE 59, p. 111: $R^2 = 1 - \dfrac{0.9993}{10.46} = 0.90$

- Is the model for EXAMPLE 20 more linear than the model for EXAMPLE 59?

EXAMPLE 63 – COEFFICIENT OF DETERMINATION FOR EXAMPLE 55, P. 100

$$R^2 = 1 - \frac{0.11095}{1611.4} = 0.99993$$

[Plot: $\widehat{\ln P}$ vs. $\ln P$, essentially on the diagonal.]


8.6 Multiple linear regression

Generalize common ideas for linear regression via a vector/matrix approach:

- Polynomial regression: $\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k$
- Multivariate linear regression, linear model: $\mu_{Y|x_1,\dots,x_k} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
- General linear regression, nonlinear model: $\mu_{Y|x_1,\dots,x_m} = \beta_0 + \beta_1\varphi_1(x_1,\dots,x_m) + \dots + \beta_k\varphi_k(x_1,\dots,x_m)$

In all linear regression problems the solution can easily be found using matrix algebra.

[Plot: data $(x, y)$ with fitted quadratic $\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_2 x^2$.]


8.6.1 General least squares

- Model: $\mu_{Y|x_1,\dots,x_m} = \beta_0 + \beta_1\varphi_1(x_1,\dots,x_m) + \dots + \beta_k\varphi_k(x_1,\dots,x_m)$

- Random sample: $\{(x_{1i},\dots,x_{mi},\ Y_i|x_{1i},\dots,x_{mi}) : i = 1,\dots,n\}$

$$Y_i = \beta_0 + \beta_1\varphi_1(x_{1i},\dots,x_{mi}) + \dots + \beta_k\varphi_k(x_{1i},\dots,x_{mi}) + E_i,\qquad E_i \sim N(0,\sigma^2),\quad i = 1,\dots,n$$
$$y_i = \beta_0 + \beta_1\varphi_1(x_{1i},\dots,x_{mi}) + \dots + \beta_k\varphi_k(x_{1i},\dots,x_{mi}) + e_i,\qquad i = 1,\dots,n$$

or

$$\mathbf y = \mathbf X\boldsymbol\beta + \mathbf e \tag{124}$$

Definition 46 – General least-squares minimization to estimate $\boldsymbol\beta$

$$\underset{\hat\beta_0,\dots,\hat\beta_k}{\text{minimize}}\ \mathrm{SSE} = \sum_{i=1}^n \hat e_i^2 = \hat{\mathbf e}^T\hat{\mathbf e} = \|\mathbf y - \mathbf X\hat{\boldsymbol\beta}\|_2^2 \tag{125}$$

Theorem 36 – General least-squares parameter estimator

The solution of eqn. (125) is

$$\hat{\boldsymbol\beta} = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y \quad\Leftrightarrow\quad (\mathbf X^T\mathbf X)\hat{\boldsymbol\beta} = \mathbf X^T\mathbf y \tag{126}$$

HWNTHI: What are $\mathbf X$, $\mathbf y$, $\hat{\boldsymbol\beta}$? What are $\mathbf X$, $\mathbf y$, $\hat{\boldsymbol\beta}$ for simple (straight-line) linear regression?

Theorem 37 – Distribution of the general least-squares parameter estimator

$\hat{\boldsymbol\beta}$ follows a multivariate normal distribution with

$$E[\hat{\boldsymbol\beta}] = \boldsymbol\beta \tag{127}$$

$$\mathrm{Cov}(\hat{\boldsymbol\beta}) = E[(\hat{\boldsymbol\beta}-\boldsymbol\beta)(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T] = \sigma_{\text{noise}}^2\,(\underbrace{\mathbf X^T\mathbf X}_{\text{Information matrix}})^{-1} \tag{128}$$

Note: Eqn. (128) suggests an objective for design of experiments: make $(\mathbf X^T\mathbf X)^{-1}$ "small".

⇒ Basis for Design of Experiments
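In Matlab, Theorems 36 and 37 amount to a few matrix operations. A minimal sketch (the function name gls_fit is mine; X and y are whatever eqn. (124) produces for the model at hand):

    function [beta_hat, Cov, s2] = gls_fit(X, y)
    % General least squares (eqn (126)) with covariance estimate (eqn (128)).
    % X: n-by-(k+1) matrix of regressor columns; y: n-by-1 measurement vector.
    beta_hat = (X'*X) \ (X'*y);               % eqn (126); equivalently X\y
    e   = y - X*beta_hat;                     % residuals
    s2  = (e'*e)/(size(X,1) - size(X,2));     % eqn (131): SSE/(n-(k+1))
    Cov = s2 * inv(X'*X);                     % eqn (128), sigma_noise^2 ~ s2
    end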


- Estimate of output value:

$$\hat y(\mathbf x) = \hat\mu_{Y|x_1,\dots,x_m} = \hat\beta_0 + \hat\beta_1\varphi_1(x_1,\dots,x_m) + \dots + \hat\beta_k\varphi_k(x_1,\dots,x_m) \tag{129}$$

with confidence intervals

$$\hat y(\mathbf x) - t_{\alpha/2}\,s\sqrt{1 + \mathbf x^T(\mathbf X^T\mathbf X)^{-1}\mathbf x} \le y(\mathbf x) \le \hat y(\mathbf x) + t_{\alpha/2}\,s\sqrt{1 + \mathbf x^T(\mathbf X^T\mathbf X)^{-1}\mathbf x}$$
$$\hat y(\mathbf x) - t_{\alpha/2}\,s\sqrt{\mathbf x^T(\mathbf X^T\mathbf X)^{-1}\mathbf x} \le \mu_{Y|\mathbf x} \le \hat y(\mathbf x) + t_{\alpha/2}\,s\sqrt{\mathbf x^T(\mathbf X^T\mathbf X)^{-1}\mathbf x} \tag{130}$$

where $t_{\alpha/2}$ follows the T-distribution with $n-(k+1)$ degrees of freedom.

- Estimate of noise variance:

$$\hat\sigma_{\text{noise}}^2 = s^2 = \frac{\mathrm{SSE}}{n-(k+1)} = \frac{\sum_{i=1}^n \hat e_i^2}{n-(k+1)} = \frac{\|\mathbf y - \mathbf X\hat{\boldsymbol\beta}\|_2^2}{n-(k+1)} \tag{131}$$

HWNTHI: What is the difference between the two formulas in eqn. (130)? Does it make sense?
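Continuing the sketch above, the two intervals of eqn. (130) differ only by the 1 under the square root (the function name predict_ci is mine):

    function [y0, hw_mu, hw_Y] = predict_ci(X, beta_hat, s2, x0, t)
    % Eqn (130): prediction y0 and half-widths of the interval on mu_{Y|x0}
    % (hw_mu) and on a single future value Y|x0 (hw_Y).  x0 is a 1-by-(k+1)
    % row of regressors; t = t_{alpha/2} with n-(k+1) DOF.
    y0    = x0 * beta_hat;                    % eqn (129)
    q     = (x0 / (X'*X)) * x0';              % x0 (X'X)^{-1} x0'
    hw_mu = t * sqrt(s2 * q);
    hw_Y  = t * sqrt(s2 * (1 + q));
    end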


8.6.2 Polynomial least squares

- Model: $Y|x = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k + E$

- Random sample: $\{(x_1, Y_1|x_1),\dots,(x_n, Y_n|x_n)\}$

$$Y_i = \beta_0 + \beta_1 x_i + \dots + \beta_k x_i^k + E_i,\qquad E_i \text{ independent},\ E_i \sim N(0,\sigma^2),\quad i = 1,2,\dots,n$$
$$y_i = \beta_0 + \beta_1 x_i + \dots + \beta_k x_i^k + e_i,\qquad i = 1,2,\dots,n$$

or

$$\mathbf y = \mathbf X\boldsymbol\beta + \mathbf e \tag{132}$$

Definition 47 – Polynomial least-squares minimization to estimate $\boldsymbol\beta$

$$\underset{\hat\beta_0,\dots,\hat\beta_k}{\text{minimize}}\ \mathrm{SSE} = \sum_{i=1}^n \hat e_i^2 = \hat{\mathbf e}^T\hat{\mathbf e} = \|\mathbf y - \mathbf X\hat{\boldsymbol\beta}\|_2^2 \tag{133}$$

Theorem 38 – Polynomial least-squares parameter estimator

The solution of eqn. (133) is

$$\hat{\boldsymbol\beta} = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y \quad\Leftrightarrow\quad (\mathbf X^T\mathbf X)\hat{\boldsymbol\beta} = \mathbf X^T\mathbf y \tag{134}$$

Eqn. (134) ⇒

$$\mathbf X^T\mathbf X = \begin{bmatrix} n & \sum x_i & \cdots & \sum x_i^k \\ \sum x_i & \sum x_i^2 & \cdots & \sum x_i^{k+1} \\ \vdots & \vdots & & \vdots \\ \sum x_i^k & \sum x_i^{k+1} & \cdots & \sum x_i^{2k} \end{bmatrix},\qquad \mathbf X^T\mathbf y = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^k y_i \end{bmatrix},\qquad (\mathbf X^T\mathbf X)\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\vdots\\\hat\beta_k\end{bmatrix} = \mathbf X^T\mathbf y$$

- Estimate:

$$\hat y = \hat\mu_{Y|x} = \hat\beta_0 + \hat\beta_1 x + \dots + \hat\beta_k x^k \tag{135}$$


EXAMPLE 64 – POLYNOMIAL LINEAR REGRESSION

$\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_2 x^2$

x_i   5     5     10    10    15    15    20    20    25    25
y_i   14.0  12.5  7.0   5.0   2.1   1.8   6.2   4.9   13.2  14.6

$$\begin{bmatrix} n & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{bmatrix}\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\hat\beta_2\end{bmatrix} = \begin{bmatrix}\sum y_i\\\sum x_i y_i\\\sum x_i^2 y_i\end{bmatrix}$$

$$\begin{bmatrix} 10 & 150 & 2750 \\ 150 & 2750 & 56{,}250 \\ 2750 & 56{,}250 & 1{,}223{,}750 \end{bmatrix}\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\hat\beta_2\end{bmatrix} = \begin{bmatrix}81.3\\1228\\24{,}555\end{bmatrix}$$

$$\Rightarrow\ \hat\beta_0 = 27.3,\qquad \hat\beta_1 = -3.313,\qquad \hat\beta_2 = 0.111$$

[Plot: data points and fitted parabola, $y$ vs. $x$.]
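A minimal Matlab check of this example (polyfit is base Matlab):

    x = [5 5 10 10 15 15 20 20 25 25]';
    y = [14.0 12.5 7.0 5.0 2.1 1.8 6.2 4.9 13.2 14.6]';
    X = [ones(size(x)) x x.^2];               % columns 1, x, x^2
    beta_hat = (X'*X) \ (X'*y)                % approx. [27.3; -3.313; 0.111]
    % polyfit(x, y, 2) returns the same coefficients, highest power first.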


8.6.3 Multiple linear least squares

- Model: $\mu_{Y|x_1,\dots,x_k} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$

- Random sample: $\{(x_{1i},\dots,x_{ki},\ Y_i|x_{1i},\dots,x_{ki})\}$

$$Y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} + E_i,\qquad E_i \sim N(0,\sigma^2),\quad i = 1,2,\dots,n$$
$$y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} + e_i,\qquad i = 1,2,\dots,n$$

or

$$\mathbf y = \mathbf X\boldsymbol\beta + \mathbf e \tag{136}$$

Definition 48 – Multiple linear least-squares minimization to estimate $\boldsymbol\beta$

$$\underset{\hat\beta_0,\dots,\hat\beta_k}{\text{minimize}}\ \mathrm{SSE} = \sum_{i=1}^n \hat e_i^2 = \hat{\mathbf e}^T\hat{\mathbf e} = \|\mathbf y - \mathbf X\hat{\boldsymbol\beta}\|_2^2 \tag{137}$$

Theorem 39 – Multiple linear least-squares parameter estimator

The solution of eqn. (137) is

$$\hat{\boldsymbol\beta} = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y \quad\Leftrightarrow\quad (\mathbf X^T\mathbf X)\hat{\boldsymbol\beta} = \mathbf X^T\mathbf y \tag{138}$$

$$\mathbf X^T\mathbf X = \begin{bmatrix} n & \sum x_{1i} & \cdots & \sum x_{ki} \\ \sum x_{1i} & \sum x_{1i}^2 & \cdots & \sum x_{1i}x_{ki} \\ \vdots & \vdots & & \vdots \\ \sum x_{ki} & \sum x_{ki}x_{1i} & \cdots & \sum x_{ki}^2 \end{bmatrix},\qquad \mathbf X^T\mathbf y = \begin{bmatrix} \sum y_i \\ \sum y_i x_{1i} \\ \vdots \\ \sum y_i x_{ki} \end{bmatrix}$$

- Estimate:

$$\hat y = \hat\mu_{Y|\mathbf x} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k$$


EXAMPLE 65 – MULTIPLE LINEAR LEAST SQUARES

y     17.9  16.5  16.4  16.8  18.8  15.5  17.5  16.4  15.9  18.3
x_1   1.35  1.90  1.70  1.80  1.30  2.05  1.60  1.80  1.85  1.40
x_2   90    30    80    40    35    45    50    60    65    30

$$\begin{bmatrix} n & \sum x_{1i} & \sum x_{2i} \\ \sum x_{1i} & \sum x_{1i}^2 & \sum x_{1i}x_{2i} \\ \sum x_{2i} & \sum x_{2i}x_{1i} & \sum x_{2i}^2 \end{bmatrix}\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\hat\beta_2\end{bmatrix} = \begin{bmatrix}\sum y_i\\\sum x_{1i}y_i\\\sum x_{2i}y_i\end{bmatrix}$$

$$\begin{bmatrix} 10 & 16.75 & 525 \\ 16.75 & 28.6375 & 874.5 \\ 525 & 874.5 & 31{,}475 \end{bmatrix}\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\hat\beta_2\end{bmatrix} = \begin{bmatrix}170\\282.405\\8887.0\end{bmatrix}$$

$$\Rightarrow\ \hat\beta_0 = 24.75,\qquad \hat\beta_1 = -4.16,\qquad \hat\beta_2 = -0.0149$$
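The same numbers follow from a direct Matlab solution of the normal equations (a minimal sketch):

    y  = [17.9 16.5 16.4 16.8 18.8 15.5 17.5 16.4 15.9 18.3]';
    x1 = [1.35 1.90 1.70 1.80 1.30 2.05 1.60 1.80 1.85 1.40]';
    x2 = [90 30 80 40 35 45 50 60 65 30]';
    X  = [ones(10,1) x1 x2];
    beta_hat = (X'*X) \ (X'*y)                % approx. [24.75; -4.16; -0.0149]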


8.7 Confidence intervals and hypothesis testing in multiple linear regression

$$\mu_{Y|x_1,\dots,x_m} = \beta_0 + \beta_1\varphi_1(x_1,\dots,x_m) + \dots + \beta_k\varphi_k(x_1,\dots,x_m)$$

$$\underbrace{\begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}}_{\mathbf y} = \underbrace{\begin{bmatrix} 1 & \varphi_1(x_1^{(1)},\dots,x_m^{(1)}) & \cdots & \varphi_k(x_1^{(1)},\dots,x_m^{(1)}) \\ \vdots & \vdots & & \vdots \\ 1 & \varphi_1(x_1^{(n)},\dots,x_m^{(n)}) & \cdots & \varphi_k(x_1^{(n)},\dots,x_m^{(n)}) \end{bmatrix}}_{\mathbf X}\underbrace{\begin{bmatrix} b_0 \\ \vdots \\ b_k \end{bmatrix}}_{\boldsymbol\beta} + \mathbf e$$

Theorem 40 – Least-squares estimate of $\boldsymbol\beta = [\beta_0\ \cdots\ \beta_k]^T$

$$\hat{\boldsymbol\beta} = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T\mathbf y \tag{139}$$

Theorem 41 – Probability distribution of the estimator $\mathbf B$ with estimate $\hat{\boldsymbol\beta}$

$$\mathbf B \sim N_{k+1}\!\left(\boldsymbol\beta,\ \sigma_{\text{noise}}^2(\mathbf X^T\mathbf X)^{-1}\right) \tag{140}$$

- $(1-\alpha)$-confidence interval for $\beta_i$, $i = 0,\dots,k$:

$$\hat\beta_i - t_{\alpha/2}\,s\sqrt{C_{ii}} \le \beta_i \le \hat\beta_i + t_{\alpha/2}\,s\sqrt{C_{ii}} \tag{141}$$

where

$\alpha$ = 1 − confidence level (142)

$t_{\alpha/2}$ = appropriate point of the T-distribution with $n-(k+1)$ degrees of freedom (143)

$$s^2 = \frac{\mathrm{SSE}}{n-(k+1)} = \frac{\sum_{j=1}^n \hat e_j^2}{n-(k+1)} = \frac{\|\mathbf y - \mathbf X\hat{\boldsymbol\beta}\|_2^2}{n-(k+1)}\quad(\text{estimate of }\sigma_{\text{noise}}^2) \tag{144}$$

$$C_{ii} = [(\mathbf X^T\mathbf X)^{-1}]_{ii}\quad(i\text{-th diagonal element of }(\mathbf X^T\mathbf X)^{-1}) \tag{145}$$

- Prediction interval for given $\mathbf x = [x_1\ \cdots\ x_k]^T$:

$$\hat y(\mathbf x) - t_{\alpha/2}\,s\sqrt{1 + \mathbf x^T(\mathbf X^T\mathbf X)^{-1}\mathbf x} \le Y|\mathbf x \le \hat y(\mathbf x) + t_{\alpha/2}\,s\sqrt{1 + \mathbf x^T(\mathbf X^T\mathbf X)^{-1}\mathbf x} \tag{146}$$

- Because $\mathbf B \sim N_{k+1}(\boldsymbol\beta, \sigma_{\text{noise}}^2(\mathbf X^T\mathbf X)^{-1})$ by Theorem 41, by Theorem 14 the variable

$$\frac{(\mathbf B-\boldsymbol\beta)^T(\mathbf X^T\mathbf X)(\mathbf B-\boldsymbol\beta)}{\sigma_{\text{noise}}^2}$$

is $\chi^2$-distributed with $k+1$ DOF ⇒

$$P\left[(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T(\mathbf X^T\mathbf X)(\hat{\boldsymbol\beta}-\boldsymbol\beta) \le \sigma_{\text{noise}}^2\chi_\alpha^2\right] = 1-\alpha \tag{147}$$

namely the values of $\boldsymbol\beta$ are expected to be in the ellipsoid $(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T(\mathbf X^T\mathbf X)(\hat{\boldsymbol\beta}-\boldsymbol\beta) \le \sigma_{\text{noise}}^2\chi_\alpha^2$ with confidence $1-\alpha$.


If $\sigma_{\text{noise}}^2$ is unknown, it can be approximated by $s^2$, eqn. (144), or the exact formula

$$P\left[(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T(\mathbf X^T\mathbf X)(\hat{\boldsymbol\beta}-\boldsymbol\beta) \le s^2(k+1)F_\alpha(k+1,\ n-k-1)\right] = 1-\alpha \tag{148}$$

can be used, where $F_\alpha(k+1, n-k-1)$ is an appropriate point of an F-distributed variable with $(k+1, n-k-1)$ DOF (i.e. $P[X \le F_\alpha(k+1, n-k-1)] = 1-\alpha$).

EXAMPLE 66 – DOES TEMPERATURE AFFECT CAR MILEAGE?

Car number   y_i (mpg)   x_{1i} (tons)   x_{2i} (°F)
1            17.9        1.35            90
2            16.5        1.90            30
3            16.4        1.70            80
4            16.8        1.80            40
5            18.8        1.30            35
6            15.5        2.05            45
7            17.5        1.60            50
8            16.4        1.80            60
9            15.9        1.85            65
10           18.3        1.40            30

Model structure: $\mu_{Y|x_1,x_2} = \beta_1 x_1 + \beta_2 x_2 + \beta_0$

$$\mathbf X = \begin{bmatrix} 1.35 & 90 & 1 \\ \vdots & \vdots & \vdots \\ 1.40 & 30 & 1 \end{bmatrix},\qquad \mathbf X^T\mathbf X = \begin{bmatrix} 28.6375 & 874.5 & 16.75 \\ 874.5 & 31{,}475 & 525 \\ 16.75 & 525 & 10 \end{bmatrix},\qquad \mathbf X^T\mathbf y = \begin{bmatrix} 282.405 \\ 8887 \\ 170 \end{bmatrix}$$

$$(\mathbf X^T\mathbf X)^{-1} = \begin{bmatrix} 1.739 & 0.002166 & -3.026 \\ 0.002166 & 0.0002583 & -0.01719 \\ -3.026 & -0.01719 & 6.071 \end{bmatrix}$$

$$\hat{\boldsymbol\beta}^T = \mathbf b_{\text{opt}}^T = [-4.1593\ \ -0.0149\ \ 24.75]$$

$$s = \sqrt{\frac{\mathrm{SSE}}{n-(k+1)}} = \sqrt{\frac{\sum_{j=1}^{10}\hat e_j^2}{10-3}} = 0.1416$$

$$C_{11} = 1.739,\qquad C_{22} = 0.0002583,\qquad C_{33} = 6.071$$

$t_{\alpha/2} = t_{0.025} = 2.365$ for $\alpha = 0.05$ (Excel command: =TINV(0.05,7))

Therefore

$$\beta_1 = -4.159 \pm (2.365)(0.1416)\sqrt{1.739} = -4.159 \pm 0.442$$
$$\beta_2 = -0.0149 \pm (2.365)(0.1416)\sqrt{0.0002583} = -0.0149 \pm 0.0054$$
$$\beta_0 = 24.75 \pm \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_$$

⇒ Temperature does affect car mileage.

How likely is it that all three parameters $\beta_0, \beta_1, \beta_2$ will simultaneously lie close to the end of their respective confidence intervals calculated above?
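A minimal Matlab sketch reproducing the individual confidence intervals above (tinv requires the Statistics Toolbox):

    y  = [17.9 16.5 16.4 16.8 18.8 15.5 17.5 16.4 15.9 18.3]';
    x1 = [1.35 1.90 1.70 1.80 1.30 2.05 1.60 1.80 1.85 1.40]';
    x2 = [90 30 80 40 35 45 50 60 65 30]';
    X  = [x1 x2 ones(10,1)];                  % column order (x1, x2, 1), as above
    b  = (X'*X) \ (X'*y);                     % [-4.1593; -0.0149; 24.75]
    [n, p] = size(X);
    s  = sqrt(sum((y - X*b).^2)/(n-p));       % 0.1416
    C  = diag(inv(X'*X));                     % [1.739; 0.0002583; 6.071]
    hw = tinv(0.975, n-p) * s * sqrt(C);      % tinv(0.975,7) = 2.365
    [b - hw, b + hw]                          % 95% confidence intervals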


Figure 47. The ellipsoid $(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T(\mathbf X^T\mathbf X)(\hat{\boldsymbol\beta}-\boldsymbol\beta) = 1$ (cf. eqn. (147)), showing corresponding ellipses in the planes $\beta_0 - \hat\beta_0 = 0$ (continuous line), $\beta_1 - \hat\beta_1 = 0$ (dashed line), and $\beta_2 - \hat\beta_2 = 0$ (dotted line). Eqn. (147) with $\sigma_{\text{noise}}^2\chi_\alpha^2 = 1$ ⇒ $\chi_\alpha^2 = 1/0.1416^2 = 49.87 \Rightarrow \alpha \approx 0$.

HWNTHI: How should the equation $(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T(\mathbf X^T\mathbf X)(\hat{\boldsymbol\beta}-\boldsymbol\beta) = 1$ change for different values of $\alpha$?

Figure 48. Contours of a number of ellipses in the plane $\beta_0 - \hat\beta_0 = 0$ of Figure 47 (axes: $\beta_1 - \hat\beta_1$ and $\beta_2 - \hat\beta_2$).


8.8 Nonlinear regression

- Model structure: $y = f(\mathbf x, \boldsymbol\theta) + \text{noise}$ (149)
- Direct minimization of SSE (e.g., EXAMPLE 22 – Curve Fitting Using Nonlinear Regression)
- Linearization of $y = f(\mathbf x, \boldsymbol\theta)$ and use of the linear regression result as a starting point for nonlinear regression (e.g., EXAMPLE 13 – Parameters of Michaelis-Menten Kinetics (EXAMPLE 4))

EXAMPLE 67 – FLUID FLOW THROUGH PIPE

Chapra & Canale, Numerical Methods for Engineers, McGraw-Hill, 5th Ed., Case Study 20.4, p. 551. Assumed model structure:

$$Q = a_0 D^{a_1} S^{a_2} \tag{150}$$

where Q: flow rate (ft³/s); S: slope (ft/ft); D: diameter (ft); $a_0, a_1, a_2$: coefficients to determine.

Experiment   D, ft   S, ft/ft   Q, ft³/s
1            1       0.001      1.4
2            2       0.001      8.3
3            3       0.001      24.2
4            1       0.01       4.7
5            2       0.01       28.9
6            3       0.01       84.0
7            1       0.05       11.1
8            2       0.05       69.0
9            3       0.05       200

Eqn. (150) is not linear in the parameters. Linearization trick: Eqn. (150) ⇒

$$\underbrace{\log Q}_{Y} = \underbrace{\log a_0}_{\beta_0} + \underbrace{a_1}_{\beta_1}\underbrace{\log D}_{x_1} + \underbrace{a_2}_{\beta_2}\underbrace{\log S}_{x_2} \quad\Leftrightarrow\quad \mu_{Y|x_1,x_2} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

$$\underbrace{\begin{bmatrix} \log Q_1 \\ \vdots \\ \log Q_9 \end{bmatrix}}_{\mathbf y} = \underbrace{\begin{bmatrix} 1 & \log D_1 & \log S_1 \\ \vdots & \vdots & \vdots \\ 1 & \log D_9 & \log S_9 \end{bmatrix}}_{\mathbf X}\underbrace{\begin{bmatrix} \log a_0 \\ a_1 \\ a_2 \end{bmatrix}}_{\boldsymbol\beta} + \mathbf e$$

[Sketch: pipe of diameter D and slope S carrying flow Q.]


Numerically,

$$\mathbf X = \begin{bmatrix} 1 & 0 & -3 \\ 1 & 0.30103 & -3 \\ 1 & 0.477121 & -3 \\ 1 & 0 & -2 \\ 1 & 0.30103 & -2 \\ 1 & 0.477121 & -2 \\ 1 & 0 & -1.30103 \\ 1 & 0.30103 & -1.30103 \\ 1 & 0.477121 & -1.30103 \end{bmatrix},\qquad \mathbf y = \begin{bmatrix} 0.146128 \\ 0.919078 \\ 1.38382 \\ 0.672098 \\ 1.4609 \\ 1.92428 \\ 1.04532 \\ 1.83885 \\ 2.30103 \end{bmatrix}$$

$$\mathbf X^T\mathbf X = \begin{bmatrix} 9 & 2.33445 & -18.9031 \\ 2.33445 & 0.954791 & -4.90315 \\ -18.9031 & -4.90315 & 44.078 \end{bmatrix},\qquad \mathbf X^T\mathbf y = \begin{bmatrix} 11.6915 \\ 3.94623 \\ -22.2077 \end{bmatrix}$$

Eqn. (126) ⇒ $\hat{\boldsymbol\beta} = [1.74797\ \ 2.61584\ \ 0.53678]^T$ ⇒ $(\hat a_0, \hat a_1, \hat a_2) = (55.97,\ 2.616,\ 0.5368)$

Linearization is a starting point for nonlinear regression! Direct minimization of the nonlinear SSE uses numerical optimization (e.g., Excel Solver).

[Plots: residuals $\log Q_i - \widehat{\log Q}_i$ for the linearized equation, and residuals $Q_i - \hat Q_i$ for the original equation, vs. measurement point $i$.]
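Both steps fit in a few lines of Matlab (a minimal sketch; fminsearch is base Matlab):

    D = [1 2 3 1 2 3 1 2 3]';  S = [.001 .001 .001 .01 .01 .01 .05 .05 .05]';
    Q = [1.4 8.3 24.2 4.7 28.9 84.0 11.1 69.0 200]';
    X = [ones(9,1) log10(D) log10(S)];
    b = X \ log10(Q);                         % [log10(a0); a1; a2], linearized fit
    a = [10^b(1); b(2); b(3)]                 % approx. [55.97; 2.616; 0.5368]
    sse  = @(p) sum((Q - p(1)*D.^p(2).*S.^p(3)).^2);
    a_nl = fminsearch(sse, a)                 % nonlinear least squares, started at a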


EXAMPLE 68 – APPROXIMATE CONFIDENCE INTERVALS IN PARAMETER ESTIMATES FROM NONLINEAR REGRESSION (EXAMPLE 22)

Nonlinear regression objective in EXAMPLE 22:

$$\{\hat K, \hat\tau\} = \arg\min_{K,\tau}\ \underbrace{\sum_{j=1}^n \left[y_j - K(1-e^{-x_j/\tau})\right]^2}_{J(K,\tau)}$$

Approximate $J(K,\tau)$ around $\{\hat K, \hat\tau\}$ to create a linear regression problem:

$$\sum_{j=1}^n \left[y_j - K(1-e^{-x_j/\tau})\right]^2 \approx \sum_{j=1}^n \left[y_j - \hat K(1-e^{-x_j/\hat\tau}) - \underbrace{(1-e^{-x_j/\hat\tau})}_{\varphi_1(x_j)}\underbrace{(K-\hat K)}_{b_1} - \underbrace{\left(-\frac{\hat K x_j}{\hat\tau^2}e^{-x_j/\hat\tau}\right)}_{\varphi_2(x_j)}\underbrace{(\tau-\hat\tau)}_{b_2}\right]^2$$

The information matrix for the linearized regression is

$$\mathbf X^T\mathbf X = \begin{bmatrix} \sum_j \varphi_1(x_j)^2 & \sum_j \varphi_1(x_j)\varphi_2(x_j) \\ \sum_j \varphi_2(x_j)\varphi_1(x_j) & \sum_j \varphi_2(x_j)^2 \end{bmatrix} = \begin{bmatrix} 5.119 & -3.535 \\ -3.535 & 2.928 \end{bmatrix}$$

$$(\mathbf X^T\mathbf X)^{-1} = \begin{bmatrix} 1.177 & 1.421 \\ 1.421 & 2.058 \end{bmatrix} \Rightarrow C_{11} = 1.177,\quad C_{22} = 2.058$$

and $s = 0.05008$. Therefore, for $\alpha = 0.05$,

$$t_{\alpha/2}\,s\sqrt{C_{11}} = (2.306)(0.05008)\sqrt{1.177} = 0.1253 \Rightarrow K = 2.0 \pm 0.1$$
$$t_{\alpha/2}\,s\sqrt{C_{22}} = (2.306)(0.05008)\sqrt{2.058} = 0.1657 \Rightarrow \tau = 1.0 \pm 0.2$$

Do the numbers make sense?


- Note: Not both parameters $K, \tau$ can be close to the boundaries of their respective confidence intervals $(\hat K - \Delta K, \hat K + \Delta K)$, $(\hat\tau - \Delta\tau, \hat\tau + \Delta\tau)$.
- Rather, eqn. (147) ⇒ $P[(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T(\mathbf X^T\mathbf X)(\hat{\boldsymbol\beta}-\boldsymbol\beta) \le \sigma_{\text{noise}}^2\chi_\alpha^2] = 1-\alpha$.

Confidence level $1-\alpha = 0.95 \Rightarrow \chi_\alpha^2 = 5.991$ ⇒

$$P\left[(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T(\mathbf X^T\mathbf X)(\hat{\boldsymbol\beta}-\boldsymbol\beta) \le (0.05008)^2(5.991)\right] = 0.95$$

as shown in Figure 49.

Figure 49. Joint 95% confidence interval for $K, \tau$ inside the approximate ellipse $(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T\mathbf X^T\mathbf X(\hat{\boldsymbol\beta}-\boldsymbol\beta) \le \hat\sigma_{\text{noise}}^2\chi_\alpha^2 = s^2\chi_\alpha^2$ (inner ellipse, eqn. (147)) or the exact ellipse $(\hat{\boldsymbol\beta}-\boldsymbol\beta)^T\mathbf X^T\mathbf X(\hat{\boldsymbol\beta}-\boldsymbol\beta) \le s^2(k+1)F_\alpha(k+1, n-k-1)$ (outer ellipse, eqn. (148)). Note that the shaded rectangular area $(\hat K \pm \Delta K)\times(\hat\tau \pm \Delta\tau)$ is not a good representation of the joint confidence interval for $K, \tau$. Rather, each of the confidence intervals $(\hat K - \Delta K, \hat K + \Delta K)$, $(\hat\tau - \Delta\tau, \hat\tau + \Delta\tau)$ is valid for each individual variable $K$ or $\tau$ alone.


- Figure 49 suggests that $\{K,\tau\} = \{2.1, 1.2\}$ or $\{K,\tau\} = \{1.9, 0.8\}$ are likely values that fit the data well, whereas $\{K,\tau\} = \{2.1, 0.8\}$ or $\{K,\tau\} = \{1.9, 1.2\}$ are unlikely values (outside the confidence ellipses, albeit in the rectangular area $(\hat K \pm \Delta K)\times(\hat\tau \pm \Delta\tau)$) that do not fit the data well. Figure 50 confirms this claim visually.

Figure 50. Fit of the experimental data points by the equation $K(1-e^{-x/\tau})$ for likely (left) and unlikely (right) values of $K, \tau$. [Four panels: $\{K,\tau\}$ = $\{2.1, 1.2\}$, $\{2.1, 0.8\}$, $\{1.9, 0.8\}$, and $\{1.9, 1.2\}$.]


EXAMPLE 69 – CONFIDENCE INTERVALS IN PARAMETER ESTIMATES FROM NONLINEAR REGRESSION (EXAMPLE 22)

While the linear approximation used in the previous EXAMPLE 68 offers a straightforward estimate of the confidence intervals in nonlinear regression using linear regression ideas, a direct estimate of the confidence intervals can be obtained numerically as the area for the parameters $\boldsymbol\theta$ which satisfies the inequality

$$\frac{\mathrm{SSE}(\boldsymbol\theta) - \mathrm{SSE}(\hat{\boldsymbol\theta})}{\mathrm{SSE}(\hat{\boldsymbol\theta})} \le \frac{\overbrace{k+1}^{\text{number of parameters}}}{\underbrace{n}_{\text{number of data points}} - (k+1)}\,F_\alpha(k+1,\ n-(k+1)) \tag{151}$$

where $\mathrm{SSE}(\boldsymbol\theta) = \sum_{j=1}^n\left[y_j - K(1-e^{-x_j/\tau})\right]^2$. Software can be used to plot contours of confidence areas for $\boldsymbol\theta$. For example, Mathematica (function RegionPlot) easily produces the graph in Figure 51.
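A Matlab equivalent of that calculation is sketched below. Since EXAMPLE 22's data are not reproduced on these pages, synthetic data (K = 2, τ = 1, plus noise) stand in, for illustration only; finv requires the Statistics Toolbox.

    rng(0);  x = (0.6:0.6:6)';  n = numel(x);          % synthetic, illustrative data
    y = 2*(1 - exp(-x/1)) + 0.05*randn(n,1);
    SSE = @(K,tau) sum((y - K*(1 - exp(-x/tau))).^2);
    p = fminsearch(@(q) SSE(q(1),q(2)), [2; 1]);       % [Khat; tauhat]
    [K,T]  = meshgrid(1.7:0.005:2.3, 0.6:0.005:1.4);
    R      = arrayfun(SSE, K, T) / SSE(p(1),p(2));
    level  = 1 + 2/(n-2)*finv(0.95, 2, n-2);           % eqn (151), 95% region
    contour(K, T, R, [level level]);  xlabel('K');  ylabel('\tau')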

Figure 51. Joint confidence interval for $K, \tau$ inside the areas

$$\frac{\mathrm{SSE}(K,\tau) - \mathrm{SSE}(\hat K,\hat\tau)}{\mathrm{SSE}(\hat K,\hat\tau)} \le \frac{\overbrace{2}^{\text{number of parameters}}}{\underbrace{10}_{\text{number of data points}} - 2}\,F_\alpha(2,\ 10-2)$$

(eqn. (151)) for confidence levels $1-\alpha = 1 - 0.5^j$, $j = 1,\dots,7$, i.e. {0.5, 0.75, 0.875, 0.9375, 0.96875, 0.984375, 0.9921875}, corresponding to $F_\alpha(2,8)$ = {0.76, 1.7, 2.7, 4.0, 5.5, 7.3, 8.9}. Note that the shape of the ellipsoidal areas is very close to that of the ellipse in Figure 49, which relies on linear approximation.


8.9 Building models: Fitting data vs. making predictions

- When developing a model of unknown structure from available data, merely fitting the data does not guarantee good predictions by the developed model.
- Adding more and more terms (hence parameters) to the model structure will improve the data fit…
- …but the predictive ability of the model will eventually suffer.

EXAMPLE 70 – PREDICTING THE US POPULATION IN 2010

Year                  1900   1910   1920    1930    1940    1950    1960    1970    1980    1990    2000
Population (million)  76.00  91.97  105.71  123.20  131.67  150.70  179.32  203.21  226.51  249.63  281.42

- Given the above data, what would be the US population in 2010?
- Idea: Use polynomials of increasing degree n to fit the data (shown below: polynomial and $R^2$), where $x = \dfrac{\text{Year} - 1950}{50}$.
- Which one is better?

[Plot: US population (million) vs. year, 1880-2020, with the fitted polynomials below.]

n = 1: $165.455 + 101.227x$, $R^2 = 0.981366$
n = 2: $156.026 + 101.227x + 23.5723x^2$, $R^2 = 0.997969$
n = 3: $156.026 + 100.501x + 23.5723x^2 + 1.01981x^3$, $R^2 = 0.997978$
n = 4: $154.263 + 100.501x + 38.8695x^2 + 1.01981x^3 - 15.2972x^4$, $R^2 = 0.998515$
n = 5: $154.263 + 108.751x + 38.8695x^2 - 33.235x^3 - 15.2972x^4 + 27.0433x^5$, $R^2 = 0.998922$
n = 6: $152.267 + 108.751x + 78.0079x^2 - 33.235x^3 - 123.177x^4 + 27.0433x^5 + 71.4869x^6$, $R^2 = 0.99953$
n = 7: $152.267 + 112.661x + 78.0079x^2 - 67.5677x^3 - 123.177x^4 + 99.3158x^5 + 71.4869x^6 - 41.9439x^7$, $R^2 = 0.999568$
n = 8: $151.082 + 112.661x + 123.104x^2 - 67.5677x^3 - 379.21x^4 + 99.3158x^5 + 518.079x^6 - 41.9439x^7 - 234.554x^8$, $R^2 = 0.999738$


Figure 52. Polynomial fit of US population data and 95% confidence interval of model predictions. Note the prediction and 95% confidence interval for the population of 2010, marked with ×. (Mathematica function used: LinearModelFit)


Note: Why introduce $x = \dfrac{\text{Year} - 1950}{50}$? Check the condition numbers of the information matrices for polynomials in Year and in x:

n   cond(X_Year)        cond(X_x)
1   145,789             2.0402
2   2.42235 × 10^10     5.14931
3   4.24034 × 10^15     13.6257
4   7.86376 × 10^20     38.6157
5   1.57622 × 10^26     117.794
6   3.53119 × 10^31     400.884
7   9.39327 × 10^36     1608.92
8   3.43149 × 10^42     8806.48

EXAMPLE 71 – NONLINEAR REGRESSION

Continue EXAMPLE 70 – Predicting the US Population in 2010. Assume model structure $y = \exp(a + bx)$.


- How to tell what model structure to use?
  o Not trivial.
  o Cross-validation approach: Test several model structures and pick the most promising one as follows (a computational sketch follows the list).
    For a certain model structure,
      - Use part of the available data to fit the model, i.e. estimate parameter values (fitting part).
      - Use the model with the parameter estimates obtained in the fitting part to calculate predictions on the remaining data (validation part).
      - Sum up the squared errors for both the fitting and validation parts, to compute the total sum of squared errors.
    Repeat for more structures.
    Of all model structures tested, pick the one with the smallest total sum of squared errors.
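A minimal Matlab sketch of this recipe, using the EXAMPLE 70 data with the split of EXAMPLE 72 below (first 7 points for fitting, last 4 for validation):

    pop = [76.00 91.97 105.71 123.20 131.67 150.70 179.32 203.21 226.51 249.63 281.42]';
    x   = ((1900:10:2000)' - 1950)/50;
    i_fit = 1:7;  i_val = 8:11;
    for n = 1:5
        c = polyfit(x(i_fit), pop(i_fit), n);
        e_fit = sum((pop(i_fit) - polyval(c, x(i_fit))).^2);
        e_val = sum((pop(i_val) - polyval(c, x(i_val))).^2);
        fprintf('n = %d: fitting SSE %8.2f, total SSE %10.2f\n', n, e_fit, e_fit+e_val)
    end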

Figure 53. General trend of fitting and validation error (difference between model output and data) in the cross-validation approach to model structure selection. [Sketch: error vs. model complexity, with curves for the fitting (training) error, the validation error, and the total error.]


EXAMPLE 72 – DETERMINING THE MOST PROMISING POLYNOMIAL MODEL STRUCTURE

Continue EXAMPLE 70 – Predicting the US Population in 2010. Fit 7 data points using a polynomial of degree n. Validate the model using the remaining 4 points.

Figure 54. Polynomial fit of US population data using the first 7 data points, and validation using the remaining 4 data points. Of all polynomials, the quadratic has the smallest total error (fitting plus validation). Note that 7 data points are not enough for a data fit by a polynomial of degree n = 6. (Mathematica function used: LinearModelFit) [Six panels, n = 1,…,6: US population (million) vs. $x = (\text{Year}-1950)/50$.]


Figure 55. Fitting, validation, and total error for fit of US population data using polynomials of degree n ≤ 5 (cf. Figure 54). The best option appears to be n = 2. [Three log-scale panels: fitting SSE, validation SSE, and total SSE vs. n = 1,…,5.]


Figure 56. Projected US population and corresponding 95% confidence interval according to the quadratic model $156.0 + 101.2x + 23.57x^2$ (cf. Figure 54). Note the difference between prediction and projection. The stipulated model only projects what future US population values would be, if the underlying reasons for population growth (hence the model) remain the same. If these reasons change, equating projections with predictions should be done with extreme caution, if at all.


9. DESIGN OF EXPERIMENTS FOR EMPIRICAL MODELING

9.1 Basics

What is a model?
  o A mathematical description of the behavior of a system.
  o A model is never exact. It always approximates the behavior of the modeled system. ("All models are wrong, but some are useful." Prof. George Box20)
How is model quality assessed?
  o Model quality is assessed by explicitly articulating the task for which the model is built and testing whether the model can actually help accomplish that task.
What does a mathematical model look like?
  o It may come in many forms, from qualitative and descriptive to quantitative.
Where can a mathematical model of a system (process) be used?
  o Process design/development
  o Process optimization
  o Process simulation
  o Training of plant personnel
  o Design of control system

20 Box, George E. P., and Norman R. Draper (1987). Empirical Model-Building and Response Surfaces, p. 424, Wiley.


9.2 Experiment design

What is experiment design?
  o Deciding what experiments to perform to reveal aspects of a situation of interest.

Figure 57. Contours of reaction yield as a function of temperature and concentration. An experiment refers to experimentally measuring the yield at point P.

The chicken-and-egg dilemma of experiment design: system knowledge is needed to pick optimal experimental conditions, yet experiments are what build system knowledge.
  o Resolution of the dilemma: Sequential experiments. [Cycle: initial system knowledge → optimal experimental conditions → conduct experiment → improved system knowledge → …]


EXAMPLE 73 – EXPERIMENT DESIGN FOR STRAIGHT LINE FIT: THE ARRHENIUS PLOT

Arrhenius plot: reaction rate $r = k\,e^{-E/(RT)} \Rightarrow \log r = \log k - \dfrac{E}{RT}$. Plot $\log r$ vs. $\dfrac{1}{T}$.

Which is better?

Figure 58. Two experiment designs for the Arrhenius plot. Which is better?

Dispelling a gross misunderstanding: It is NOT the only purpose of experiment design to minimize the effect of noise on conclusions. However, statistics can help design experiments for estimation influenced by noise as little as possible.


EXAMPLE 74 – EXPERIMENT DESIGN USING SIMILARITY

Reactor 2 is like reactor 1 ⇒ the contour plot for the yield of reactor 2 is like the contour plot of reactor 1.

Figure 59. Using prior knowledge (yield for reactor 1) to narrow down the range over which experiments will be conducted for reactor 2.


EXAMPLE 75 – FACTORIAL VS. HAPHAZARD ARRANGEMENT OF EXPERIMENTAL POINTS

Factorial design ⇒ analysis of results is easy and reliable.

Figure 60. Factorial design of experiments. [Four panels: (Reactor 1, Catalyst 1), (Reactor 1, Catalyst 2), (Reactor 2, Catalyst 1), (Reactor 2, Catalyst 2), each with experimental points on a grid of temperature (170, 190 °C) and concentration (20, 40).]


Haphazard design ⇒ analysis of results is difficult and unreliable.

Figure 61. Experimental data collected at random points. [Same four panels as Figure 60, with haphazardly placed points.]


9.3 Comprehensive vs. sequential approach to experimental investigations

Paradox: "The best time to design an experiment is after it is finished!"

Problems with comprehensive experiment design: One needs to know at the outset
  o Which variables are most important
  o What range of variables should be studied
  o The model structure (e.g., $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$? $y = \beta_0 + \beta_1 x_1^{-1} + \beta_2 x_2$? $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^{-1}$? $\ln y = \beta_0 + \beta_1 x_1 + \beta_2 x_1 x_2 + \beta_3 x_2^2$? …)

Advantages of sequential experiment design
  o The location of experiments may change to a more promising neighborhood
  o Some of the initial variables may be dropped, others substituted
  o Some variables may be considered in transformations
  o The objective of the investigation may change

25% Rule of Thumb: "No more than 25% of the experimental effort should be invested in a first design." [BHH], Chapter 9


EXAMPLE 76 – SEQUENTIAL DESIGN OF EXPERIMENTS FOR CONTROL SYSTEM DESIGN

Objective: Develop an empirical dynamic model for use in model-based computer control of an oil refinery.

Approach
  o Pre-testing: Implement simple step changes of manipulated input variables u (MV) to gauge time constants for effects on controlled output variables y (CV)
  o Testing: Implement carefully designed inputs to reveal the dynamics of the controlled process

Figure 62 – Input/output data and parameter estimates for the model structure $y_{k+1} = a y_k + b u_k + \text{noise}$ using least squares:

$$(\hat a, \hat b) = \arg\min_{a,b} \sum_k \left(y_{k+1} - a y_k - b u_k\right)^2$$

[Plots: input u(t) and output y(t) vs. time t.]
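A minimal Matlab sketch of the least-squares estimate in Figure 62, run on synthetic step-test data (the values a = 0.8, b = 0.4 and the noise level are illustrative, not from the notes):

    rng(1);  N = 200;  u = [zeros(20,1); ones(N-20,1)];   % step in the input
    y = zeros(N,1);
    for k = 1:N-1
        y(k+1) = 0.8*y(k) + 0.4*u(k) + 0.02*randn;        % "true" process
    end
    Phi = [y(1:N-1) u(1:N-1)];                            % regressors y_k, u_k
    ab  = Phi \ y(2:N)                                    % [a_hat; b_hat] ~ [0.8; 0.4]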


9.4 Critique on mathematical theories of optimal experiment design [BHH, Chapter 9]

Mathematical theories of optimal design focus
  o on at what points, in a known region of a defined factor space, to run experiments

Mathematical theories of optimal design usually assume known
  o Variables of interest
  o Region of experimentation
  o Metrics and transformations of variables
  o Mathematical model structure
  o A single design criterion


9.5 Two-level factorial designs

What is this?
- Design experiments and analyze effects of two factors that can vary continuously.

9.5.1 What is a factorial experiment?

A collection of experimental runs where all factors change simultaneously rather than one at a time.

Figure 63. A two-level factorial experiment with two factors, without (top) and with (bottom) interaction. [Top: corner responses 20, 30, 40, 52 on the (Factor A, Factor B) square; the response lines at low and high B are roughly parallel. Bottom: corner responses 20, 40, 12, 50; the effect of A depends on the level of B.]


Figure 64. Response surface and contour plot for the model $\hat y = 35.5 + 10.5x_1 + 5.5x_2$.

Figure 65. Response surface and contour plot for the model $\hat y = 35.5 + 10.5x_1 + 5.5x_2 + 8x_1x_2$.


9.5.2 Why factorial designs at two levels?

- Relatively few runs required
- Can be suitably augmented for further local exploration through composite designs
- Basis for fractional factorial designs
  o Early experimentation: Explore many factors superficially rather than few (possibly unimportant) at depth
- Building blocks of sophisticated designs (along with fractional designs)
- Interpretation of observations is simple

Definition 49 – $m^k$ factorial design

An $m^k$ factorial design has $k$ variables, each at $m$ levels.

9.5.3 What are deviation variables and why use them?

Definition 50 – Deviation variables

$$\tilde x = x - \bar x$$

- Linear regression model coefficients (except the constant term) depend on $\tilde x, \tilde y$ rather than $x, y$.
- Nonlinear regression models can be approximately linearized via Taylor series.


EXAMPLE 77 – MULTIPLE LINEAR REGRESSION WITH DEVIATION VARIABLES

Linear model:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k = \beta_0 + \beta_1(\bar x_1 + \tilde x_1) + \dots + \beta_k(\bar x_k + \tilde x_k) = \underbrace{(\beta_0 + \beta_1\bar x_1 + \dots + \beta_k\bar x_k)}_{\tilde\beta_0} + \beta_1\tilde x_1 + \dots + \beta_k\tilde x_k$$

Eqn. (126) ⇒ the optimal estimate of $\tilde\beta_0$ is $\bar y$ ⇒ the regression model becomes

$$\tilde y = \beta_1\tilde x_1 + \dots + \beta_k\tilde x_k \tag{152}$$

with $\tilde y = y - \bar y$.

The information matrix (eqn. (128)) for eqn. (152) for n experiments is

$$\tilde{\mathbf X}^T\tilde{\mathbf X} = \begin{bmatrix} \sum_i \tilde x_1^{(i)}\tilde x_1^{(i)} & \cdots & \sum_i \tilde x_1^{(i)}\tilde x_k^{(i)} \\ \vdots & & \vdots \\ \sum_i \tilde x_k^{(i)}\tilde x_1^{(i)} & \cdots & \sum_i \tilde x_k^{(i)}\tilde x_k^{(i)} \end{bmatrix}$$

Definition 51 – The size of the information matrix

The information matrix $\mathbf X^T\mathbf X$ is "large" if the product of its eigenvalues or its smallest eigenvalue is large.

- A large information matrix results in high precision of parameter estimates.

Factorial design objective: Make $\mathbf X^T\mathbf X$ "large".


Theorem 42 – Optimal settings of deviation variables for factorial designs

Optimal settings for deviation variables in factorial designs form orthogonal vectors, resulting in a diagonal information matrix $\mathbf X^T\mathbf X$.

EXAMPLE 78 – ORTHOGONAL SETTINGS FOR DEVIATION VARIABLES

Orthogonal settings for the linear regression model $y = ax_1 + bx_2 + c \Leftrightarrow \tilde y = a\tilde x_1 + b\tilde x_2$:

$$\mathbf X = \begin{bmatrix} -1 & -1 \\ +1 & -1 \\ -1 & +1 \\ +1 & +1 \end{bmatrix}$$

Definition 52 – Denoting deviation variables at their high or low values in two-level factorial experiments

High value: "+"; low value: "−".

EXAMPLE 79 – ORTHOGONAL SETTINGS FOR DEVIATION VARIABLES OF EXAMPLE 78

$$\mathbf X^T\mathbf X = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}$$

Table 5. Alternative notation for a factorial design with two variables at two levels

Run   x̃1   x̃2   Label   x̃1   x̃2
1     +    +    x1x2    +1   +1
2     +    −    x1      +1   −1
3     −    +    x2      −1   +1
4     −    −    1       −1   −1

[Sketch: the four runs at the corners of the (Factor A, Factor B) square.]


9.5.4 Why factorial and not one-at-a-time experiments?

Factorial experiments "excite" a process more, thus increasing the signal-to-noise ratio.
  o If the effect of the factors is linear, factorial experiments ⇒ higher precision than in one-at-a-time experiments
  o If the effect of the factors is nonlinear (factors interact), factorial experiments ⇒ better detection and estimation of interactions than in one-at-a-time experiments


EXAMPLE 80 – FACTORIAL DESIGN VS. ONE-AT-A-TIME DESIGN

If $y = f(x_1, x_2)$, what is $f(x_1, x_2)$?

Candidates:

$$y = ax_1 + bx_2 + c \quad\Leftrightarrow\quad \tilde y = a\tilde x_1 + b\tilde x_2 \tag{153}$$

$$y = \alpha x_1 + \beta x_2 + \gamma x_1 x_2 + \delta = \alpha(\bar x_1 + \tilde x_1) + \beta(\bar x_2 + \tilde x_2) + \gamma(\bar x_1 + \tilde x_1)(\bar x_2 + \tilde x_2) + \delta \quad\Leftrightarrow\quad \tilde y = \tilde\alpha\tilde x_1 + \tilde\beta\tilde x_2 + \gamma\tilde x_1\tilde x_2 \tag{154}$$

Factor scaling: High → 1, Low → −1, Average → 0.

Figure 66. Factorial design and one-at-a-time design with an equal number of experiments. [Two (Factor A, Factor B) squares with four experimental points each.]

The precision of the estimates of $a, b$ and $\tilde\alpha, \tilde\beta, \gamma$ depends on the information matrix (Theorem 37, p. 117):

- Linear model, linear regression:

  o Factorial design:

$$\mathbf X = \begin{bmatrix} -1 & -1 \\ 1 & -1 \\ -1 & 1 \\ 1 & 1 \end{bmatrix} \Rightarrow \mathbf X^T\mathbf X = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} \Rightarrow \mathrm{Var}\begin{bmatrix}\hat A\\\hat B\end{bmatrix}_{\text{Factorial}} = \sigma_{\text{noise}}^2\begin{bmatrix} 1/4 & 0 \\ 0 & 1/4 \end{bmatrix}$$

  o One-at-a-time design:

$$\mathbf X = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \Rightarrow \mathbf X^T\mathbf X = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \Rightarrow \mathrm{Var}\begin{bmatrix}\hat A\\\hat B\end{bmatrix}_{\text{One-at-a-time}} = \sigma_{\text{noise}}^2\begin{bmatrix} 1/2 & 0 \\ 0 & 1/2 \end{bmatrix}$$

$$\Rightarrow\ \mathrm{Var}\begin{bmatrix}\hat A\\\hat B\end{bmatrix}_{\text{One-at-a-time}} > \mathrm{Var}\begin{bmatrix}\hat A\\\hat B\end{bmatrix}_{\text{Factorial}}$$


- Nonlinear model, linear regression:

  o Factorial design:

$$\mathbf X = \begin{bmatrix} -1 & -1 & 1 \\ 1 & -1 & -1 \\ -1 & 1 & -1 \\ 1 & 1 & 1 \end{bmatrix} \Rightarrow \mathbf X^T\mathbf X = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix} \Rightarrow \mathrm{Var}\begin{bmatrix}\hat{\tilde\alpha}\\\hat{\tilde\beta}\\\hat\gamma\end{bmatrix}_{\text{Factorial}} = \sigma_{\text{noise}}^2\begin{bmatrix} 1/4 & 0 & 0 \\ 0 & 1/4 & 0 \\ 0 & 0 & 1/4 \end{bmatrix}$$

  o One-at-a-time design:

$$\mathbf X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \end{bmatrix} \Rightarrow \mathbf X^T\mathbf X = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix}\ \text{(singular!)}$$

i.e. the interaction coefficient $\gamma$ cannot be estimated at all from the one-at-a-time design, and again

$$\mathrm{Var}\begin{bmatrix}\hat{\tilde\alpha}\\\hat{\tilde\beta}\end{bmatrix}_{\text{One-at-a-time}} > \mathrm{Var}\begin{bmatrix}\hat{\tilde\alpha}\\\hat{\tilde\beta}\end{bmatrix}_{\text{Factorial}}$$
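The comparison can be verified numerically in a few lines of Matlab (a minimal sketch):

    Xf = [-1 -1; 1 -1; -1 1; 1 1];          % factorial settings
    Xo = [1 0; 1 0; 0 1; 0 1];              % one-at-a-time settings
    Xf'*Xf, Xo'*Xo                          % diag(4,4) vs diag(2,2)
    Xf3 = [Xf Xf(:,1).*Xf(:,2)];            % append interaction column
    Xo3 = [Xo Xo(:,1).*Xo(:,2)];
    Xf3'*Xf3, Xo3'*Xo3                      % diag(4,4,4) vs singular diag(2,2,0)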


9.5.5 Effects of factors and interactions among factors

Definition 53 – Interactions among factors

$$y = \beta_0 + \underbrace{\sum_i \beta_i\tilde x_i}_{\text{main effects}} + \underbrace{\sum_{i<j}\beta_{ij}\tilde x_i\tilde x_j}_{\text{effect of two-factor interactions}} + \underbrace{\sum_{i<j<k}\beta_{ijk}\tilde x_i\tilde x_j\tilde x_k}_{\text{effect of three-factor interactions}} + \dots \tag{155}$$

How many terms (effects) would there be in a model with k factors?

EXAMPLE 81 – NUMBER OF EFFECTS IN MODEL WITH 3 FACTORS

$$y = \beta_0 + \underbrace{\beta_1\tilde x_1 + \beta_2\tilde x_2 + \beta_3\tilde x_3}_{\text{main effects}} + \underbrace{\beta_{12}\tilde x_1\tilde x_2 + \beta_{23}\tilde x_2\tilde x_3 + \beta_{13}\tilde x_1\tilde x_3}_{\text{effect of two-factor interactions}} + \underbrace{\beta_{123}\tilde x_1\tilde x_2\tilde x_3}_{\text{effect of three-factor interactions}}$$

Total number of effects (terms) in the above equation: $1 + 3 + 3 + 1 = 8 = 2^3$.

Generalize for eqn. (155):

- 1 term: $\beta_0$
- $k$ terms: $\beta_1\tilde x_1,\dots,\beta_k\tilde x_k$
- $\dfrac{k(k-1)}{2}$ terms: $\beta_{12}\tilde x_1\tilde x_2,\dots$
- $\dfrac{k(k-1)(k-2)}{2\cdot 3}$ terms: $\beta_{123}\tilde x_1\tilde x_2\tilde x_3,\dots$
- …
- 1 term: $\beta_{12\cdots k}\tilde x_1\tilde x_2\cdots\tilde x_k$

Total number of terms (coefficients $\beta$):


$$1 + \frac{k!}{1!(k-1)!} + \frac{k!}{2!(k-2)!} + \frac{k!}{3!(k-3)!} + \dots + 1 = 1 + k + \frac{k(k-1)}{2} + \frac{k(k-1)(k-2)}{2\cdot 3} + \dots + 1 = (1+1)^k = 2^k\ \text{terms}$$

- A full two-level factorial experiment provides enough data to estimate the effects of all factors and all of their interactions.
- A fractional two-level factorial experiment provides enough data to estimate the effects of all factors and some of their interactions.


EXAMPLE 82 – 2³ FACTORIAL DESIGN: PILOT PLANT INVESTIGATION

Effect of temperature, catalyst concentration, and kind of catalyst on reaction yield. Full factorial experiment design. Duplicate experiments are run at each point.

Original units of variables:

Test condition #   Temperature, T (°C)   Concentration, C (%)   Catalyst, K (A or B)   Yield, y (grams)
1                  160                   20                     A                      60
2                  180                   20                     A                      72
3                  160                   40                     A                      54
4                  180                   40                     A                      68
5                  160                   20                     B                      52
6                  180                   20                     B                      83
7                  160                   40                     B                      45
8                  180                   40                     B                      80

Coded units of variables:

1   −   −   −   60
2   +   −   −   72
3   −   +   −   54
4   +   +   −   68
5   −   −   +   52
6   +   −   +   83
7   −   +   +   45
8   +   +   +   80

Parameter estimates

Linear model structure:

$$y = \beta_0 + \underbrace{\beta_T\tilde T + \beta_C\tilde C + \beta_K\tilde K}_{\text{main effects}} + \underbrace{\beta_{TC}\tilde T\tilde C + \beta_{CK}\tilde C\tilde K + \beta_{TK}\tilde T\tilde K}_{\text{effect of two-factor interactions}} + \underbrace{\beta_{TCK}\tilde T\tilde C\tilde K}_{\text{effect of three-factor interactions}}$$

Estimates:

$$\hat\beta_0 = \bar y = \frac{y_1 + \dots + y_8}{8} = 64.25$$


In deviation form, $\tilde{\mathbf y} = \tilde{\mathbf X}\boldsymbol\beta + \mathbf e$ with $\boldsymbol\beta = [\beta_T\ \beta_C\ \beta_K\ \beta_{TC}\ \beta_{CK}\ \beta_{TK}\ \beta_{TCK}]^T$ and the columns of $\tilde{\mathbf X}$ ordered as $(\tilde T, \tilde C, \tilde K, \tilde T\tilde C, \tilde C\tilde K, \tilde K\tilde T, \tilde T\tilde C\tilde K)$. In original units ($\tilde T = \pm 10$, $\tilde C = \pm 10$, $\tilde K = \pm 1$):

$$\tilde{\mathbf X} = \begin{bmatrix} -10 & -10 & -1 & 100 & 10 & 10 & -100 \\ 10 & -10 & -1 & -100 & 10 & -10 & 100 \\ -10 & 10 & -1 & -100 & -10 & 10 & 100 \\ 10 & 10 & -1 & 100 & -10 & -10 & -100 \\ -10 & -10 & 1 & 100 & -10 & -10 & 100 \\ 10 & -10 & 1 & -100 & -10 & 10 & -100 \\ -10 & 10 & 1 & -100 & 10 & -10 & -100 \\ 10 & 10 & 1 & 100 & 10 & 10 & 100 \end{bmatrix}$$

$$\tilde{\mathbf X}^T\tilde{\mathbf X} = \mathrm{diag}(800,\,800,\,8,\,80000,\,800,\,800,\,80000),\qquad \tilde{\mathbf X}^T\tilde{\mathbf y} = [920\ \ -200\ \ 6\ \ 600\ \ 0\ \ 400\ \ 200]^T$$

$$\Rightarrow\ [\hat\beta_T\ \hat\beta_C\ \hat\beta_K\ \hat\beta_{TC}\ \hat\beta_{CK}\ \hat\beta_{TK}\ \hat\beta_{TCK}] = [1.15\ \ -0.25\ \ 0.75\ \ 0.0075\ \ 0\ \ 0.5\ \ 0.0025]$$

or, in coded units,

$$\mathbf X = \begin{bmatrix} -1 & -1 & -1 & 1 & 1 & 1 & -1 \\ 1 & -1 & -1 & -1 & 1 & -1 & 1 \\ -1 & 1 & -1 & -1 & -1 & 1 & 1 \\ 1 & 1 & -1 & 1 & -1 & -1 & -1 \\ -1 & -1 & 1 & 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 & -1 & 1 & -1 \\ -1 & 1 & 1 & -1 & 1 & -1 & -1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}$$

$$\mathbf X^T\mathbf X = \mathrm{diag}(8,8,8,8,8,8,8),\qquad \mathbf X^T\tilde{\mathbf y} = [92\ \ -20\ \ 6\ \ 6\ \ 0\ \ 40\ \ 2]^T$$

$$\Rightarrow\ [\hat\beta_T\ \hat\beta_C\ \hat\beta_K\ \hat\beta_{TC}\ \hat\beta_{CK}\ \hat\beta_{TK}\ \hat\beta_{TCK}] = [11.5\ \ -2.5\ \ 0.75\ \ 0.75\ \ 0\ \ 5\ \ 0.25]$$

Are all effects important?
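A minimal Matlab sketch reproducing the coded-unit estimates (here from the 8 averaged yields, with an explicit column of ones for $\beta_0$):

    y = [60 72 54 68 52 83 45 80]';
    T = [-1 1 -1 1 -1 1 -1 1]';  C = [-1 -1 1 1 -1 -1 1 1]';  K = [-1 -1 -1 -1 1 1 1 1]';
    X = [ones(8,1) T C K T.*C C.*K K.*T T.*C.*K];
    b = (X'*X) \ (X'*y)     % [64.25; 11.5; -2.5; 0.75; 0.75; 0; 5; 0.25]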


Confidence intervals cannot be calculated from the previous data (cf. eqn. (141) and on). Calculate standard errors for the effects using the replicated runs:

Test condition #         1    2    3    4    5    6    7    8
ȳ, average of two runs   60   72   54   68   52   83   45   80
y, run 1                 59   74   50   69   50   81   46   79
y, run 2                 61   70   58   67   54   85   44   81

Then, with all 16 individual runs,

$$\mathbf X_{16}^T\mathbf X_{16} = 2\,\mathrm{diag}(8,8,8,8,8,8,8),\qquad \mathbf X_{16}^T\tilde{\mathbf y}_{16} = 2\,[92\ \ -20\ \ 6\ \ 6\ \ 0\ \ 40\ \ 2]^T$$

$$\Rightarrow\ [\hat\beta_T\ \hat\beta_C\ \hat\beta_K\ \hat\beta_{TC}\ \hat\beta_{CK}\ \hat\beta_{TK}\ \hat\beta_{TCK}] = [11.5\ \ -2.5\ \ 0.75\ \ 0.75\ \ 0\ \ 5\ \ 0.25]\ \text{(same as before!)}$$

and eqn. (141) with $\alpha$ = 1 − confidence level = 0.10:

$t_{\alpha/2}$ = appropriate point of the T-distribution with $n-(k+1) = 16-8 = 8$ DOF = 2.3

$$s = \sqrt{\frac{\mathrm{SSE}}{n-(k+1)}} = \sqrt{\frac{\sum_{j=1}^n \hat e_j^2}{16-8}} = \sqrt{\frac{64}{8}} = 2.828$$

$$C_{ii} = [(\mathbf X^T\mathbf X)^{-1}]_{ii} = \frac{1}{16}\ \Rightarrow\ \sqrt{C_{ii}} = 0.25$$

$$\hat\beta_i \pm t_{\alpha/2}\,s\sqrt{C_{ii}} = \hat\beta_i \pm 1.6,\qquad i = 1,\dots,8 \tag{156}$$

Therefore

Effect             Estimate   90% confidence
average            64.25      ±0.8
Temperature, T     11.5       ±1.6
Concentration, C   −2.5       ±1.6
Catalyst, K        0.75       ±1.6
T × C              0.75       ±1.6
C × K              0          ±1.6
K × T              5          ±1.6
T × C × K          0.25       ±1.6


What was the source of variability? A randomized pilot plant run involves the following steps:

1. Clean reactor
2. Insert catalyst
3. Run at given temperature and feed concentration for 3 hours to settle at experimental conditions
4. Sample every 15 minutes during the last hour
5. Combine chemical analysis of samples

"Block what you can and randomize what you cannot." - R. A. Fisher


9.5.6 Analysis of factorials through visual inspection

Sometimes simple inspection is enough.

EXAMPLE 83 – EFFECT OF THREE VARIABLES ON CLARITY OF FILM

Visual inspection ⇒ Emulsifier B important.

[Cube plot: film clarity measured at runs #1-#8, the corners of the (Emulsifier A (%), Emulsifier B (%), catalyst concentration K) cube; the responses change sign as Emulsifier B moves from its low to its high level.]


EXAMPLE 84 – EFFECT OF THREE VARIABLES ON PHYSICAL PROPERTIES OF POLYMER SOLUTION

Variables:
1. Amount of reactive monomer (%): 10, 30
2. Type of chain length regulator: A, B
3. Amount of chain length regulator (%): 1, 3

Formulation   Variable 1   2   3   Milky?   Viscous?   Yellow?
1             −            −   −   yes      yes        no
2             +            −   −   no       yes        no
3             −            +   −   yes      yes        no
4             +            +   −   no       yes        slightly
5             −            −   +   yes      no         no
6             +            −   +   no       no         no
7             −            +   +   yes      no         no
8             +            +   +   no       no         slightly

By inspection:
- Milky = f(Variable 1)
- Viscous = f(Variable 3)
- Yellow = f(Interaction of variables 1 and 2)…