tma 4255 applied statistics - ntnu · tma 4240/45 statistics and st 0103: probability theory +...

1

TMA 4255 Applied Statistics

Spring 2010

2

About the course

Learning outcomeThe objective of the course is to give the students a solid foundation for use of basic statistical methods in science and technology. In addition the students shall be capable of planning collection of data and to use statistical software for analysing data.

Learning methods and activitiesLectures and exercises with the use of a computer (computing programme MINITAB). The lectures may be given in English. Portfolio assessment is the basis for the grade awarded in the course. This portfolio comprises a written final examination 80% and selected parts of the exercises 20%. The results for the constituent parts are to be given in %-points, while the grade for the whole portfolio (course grade) is given by the letter grading system. Retake of examination may be given as an oral examination.

Compulsory assignmentsExercises

Recommended previous knowledgeThe course is based on ST0103 Statistics with Applications/TMA4240 Statistics/TMA4245 Statistics, or equivalent.

3

Contents (preliminary list)Hypotheses testing, simple and multiple linear regression, residual plots and selection of variables, transformations, design of experiments, 2^k experiments and fractions of these. Special designs. Graphical methods. Error propagation formula. Analysis of variance, statistical process control, contingency tables and nonparametric methods. Use of statistical computer package, MINITAB.

LecturerProfessor Bo Lindqvist, Room 1129, Sentralbygg II, NTNU Gløshaugen Telephone: (735) 93532 Email: [email protected]

Teaching assistantStipendiat Håkon Toftaker, Room 1036, Sentralbygg II, NTNU GløshaugenTelephone (735) 91681Email: [email protected]

http://www.math.ntnu.no/~bo/

mailto:[email protected]

http://www.math.ntnu.no/~toftaker/

mailto:[email protected]

4

Teaching material

Main book:Walpole, Myers, Myers and Ye: "Probability and Statistics for Engineers and Scientists". Eighth Edition. Pearson International Edition.

Tables:”Tabeller og formler i statistikk”, 2. utgave. Tapir 2009.

MINITAB:Information is found on http://www.ntnu.no/adm/it/brukerstotte/programvare/minitab.

http://www.ntnu.no/adm/it/brukerstotte/programvare/minitab

5

Weekly meetings

LecturesTuesdays 12.15 – 14.00 H3Thursdays 12.15 – 14.00 S4

ExercisesMondays 17.15 – 18.00 in H3

or a computer lab

6

Week Topic Chapter (WMMY) Exercise

2Introduction, motivation and repetition.

Descriptive measures and graphs. Normal plot.

(1-10) Particularly 8.1-8.7 1

3Two-sample case. Comparing variances.

F-distribution. Simple linear regression,

8.8, 9.8, 9.13, 10.8, 10.13, 10.18.(11.1-11.5) 11.6 – 11.12 2-3

4-5 Multiple linear regression 12.1 -12.7, 12.9-12.11 4-5

6-8 2k experiment and fractions thereof BHH 10 and 12 Alternatively, WMMY 15 6-8

9-10 Analysis of variance 13.1-13.4,13.6,13.8-13.10,13.13,13.15 14.1-14.4 9-10

11 Statistical process control 17.1-17.5 11

12 Chi-square tests and Contingency tables 10.14-10.16 12

13-14 (Tuesd) Easter vacation

14(Thursd)-15 Nonparametric statistics 16.1-16.3 12

16 Approximation of expectation and variance

17(Tuesd) Repetition

Preliminary curriculum, lecturing and progress plan

15

The compulsory project: Example

18

Introduction to course

TMA 4240/45 Statistics and ST 0103: Probability theory + simple statistics

TMA 4255 (this course):A little probability + APPLICABLE and APPLIED statistics

The ”classical” statistical methods:

• Regression analysis• Design of experiments• Analysis of variance (ANOVA)• Analysis of discrete data (contingency tables)• Nonparametric methods

19

Why is statistics important in science and industry?

The book emphasizes ”the Japanese industrial miracle”:

•Use of statistical methods in design and production•Statistical thinking in all parts of the production

In 2000, the highly reputated international medical journal New England Journal of Medicine appointed

•Use of statistical methods

as one of the 11 most important medical advances throughout time

21

Originally: Statistics = ”collection and presentation of data”

Today much more:

•Design and collection of data from statistical investigations

•Modeling of the stochastic mechanisms behind the data

•Drawing conclusions about these mechanisms, based on the data

•Evaluation of the strength of the conclusions (variance, confidence interval, test power)

•Basic tool: Probability theory

22

Statistical investigations can be divided into two main types:

•Experimental studies based on design of experiments (DOE): Experiments are done under controlled conditions.

•Observational studies: When control of conditions are not possible.

23

Statistical studiesEksperimental studies:Clinical trialsComparison of drugs A og BTrial group of n persons

r drawn at random aregiven A

s drawn at random aregiven B

n-r-s (rest) get Placebo

Blind test: Patient does not know kind of drug

Double blind: Examining doctor does not know either

Observational studies:Epidemiological experiments• Smoking and cancer• Diet and coronary diseases

Cannot control the conditions; e.g. cannot force people to smoke/not smoke.

Difficulty in interpretation: May be unknown underlying causes which make the results biased (”confounding”). For example: A gene which increases the need for smoking, and at the same time influences chance of getting cancer.

24

Statistics in scientific investigations:

•Generate hypotheses

•Derive consequences

•See whether these are fulfilled in observations

•Generate new hypotheses

•Etc.

25

Particular matters for statistics in industry:

Product and process engineer:

Off-line: Controlled experiments aiming at optimizing production

Production: Register production data; use them to control production.

26

MINITAB 15: Rocket fuel example Stat > Basic Statistics > Display Descriptive Statistics

27

Stat > Basic Statistics > Graphical Summary

42403836

Median

Mean

42,041,541,040,540,039,539,0

1st Q uartile 39,275Median 41,0003rd Q uartile 42,000Maximum 42,600

39,136 41,984

39,266 42,037

1,369 3,634

A -Squared 0,59P-V alue 0,092

Mean 40,560StDev 1,991V ariance 3,963Skewness -1,55360Kurtosis 2,72795N 10

Minimum 35,900

A nderson-Darling Normality Test

95% C onfidence Interv al for Mean

95% C onfidence Interv al for Median

95% C onfidence Interv al for StDev95% Confidence Intervals

Summary for X

28

More plots: Rocket fuel data

42,341,440,539,638,737,836,936,0X

Dotplot of X

47,545,042,540,037,535,0

99

95

90

80

70

60504030

20

10

5

1

X

Perc

ent

Mean 40,56StDev 1,991N 10AD 0,589P-Value 0,092

Probability Plot of XNormal - 95% CI

45,042,540,037,535,0

100

80

60

40

20

0

X

Perc

ent

Mean 40,56StDev 1,991N 10

Empirical CDF of XNormal

31

Statistical inference with MINITAB: Rocket fuel data

Stat > Basic Statistics > 1-Sample Z Assume known standard deviation sigma=2

One-Sample Z: X

Test of mu = 40 vs not = 40The assumed standard deviation = 2

Variable N Mean StDev SE Mean 95% CI Z PX 10 40,560 1,991 0,632 (39,320; 41,800) 0,89 0,376

One-Sample Z: X

Test of mu = 40 vs > 40The assumed standard deviation = 2

95% LowerVariable N Mean StDev SE Mean Bound Z PX 10 40,560 1,991 0,632 39,520 0,89 0,188

32

Stat > Basic Statistics > 1-Sample t Assume unknown standard deviation sigma

One-Sample T: X

Test of mu = 40 vs not = 40

Variable N Mean StDev SE Mean 95% CI T PX 10 40,560 1,991 0,629 (39,136; 41,984) 0,89 0,397

One-Sample T: X

Test of mu = 40 vs > 40

95% LowerVariable N Mean StDev SE Mean Bound T PX 10 40,560 1,991 0,629 39,406 0,89 0,198

34

One and two sample tests concerning means

35

Example: The industrial experment

An experiment was performed on a manufacturing plant by making in sequence 10 batches using the standard production method (A), followed by 10 batches of a modified method (B). The results from these trials are given in the table on the next slide. What evidence do the data provide that method B is better than method A?

38

(Assuming equal variances:)

39

(Not assuming equal variances:)

40

Two sample t-test

41

Paired observations (example)

44

Met

ode

95% Bonferroni Confidence Intervals for StDevs

2

1

8765432

Met

ode

X

2

1

92908886848280

F-Test

0,390

Test Statistic 1,58P-Value 0,505

Levene's Test

Test Statistic 0,78P-Value

Test for Equal Variances for X

46

Regression analysisResponseDependent variable (WMMY)

Y

Explanatory variablePredictorCovariateIndependent variable (WMMY)Regressor (WMMY)

Goal: Describe Y as function of the xi,

47

Linear Regression

48

Typical data

49

EXAMPLE

Connection between stiffness (stivheit) and density (tettleik) of a tree product.

50

Plot of stiffness versus density

252015105

100000

80000

60000

40000

20000

0

tettleik

stiv

heit

Scatterplot of stivheit vs tettleik

52

Least squares regression line

252015105

100000

80000

60000

40000

20000

0

tettleik

stiv

heit

Scatterplot of stivheit vs tettleik

56

Regression line for logstiv ( = log(stivheit) )

252015105

11,5

11,0

10,5

10,0

9,5

9,0

8,5

tettleik

logs

tiv

Scatterplot of logstiv vs tettleik

59

Predicted values for new observations

66

Multiple Linear Regression

67

Example – Acid rain in Norwegian lakesData from a study of the influence of acid rain on Norwegian lakes, made in 1986. Totally 1005 lakes were studied. In this example are chosen 26 lakes, 16 from Telemark and 10 from Trøndelag.

Variable name

Meaning

x1 Content of SO4 (mg/l)

x2 Content of NO3 (mg/l)

x3 Content of Ca (mg/l)

x4 Content of latent aluminum (mu g/l)

x5 Content of organic substance (mg/l)

x6 Area of lake

x7 Site (0 = Telemark, 1 = Trøndelag)

y Measured pHz Fish status (1 = no damage, 2 = minor damage, 3 = major damage, 4 = no fish)

ROW x1 x2 x3 x4 x5 x6 x7 y z1 4.9 39 1.54 78 2.02 0.30 0 5.38 32 4.1 75 1.55 17 2.98 1.85 0 5.68 13 3.5 80 0.83 157 3.40 0.25 0 5.04 34 3.8 75 0.53 163 3.42 0.30 0 4.81 45 3 8 90 0 82 105 3 41 0 25 0 4 92 4

76

Model using x1,x2,x3

81

Simulation20 observations are simulatedfrom the modelY = 1 + ….WithEpsilon N(0,0.02^2)X1 uniform on (1,3)X2 is N(1,0.5^2)

82

The wrong model

Y = beta0 …..

is estimated by MINITAB

83

Normal plot looks at first OK, but has a systematic convex shape. Thehistogram is skew to the left, while the other plots are fairly OK.

84

The residuals vs x1 have an ”opposite U” shape, which suggests that model (1)is not correct. Residuals vs x2 look better, but there is possibly something wronghere, too.

85

The residual plot suggests thatx1 should be transformed. Weuse log x1 (which is of coursecorrect since we know theunderlying model. Estimate thegiven model in MINITAB

88

Example: Multicollinearity

92

Factorial Experiments

99

Back to two-factor experiment

100

DOE terminology – main effects:

101

Interaction effects

103

Three factors

z1 z2 z3 z12 z13 z23 z123

104

Two-factor interaction

105

Three-factor interaction

106

Or: using + and – in columns:

109

Cube plot

110

Four factors – example

111

Full Factorial Design

Factors: 4 Base Design: 4; 16Runs: 16 Replicates: 1Blocks: 1 Center pts (total): 0

All terms are free from aliasing.

112

Factorial Fit: Y versus A; B; C; D

Estimated Effects and Coefficients for Y (coded units)

Term Effect CoefConstant 72,250A -8,000 -4,000B 24,000 12,000C -2,250 -1,125D -5,500 -2,750A*B 1,000 0,500A*C 0,750 0,375A*D -0,000 -0,000B*C -1,250 -0,625B*D 4,500 2,250C*D -0,250 -0,125A*B*C -0,750 -0,375A*B*D 0,500 0,250A*C*D -0,250 -0,125B*C*D -0,750 -0,375A*B*C*D -0,250 -0,125

S = *

113

Analysis of Variance for Y (coded units)

Source DF Seq SS Adj SS Adj MS F PMain Effects 4 2701,25 2701,25 675,313 * *2-Way Interactions 6 93,75 93,75 15,625 * *3-Way Interactions 4 5,75 5,75 1,438 * *4-Way Interactions 1 0,25 0,25 0,250 * *Residual Error 0 * * *Total 15 2801,00

114

Effect

Perc

ent

2520151050-5-10

99

95

90

80

7060504030

20

10

5

1

Factor

D

NameA AB BC CD

Effect TypeNot SignificantSignificant

BD

D

B

A

Normal Probability Plot of the Effects(response is Y, Alpha = ,05)

Lenth's PSE = 1,125

115

Term

Effect

AD

CDACD

ABCDABD

ACABCBCD

ABBC

CBDDA

B

2520151050

2,89Factor

D

NameA AB BC CD

Pareto Chart of the Effects(response is Y, Alpha = ,05)

Lenth's PSE = 1,125

116

Mea

n of

Y

1-1

84

78

72

66

601-1

1-1

84

78

72

66

60

1-1

A B

C D

Main Effects Plot (data means) for Y

117

A

1-1 1-1 1-1

80

70

60

B

80

70

60

C

80

70

60

D

A-11

B-11

C-11

Interaction Plot (data means) for Y

118

Example: Three factors and replicate

119

Factorial Fit: Y versus A; B; C

Estimated Effects and Coefficients for Y (coded units)

Term Effect Coef SE Coef T PConstant 64,250 0,7071 90,86 0,000A 23,000 11,500 0,7071 16,26 0,000B -5,000 -2,500 0,7071 -3,54 0,008C 1,500 0,750 0,7071 1,06 0,320A*B 1,500 0,750 0,7071 1,06 0,320A*C 10,000 5,000 0,7071 7,07 0,000B*C 0,000 0,000 0,7071 0,00 1,000A*B*C 0,500 0,250 0,7071 0,35 0,733

S = 2,82843 R-Sq = 97,63% R-Sq(adj) = 95,55%

Analysis of Variance for Y (coded units)

Source DF Seq SS Adj SS Adj MS F PMain Effects 3 2225,00 2225,00 741,667 92,71 0,0002-Way Interactions 3 409,00 409,00 136,333 17,04 0,0013-Way Interactions 1 1,00 1,00 1,000 0,13 0,733Residual Error 8 64,00 64,00 8,000

Pure Error 8 64,00 64,00 8,000Total 15 2699,00

120

Term

Standardized Effect

BC

ABC

C

AB

B

AC

A

181614121086420

2,31Factor NameA AB BC C

Pareto Chart of the Standardized Effects(response is Y, Alpha = ,05)

121

Standardized Effect

Perc

ent

151050-5

99

95

90

80

70

60504030

20

10

5

1

Factor NameA AB BC C

Effect TypeNot SignificantSignificant

AC

B

A

Normal Probability Plot of the Standardized Effects(response is Y, Alpha = ,05)

122

Blocking in 2^k experimentsFull experiment:

Two blocks:Use column ABC as generator, i.e.Block 1 consists of experiments with ABC = -1Block 2 consists of experiments with ABC = 1

123

The interaction ABC is confounded (”mixed”) with the block effect. This means that the value of the estimated coefficient of ABC can bedue to both interaction effect and block-effect.

Suppose all Y in block 2 are increased by a value h. Then the estimatedeffect of ABC will increase by h. But one cannot know from observationswhether this is due to the interaction ABC or the block effect.

On the other hand, the estimated main effects A,B,C and the two-factorinteractions AB,AC,BC are not changed by the h. These are of mostimportance to estimate, so the choice of blocking seems reasonable.

124

Four blocks in 2^3 experiment

Need two columns of +/- to define 4 blocks. Turns out that the bestoption is to use two two-factor interactions, e.g. AB and AC (whichis default in MINITAB

Block 1: Experiments where AB = AC = -1Block 2: Experiments where AB = -1, AC = 1Block 3: Experiments where AB = 1, AC = -1Block 4: Experiments where AB = AC = 1

125

Block structure is as follows:

Interaction effects AB and AC are confounded with the block effect, since theyare generators for the blocks. In addition, their product AB*AC = AABC = BCis confounded with the block effect (Note: the BC column is constant within eachblock.

Adding h2 to block 2, h3 to block 3, h4 to block 4 does not change estimatedeffects of A,B,C, and also does not change the third order interaction ABC. However, e.g. AB will change by 2h3+2h4-2h2 and we do not know whetherthis is due to an interaction effect or blocking effect: This is CONFOUNDING.

126

How to determine which columns to use for blocking?

Idea: Try to leave estimates for main effects and low order interactionsunchanged by blocking.

Note: I = AA = BB = CC where I is a column of 1’s

Find the blocks for a 2^3 experiment using columns ABC and AC.The interaction between ABC and AC is

ABC*AC = AA*B*CC = B

which is a main effect, which hence is confounded with the block effect(in addition to ABC and AC)

128

Example obligatory project

132

MINITAB plots

133

Assuming third and fourth order interactions are 0

134

Interaction plots

140

From Exam in TMA4260 Industrial Statistics, december 2003, Exercise 2

A company decides to investigate the hardening process of a ballbearing production. The following four factors are chosen:

A: content of added carbonB: Hardening temperatureC: Hardening timeD: Cooling temperature.

Design and results are given below:

a) What is the generator and the defining relation of the design, and what is thedesign’s resolution? Write down the alias structure.

Find the estimates of the main effect of A and the interaction effect AC.

141

b) What is the variance of the main effect A and the interaction AC?

Assume that the st deviation sigma has been estimated from other experiments, by s = 0.312 with 9 degrees of freedom (in the exam, this had been done in Ex 1.)Use this estimate to find out whether the interaction AC is significantly different from 0 (i.e. ”active”) Use 5% significance level. What is the conclusion of the experimentso far?

142

The company is well satisfied with the results so far and they decide to carry outalso the other half fraction. The result of the other half fraction is given below.

143

Use this to find unconfounded estimates for the main effects and the two-factorinteractions.

Assume that one would like to estimate the variance of the effects from thehigher order interactions. Explain how this can be done, and find the estimate.Is it wise to include the four-factor interaction in this calculation? Why (not)?

Later, one of the operators that participated in the experiments asked whetherone could have carried out the first half fraction in (a) in two blocks. This would,he said, have simplified considerably the performance of the experiments. What answer would you give to the operator?

144

From Exam in SIF 5066 Experimental design and…, May 2003, Exercise 1

A company making ballbearings experienced problems with the lifetimes of theproducts. In an experiments that they carried out they considered the factors

A: type of ball – standard (-) or modified (+)B: type of cage - standard (-) or modified (+)C: type of lubricate - standard (-) or modified (+)D: quantity of lubricate – normal (-) or large (+)

The repsonse was the lifetime of the ballbearing in an accelerated life testing experiment. The results are given on the next page.

145

A: type of ball B: type of cageC: type of lubricateD: quantity of lubricate

a) What type of experiment is this? What is the defining relation? What is theresolution? Calculate estimates of the main effect of A and the two-factorinteraction AB.

b) Estimated contrasts for B,C,D,AC,AD are, respectively, 0.60, 0.31, 0.22, -0.11,-0.01. What can you say about the estimated effects for CD, BD, BC. BCD, ACD,ABD,ABC?Assume that factors C and D do not influence the response. Explain why this is then a 2^2 experiment with replicate. Calculate an estimate for the variance of theeffects, and find out whether A, B and AB are now significant.

c) Give an interpretation of the results. The experiment was in fact carried out in twoblocks, where experiments 1-4 was one block and 5-8 the other. How is thisblocking constructed? How will we need to modify the analysis of significance in (b)?(Assume again that C,D do not influence the response)

tma 4255 applied statistics - ntnu · tma 4240/45 statistics and st 0103: probability theory +...

Documents