tma 4255 applied statistics - ntnu · tma 4240/45 statistics and st 0103: probability theory +...
TRANSCRIPT
1
TMA 4255 Applied Statistics
Spring 2010
2
About the course
Learning outcomeThe objective of the course is to give the students a solid foundation for use of basic statistical methods in science and technology. In addition the students shall be capable of planning collection of data and to use statistical software for analysing data.
Learning methods and activitiesLectures and exercises with the use of a computer (computing programme MINITAB). The lectures may be given in English. Portfolio assessment is the basis for the grade awarded in the course. This portfolio comprises a written final examination 80% and selected parts of the exercises 20%. The results for the constituent parts are to be given in %-points, while the grade for the whole portfolio (course grade) is given by the letter grading system. Retake of examination may be given as an oral examination.
Compulsory assignmentsExercises
Recommended previous knowledgeThe course is based on ST0103 Statistics with Applications/TMA4240 Statistics/TMA4245 Statistics, or equivalent.
3
Contents (preliminary list)Hypotheses testing, simple and multiple linear regression, residual plots and selection of variables, transformations, design of experiments, 2^k experiments and fractions of these. Special designs. Graphical methods. Error propagation formula. Analysis of variance, statistical process control, contingency tables and nonparametric methods. Use of statistical computer package, MINITAB.
LecturerProfessor Bo Lindqvist, Room 1129, Sentralbygg II, NTNU Gløshaugen Telephone: (735) 93532 Email: [email protected]
Teaching assistantStipendiat Håkon Toftaker, Room 1036, Sentralbygg II, NTNU GløshaugenTelephone (735) 91681Email: [email protected]
4
Teaching material
Main book:Walpole, Myers, Myers and Ye: "Probability and Statistics for Engineers and Scientists". Eighth Edition. Pearson International Edition.
Tables:”Tabeller og formler i statistikk”, 2. utgave. Tapir 2009.
MINITAB:Information is found on http://www.ntnu.no/adm/it/brukerstotte/programvare/minitab.
5
Weekly meetings
LecturesTuesdays 12.15 – 14.00 H3Thursdays 12.15 – 14.00 S4
ExercisesMondays 17.15 – 18.00 in H3
or a computer lab
6
Week Topic Chapter (WMMY) Exercise
2Introduction, motivation and repetition.
Descriptive measures and graphs. Normal plot.
(1-10) Particularly 8.1-8.7 1
3Two-sample case. Comparing variances.
F-distribution. Simple linear regression,
8.8, 9.8, 9.13, 10.8, 10.13, 10.18.(11.1-11.5) 11.6 – 11.12 2-3
4-5 Multiple linear regression 12.1 -12.7, 12.9-12.11 4-5
6-8 2k experiment and fractions thereof BHH 10 and 12 Alternatively, WMMY 15 6-8
9-10 Analysis of variance 13.1-13.4,13.6,13.8-13.10,13.13,13.15 14.1-14.4 9-10
11 Statistical process control 17.1-17.5 11
12 Chi-square tests and Contingency tables 10.14-10.16 12
13-14 (Tuesd) Easter vacation
14(Thursd)-15 Nonparametric statistics 16.1-16.3 12
16 Approximation of expectation and variance
17(Tuesd) Repetition
Preliminary curriculum, lecturing and progress plan
7
8
9
10
11
12
13
14
15
The compulsory project: Example
16
17
18
Introduction to course
TMA 4240/45 Statistics and ST 0103: Probability theory + simple statistics
TMA 4255 (this course):A little probability + APPLICABLE and APPLIED statistics
The ”classical” statistical methods:
• Regression analysis• Design of experiments• Analysis of variance (ANOVA)• Analysis of discrete data (contingency tables)• Nonparametric methods
19
Why is statistics important in science and industry?
The book emphasizes ”the Japanese industrial miracle”:
•Use of statistical methods in design and production•Statistical thinking in all parts of the production
In 2000, the highly reputated international medical journal New England Journal of Medicine appointed
•Use of statistical methods
as one of the 11 most important medical advances throughout time
20
21
Originally: Statistics = ”collection and presentation of data”
Today much more:
•Design and collection of data from statistical investigations
•Modeling of the stochastic mechanisms behind the data
•Drawing conclusions about these mechanisms, based on the data
•Evaluation of the strength of the conclusions (variance, confidence interval, test power)
•Basic tool: Probability theory
22
Statistical investigations can be divided into two main types:
•Experimental studies based on design of experiments (DOE): Experiments are done under controlled conditions.
•Observational studies: When control of conditions are not possible.
23
Statistical studiesEksperimental studies:Clinical trialsComparison of drugs A og BTrial group of n persons
r drawn at random aregiven A
s drawn at random aregiven B
n-r-s (rest) get Placebo
Blind test: Patient does not know kind of drug
Double blind: Examining doctor does not know either
Observational studies:Epidemiological experiments• Smoking and cancer• Diet and coronary diseases
Cannot control the conditions; e.g. cannot force people to smoke/not smoke.
Difficulty in interpretation: May be unknown underlying causes which make the results biased (”confounding”). For example: A gene which increases the need for smoking, and at the same time influences chance of getting cancer.
24
Statistics in scientific investigations:
•Generate hypotheses
•Derive consequences
•See whether these are fulfilled in observations
•Generate new hypotheses
•Etc.
25
Particular matters for statistics in industry:
Product and process engineer:
Off-line: Controlled experiments aiming at optimizing production
Production: Register production data; use them to control production.
26
MINITAB 15: Rocket fuel example Stat > Basic Statistics > Display Descriptive Statistics
27
Stat > Basic Statistics > Graphical Summary
42403836
Median
Mean
42,041,541,040,540,039,539,0
1st Q uartile 39,275Median 41,0003rd Q uartile 42,000Maximum 42,600
39,136 41,984
39,266 42,037
1,369 3,634
A -Squared 0,59P-V alue 0,092
Mean 40,560StDev 1,991V ariance 3,963Skewness -1,55360Kurtosis 2,72795N 10
Minimum 35,900
A nderson-Darling Normality Test
95% C onfidence Interv al for Mean
95% C onfidence Interv al for Median
95% C onfidence Interv al for StDev95% Confidence Intervals
Summary for X
28
More plots: Rocket fuel data
42,341,440,539,638,737,836,936,0X
Dotplot of X
47,545,042,540,037,535,0
99
95
90
80
70
60504030
20
10
5
1
X
Perc
ent
Mean 40,56StDev 1,991N 10AD 0,589P-Value 0,092
Probability Plot of XNormal - 95% CI
45,042,540,037,535,0
100
80
60
40
20
0
X
Perc
ent
Mean 40,56StDev 1,991N 10
Empirical CDF of XNormal
29
30
31
Statistical inference with MINITAB: Rocket fuel data
Stat > Basic Statistics > 1-Sample Z Assume known standard deviation sigma=2
One-Sample Z: X
Test of mu = 40 vs not = 40The assumed standard deviation = 2
Variable N Mean StDev SE Mean 95% CI Z PX 10 40,560 1,991 0,632 (39,320; 41,800) 0,89 0,376
One-Sample Z: X
Test of mu = 40 vs > 40The assumed standard deviation = 2
95% LowerVariable N Mean StDev SE Mean Bound Z PX 10 40,560 1,991 0,632 39,520 0,89 0,188
32
Stat > Basic Statistics > 1-Sample t Assume unknown standard deviation sigma
One-Sample T: X
Test of mu = 40 vs not = 40
Variable N Mean StDev SE Mean 95% CI T PX 10 40,560 1,991 0,629 (39,136; 41,984) 0,89 0,397
One-Sample T: X
Test of mu = 40 vs > 40
95% LowerVariable N Mean StDev SE Mean Bound T PX 10 40,560 1,991 0,629 39,406 0,89 0,198
33
34
One and two sample tests concerning means
35
Example: The industrial experment
An experiment was performed on a manufacturing plant by making in sequence 10 batches using the standard production method (A), followed by 10 batches of a modified method (B). The results from these trials are given in the table on the next slide. What evidence do the data provide that method B is better than method A?
36
37
38
(Assuming equal variances:)
39
(Not assuming equal variances:)
40
Two sample t-test
41
Paired observations (example)
42
43
44
Met
ode
95% Bonferroni Confidence Intervals for StDevs
2
1
8765432
Met
ode
X
2
1
92908886848280
F-Test
0,390
Test Statistic 1,58P-Value 0,505
Levene's Test
Test Statistic 0,78P-Value
Test for Equal Variances for X
45
46
Regression analysisResponseDependent variable (WMMY)
Y
Explanatory variablePredictorCovariateIndependent variable (WMMY)Regressor (WMMY)
Goal: Describe Y as function of the xi,
47
Linear Regression
48
Typical data
49
EXAMPLE
Connection between stiffness (stivheit) and density (tettleik) of a tree product.
50
Plot of stiffness versus density
252015105
100000
80000
60000
40000
20000
0
tettleik
stiv
heit
Scatterplot of stivheit vs tettleik
51
52
Least squares regression line
252015105
100000
80000
60000
40000
20000
0
tettleik
stiv
heit
Scatterplot of stivheit vs tettleik
53
54
55
56
Regression line for logstiv ( = log(stivheit) )
252015105
11,5
11,0
10,5
10,0
9,5
9,0
8,5
tettleik
logs
tiv
Scatterplot of logstiv vs tettleik
57
58
59
Predicted values for new observations
60
61
62
63
64
65
66
Multiple Linear Regression
67
Example – Acid rain in Norwegian lakesData from a study of the influence of acid rain on Norwegian lakes, made in 1986. Totally 1005 lakes were studied. In this example are chosen 26 lakes, 16 from Telemark and 10 from Trøndelag.
Variable name
Meaning
x1 Content of SO4 (mg/l)
x2 Content of NO3 (mg/l)
x3 Content of Ca (mg/l)
x4 Content of latent aluminum (mu g/l)
x5 Content of organic substance (mg/l)
x6 Area of lake
x7 Site (0 = Telemark, 1 = Trøndelag)
y Measured pHz Fish status (1 = no damage, 2 = minor damage, 3 = major damage, 4 = no fish)
ROW x1 x2 x3 x4 x5 x6 x7 y z1 4.9 39 1.54 78 2.02 0.30 0 5.38 32 4.1 75 1.55 17 2.98 1.85 0 5.68 13 3.5 80 0.83 157 3.40 0.25 0 5.04 34 3.8 75 0.53 163 3.42 0.30 0 4.81 45 3 8 90 0 82 105 3 41 0 25 0 4 92 4
68
69
70
71
72
73
74
75
76
Model using x1,x2,x3
77
78
79
80
81
Simulation20 observations are simulatedfrom the modelY = 1 + ….WithEpsilon N(0,0.02^2)X1 uniform on (1,3)X2 is N(1,0.5^2)
82
The wrong model
Y = beta0 …..
is estimated by MINITAB
83
Normal plot looks at first OK, but has a systematic convex shape. Thehistogram is skew to the left, while the other plots are fairly OK.
84
The residuals vs x1 have an ”opposite U” shape, which suggests that model (1)is not correct. Residuals vs x2 look better, but there is possibly something wronghere, too.
85
The residual plot suggests thatx1 should be transformed. Weuse log x1 (which is of coursecorrect since we know theunderlying model. Estimate thegiven model in MINITAB
86
87
88
Example: Multicollinearity
89
90
91
92
Factorial Experiments
93
94
95
96
97
98
99
Back to two-factor experiment
100
DOE terminology – main effects:
101
Interaction effects
102
103
Three factors
z1 z2 z3 z12 z13 z23 z123
104
Two-factor interaction
105
Three-factor interaction
106
Or: using + and – in columns:
107
108
109
Cube plot
110
Four factors – example
111
Full Factorial Design
Factors: 4 Base Design: 4; 16Runs: 16 Replicates: 1Blocks: 1 Center pts (total): 0
All terms are free from aliasing.
112
Factorial Fit: Y versus A; B; C; D
Estimated Effects and Coefficients for Y (coded units)
Term Effect CoefConstant 72,250A -8,000 -4,000B 24,000 12,000C -2,250 -1,125D -5,500 -2,750A*B 1,000 0,500A*C 0,750 0,375A*D -0,000 -0,000B*C -1,250 -0,625B*D 4,500 2,250C*D -0,250 -0,125A*B*C -0,750 -0,375A*B*D 0,500 0,250A*C*D -0,250 -0,125B*C*D -0,750 -0,375A*B*C*D -0,250 -0,125
S = *
113
Analysis of Variance for Y (coded units)
Source DF Seq SS Adj SS Adj MS F PMain Effects 4 2701,25 2701,25 675,313 * *2-Way Interactions 6 93,75 93,75 15,625 * *3-Way Interactions 4 5,75 5,75 1,438 * *4-Way Interactions 1 0,25 0,25 0,250 * *Residual Error 0 * * *Total 15 2801,00
114
Effect
Perc
ent
2520151050-5-10
99
95
90
80
7060504030
20
10
5
1
Factor
D
NameA AB BC CD
Effect TypeNot SignificantSignificant
BD
D
B
A
Normal Probability Plot of the Effects(response is Y, Alpha = ,05)
Lenth's PSE = 1,125
115
Term
Effect
AD
CDACD
ABCDABD
ACABCBCD
ABBC
CBDDA
B
2520151050
2,89Factor
D
NameA AB BC CD
Pareto Chart of the Effects(response is Y, Alpha = ,05)
Lenth's PSE = 1,125
116
Mea
n of
Y
1-1
84
78
72
66
601-1
1-1
84
78
72
66
60
1-1
A B
C D
Main Effects Plot (data means) for Y
117
A
1-1 1-1 1-1
80
70
60
B
80
70
60
C
80
70
60
D
A-11
B-11
C-11
Interaction Plot (data means) for Y
118
Example: Three factors and replicate
119
Factorial Fit: Y versus A; B; C
Estimated Effects and Coefficients for Y (coded units)
Term Effect Coef SE Coef T PConstant 64,250 0,7071 90,86 0,000A 23,000 11,500 0,7071 16,26 0,000B -5,000 -2,500 0,7071 -3,54 0,008C 1,500 0,750 0,7071 1,06 0,320A*B 1,500 0,750 0,7071 1,06 0,320A*C 10,000 5,000 0,7071 7,07 0,000B*C 0,000 0,000 0,7071 0,00 1,000A*B*C 0,500 0,250 0,7071 0,35 0,733
S = 2,82843 R-Sq = 97,63% R-Sq(adj) = 95,55%
Analysis of Variance for Y (coded units)
Source DF Seq SS Adj SS Adj MS F PMain Effects 3 2225,00 2225,00 741,667 92,71 0,0002-Way Interactions 3 409,00 409,00 136,333 17,04 0,0013-Way Interactions 1 1,00 1,00 1,000 0,13 0,733Residual Error 8 64,00 64,00 8,000
Pure Error 8 64,00 64,00 8,000Total 15 2699,00
120
Term
Standardized Effect
BC
ABC
C
AB
B
AC
A
181614121086420
2,31Factor NameA AB BC C
Pareto Chart of the Standardized Effects(response is Y, Alpha = ,05)
121
Standardized Effect
Perc
ent
151050-5
99
95
90
80
70
60504030
20
10
5
1
Factor NameA AB BC C
Effect TypeNot SignificantSignificant
AC
B
A
Normal Probability Plot of the Standardized Effects(response is Y, Alpha = ,05)
122
Blocking in 2^k experimentsFull experiment:
Two blocks:Use column ABC as generator, i.e.Block 1 consists of experiments with ABC = -1Block 2 consists of experiments with ABC = 1
123
The interaction ABC is confounded (”mixed”) with the block effect. This means that the value of the estimated coefficient of ABC can bedue to both interaction effect and block-effect.
Suppose all Y in block 2 are increased by a value h. Then the estimatedeffect of ABC will increase by h. But one cannot know from observationswhether this is due to the interaction ABC or the block effect.
On the other hand, the estimated main effects A,B,C and the two-factorinteractions AB,AC,BC are not changed by the h. These are of mostimportance to estimate, so the choice of blocking seems reasonable.
124
Four blocks in 2^3 experiment
Need two columns of +/- to define 4 blocks. Turns out that the bestoption is to use two two-factor interactions, e.g. AB and AC (whichis default in MINITAB
Block 1: Experiments where AB = AC = -1Block 2: Experiments where AB = -1, AC = 1Block 3: Experiments where AB = 1, AC = -1Block 4: Experiments where AB = AC = 1
125
Block structure is as follows:
Interaction effects AB and AC are confounded with the block effect, since theyare generators for the blocks. In addition, their product AB*AC = AABC = BCis confounded with the block effect (Note: the BC column is constant within eachblock.
Adding h2 to block 2, h3 to block 3, h4 to block 4 does not change estimatedeffects of A,B,C, and also does not change the third order interaction ABC. However, e.g. AB will change by 2h3+2h4-2h2 and we do not know whetherthis is due to an interaction effect or blocking effect: This is CONFOUNDING.
126
How to determine which columns to use for blocking?
Idea: Try to leave estimates for main effects and low order interactionsunchanged by blocking.
Note: I = AA = BB = CC where I is a column of 1’s
Find the blocks for a 2^3 experiment using columns ABC and AC.The interaction between ABC and AC is
ABC*AC = AA*B*CC = B
which is a main effect, which hence is confounded with the block effect(in addition to ABC and AC)
127
128
Example obligatory project
129
130
131
132
MINITAB plots
133
Assuming third and fourth order interactions are 0
134
Interaction plots
135
136
137
138
139
140
From Exam in TMA4260 Industrial Statistics, december 2003, Exercise 2
A company decides to investigate the hardening process of a ballbearing production. The following four factors are chosen:
A: content of added carbonB: Hardening temperatureC: Hardening timeD: Cooling temperature.
Design and results are given below:
a) What is the generator and the defining relation of the design, and what is thedesign’s resolution? Write down the alias structure.
Find the estimates of the main effect of A and the interaction effect AC.
141
b) What is the variance of the main effect A and the interaction AC?
Assume that the st deviation sigma has been estimated from other experiments, by s = 0.312 with 9 degrees of freedom (in the exam, this had been done in Ex 1.)Use this estimate to find out whether the interaction AC is significantly different from 0 (i.e. ”active”) Use 5% significance level. What is the conclusion of the experimentso far?
142
The company is well satisfied with the results so far and they decide to carry outalso the other half fraction. The result of the other half fraction is given below.
143
Use this to find unconfounded estimates for the main effects and the two-factorinteractions.
Assume that one would like to estimate the variance of the effects from thehigher order interactions. Explain how this can be done, and find the estimate.Is it wise to include the four-factor interaction in this calculation? Why (not)?
Later, one of the operators that participated in the experiments asked whetherone could have carried out the first half fraction in (a) in two blocks. This would,he said, have simplified considerably the performance of the experiments. What answer would you give to the operator?
144
From Exam in SIF 5066 Experimental design and…, May 2003, Exercise 1
A company making ballbearings experienced problems with the lifetimes of theproducts. In an experiments that they carried out they considered the factors
A: type of ball – standard (-) or modified (+)B: type of cage - standard (-) or modified (+)C: type of lubricate - standard (-) or modified (+)D: quantity of lubricate – normal (-) or large (+)
The repsonse was the lifetime of the ballbearing in an accelerated life testing experiment. The results are given on the next page.
145
A: type of ball B: type of cageC: type of lubricateD: quantity of lubricate
a) What type of experiment is this? What is the defining relation? What is theresolution? Calculate estimates of the main effect of A and the two-factorinteraction AB.
b) Estimated contrasts for B,C,D,AC,AD are, respectively, 0.60, 0.31, 0.22, -0.11,-0.01. What can you say about the estimated effects for CD, BD, BC. BCD, ACD,ABD,ABC?Assume that factors C and D do not influence the response. Explain why this is then a 2^2 experiment with replicate. Calculate an estimate for the variance of theeffects, and find out whether A, B and AB are now significant.
c) Give an interpretation of the results. The experiment was in fact carried out in twoblocks, where experiments 1-4 was one block and 5-8 the other. How is thisblocking constructed? How will we need to modify the analysis of significance in (b)?(Assume again that C,D do not influence the response)