1
Basic Experimentation
Notes developed by Ken Lulay, Mechanical Engineering, University of Portland, July 2008
2
Objectives
You will be able to:
– Understand basic experiment “vocabulary”
– Design and analyze a single variable experiment (2 level factor)
– Design and analyze a multi-variable experiment (2 level factors)
3
Basic Experimentation - Overview
Experiment Basics
Single variable experiments – design and analysis
Multi-variable experiments – design and analysis
4
Overview of Experiment Basics
Differences between testing and experimenting
Experimental variables
Errors: systematic and random
5
Experimenting and testing…
…both require obtaining data (taking measurements), but…
…what are they and how are they different?
6
Testing
Testing may involve investigating only one set of conditions.
Usually evaluates performance
– Example: determine the strength of a material
May be a standardized test (ASTM, ISO)
Often has pass/fail criteria
– Does it meet specifications or not?
We will not be discussing “testing”
7
Experimentation
Performed to increase knowledge of how things perform under differing conditions
Vary the input to determine the response
Requires more than one set of conditions (design points)
Evaluate “better/worse” (not pass/fail)
8
Experimentation & Testing
A BIG difference:
Tests are often routine
– Same tests done daily!
– Analogous to daily commuting to work
Experiments are “unique”
– Usually done only once!
– Therefore, they require more careful planning!
– Analogous to a vacation trip
9
Variables
Variables are physical quantities that may or may not affect the results of an experiment or test.
Several types of variables are associated with any test or experiment:
Controlled or Extraneous
– Controlled variables are held constant or intentionally manipulated (changed) during an experiment.
– Extraneous variables are not controlled. They are generally assumed to have no effect on the response (ex: ambient room temperature).
10
Variables
Dependent or Independent
– The magnitude (value) of a dependent variable depends upon other variables, whereas the magnitude of an independent variable does not.
– Ex: in an experiment to determine the effect of temperature change on the toughness of AISI 1045 steel, temperature would be an independent variable and toughness would be a dependent variable.
Continuous or Discrete (a.k.a. Categorical)
– Discrete variables cannot take on a continuous range of values. Ex: Red/Green; Company A/Company B.
– Continuous variables can take on a continuous range. Ex: temperature, toughness, force
11
Terminology
Factor - an independent variable in an experiment - factor levels are intentionally varied in an experiment to see what the effect is on the response.
Factor Level - the target value of the factor. Example: pressure may be set to two levels: 0.5 Atm, and 1.0 Atm
Response - the thing to be measured. Example, if you want to determine the yield strength at different temperatures, the yield strength is the response.
12
Variables and Levels
Proper selection of appropriate variables and their levels is not trivial but is critical
Selecting proper factors and levels is worth the effort. Don’t rush this step.
Differences in factor levels:
– Factor levels must be “well separated”
– Far enough apart to be “different” (produce-ably and measurably)
– Not too far apart to be “unreasonable” (non-linear responses can be an issue – may miss the optimum)
13
Purpose of experiments?
The sole purpose of our experiments will be to answer the following questions:
Does changing one or more factors have a statistically significant effect on the response(s)?
And if so, which factors appear to have the most significant effect?
14
Practice
A materials engineer wants to study the effect of molybdenum content in a particular high alloy steel on the yield strength at various temperatures.
For this experiment:
– Define the factors and their levels
– Define the response
– Identify “all” variables and classify them: controlled/extraneous, discrete/continuous, dependent/independent
15
Practice (“answers”)
Factors (controlled):
– Molybdenum content; levels: 5.1% and 5.2%?
– Test temperature; levels: -50F, 1000F?
Also control:
– Test bar geometry, chemistry (other than Mo), strain rate, measurement methods and systems, test methods and systems,…
Extraneous: humidity, …
Dependent: yield strength (response)
16
Errors
Errors (measurement variation) are due to a number of factors:
Measurement error
– error = measured value - true value
Changes in test specimen
– Ex: one specimen has a slightly larger diameter
Changes in environment
– Ex: ambient temperature increase
Et cetera
17
Errors
The “true” value is the value one would obtain with a perfect measurement.
The true value is never known in an experiment
Therefore, error can never be known exactly; it can only be estimated using statistical analysis.
Errors are inherent in measuring devices and caused by uncontrollable variations within the experiment.
18
Systematic and Random Errors
In any experiment, two types of error can exist:
Systematic
Random
19
Systematic Errors
Caused by underlying factors which affect the results in a “consistent/reproducible” and sometimes “knowable” way
Sometimes referred to as “bias”
Not random
DANGER: can lead to false conclusions!
– Discussed now; an example follows later
Can be managed (effects reduced) by properly designed experiments (randomizing the test conditions).
20
Causes of Systematic Errors
Unknown changes during the experiment: temperature, procedures, equipment, etc.
Different batches of material or samples
Et cetera
21
Random Errors
Show no reproducible pattern – they are random.
Sometimes referred to as “noise.”
Typically have a normal distribution (bell shaped)
Averaging several readings can reduce random errors.
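The effect of averaging on random error can be illustrated with a small simulation. This is a minimal sketch rather than part of the original notes: it assumes normally distributed noise with a standard deviation of 1 (the `sigma` parameter) and a fixed seed, and the helper name `spread_of_averages` is made up for illustration.

```python
import random
import statistics

def spread_of_averages(readings_per_average, trials=2000, sigma=1.0, seed=0):
    """Standard deviation of the average of `readings_per_average` noisy readings."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    averages = []
    for _ in range(trials):
        readings = [rng.gauss(0.0, sigma) for _ in range(readings_per_average)]
        averages.append(statistics.mean(readings))
    return statistics.stdev(averages)

single = spread_of_averages(1)      # scatter of individual readings
averaged = spread_of_averages(16)   # scatter of 16-reading averages
print(f"scatter of single readings:     {single:.3f}")
print(f"scatter of 16-reading averages: {averaged:.3f}")
```

For independent readings the scatter of an average of n readings is σ/√n, so the 16-reading averages should scatter about one quarter as much as single readings. Averaging does nothing for systematic errors.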
22
Practice
Consider the previous example (experiment to determine effect of varying molybdenum content and temperature on yield strength)
Make a list of possible systematic errors and random errors for the design on next slide…
23
Practice
Run Moly Temp.
1 2% 50F
2 2% 50F
3 2% 50F
4 10% 50F
5 10% 50F
6 10% 50F
7 2% 150F
8 2% 150F
9 2% 150F
10 10% 150F
11 10% 150F
12 10% 150F
•The Experiment:
–Two batches of steel: 2wt%Mo & 10wt%Mo
–Test bars are machined by outside company
–Two test temperatures: 50F, 150F
PRACTICE: Make a list of possible systematic errors and random errors
24
Practice (“answers”)
Possible systematic errors:
Batches of steel (chemistry variation of other elements)
– How could this effect be mitigated?
Machining of specimens (did moly content affect machining quality? Were specimens machined in batches with different diameters?)
Temperature drift during testing (maybe from 52F towards 48F, and from 152F towards 148F)?
Variation between beginning and end of test (measurement systems, operator, test equipment, test procedures…)
How could systematic errors have been reduced?
25
Practice (“answers”)
Possible random errors:
– Measurement errors
– Diameter of bars (maybe random)
– Load cell variation
– Others?
26
Review of Terminology
Do Exercise 1 (definitions).
27
Single Variable Experiments to follow
28
Overview of Single Variable Experiments
Basic Design of Experiments (DOE)
Example of “how not to”
Statistics and t-testing
Hypothesis testing
Confidence Intervals
29
Design of Experiments (DOE)
By careful design, errors can be mitigated
– Systematic errors are mitigated by randomizing the test conditions (randomized run order)
– Random errors are mitigated by increasing the number of data points
Design is a compromise of competing criteria:
– Cost, time, availability of equipment, etc.
– Control over variables
– Importance of results and conclusion
CAREFUL PLANNING is REQUIRED!
Let’s look at a basic example…
30
Example: Single Variable Experiment
Wacky Engineer, a new employee at ASKO, believes that the color of paint applied to a tensile bar can affect the strength.
Let’s take a look at this experiment…
31
Single Variable Experiment
Determine if paint color affects strength of tensile bars
Factor 1: paint color
– Levels: Red, Green
Other controlled variables: test specimen geometry and material (constant)
Response: yield strength of bar
Results: Red = 81.9 ksi, Green = 80.2 ksi
Did color of paint have an effect?
Not a well thought out experiment
We need more and better data…
32
Single Variable Experiment
New Experiment with more data: paint five bars red and five green
Red paint is available, green paint is on backorder.
Your boss really wants data soon!
Test facility is available, so show progress:
– Paint and test red bars!
– Green paint arrives, complete the testing!
33
Single Variable Experiment
The results:
R: 80.3, 81.2, 82.1, 83.1, 82.2; Ave=81.9
G: 78.2, 82.1, 80.8, 81.6, 81.1; Ave=80.2
The red bars were stronger on average.
Same operator did all testing.
Red bars were the first tensile bars he had ever tested.
Did color of paint have an effect?
This is another poorly thought out experiment.
What are some problems with this experiment?
34
Another, Better Example
Re-do the prior experiment, but randomize
Randomize by using the following run order:
R, G, G, R, G, G, R, R, G, R
Why randomize?
Why would the following run order not be “OK”?
R, G, R, G, R, G, R, G, R, G
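A randomized run order like the one above can be generated rather than written by hand. The following is a minimal sketch using Python's standard `random` module. The function names and the seed value are illustrative assumptions, not part of the original notes.

```python
import random

def randomized_run_order(levels, runs_per_level, seed=None):
    """Return a shuffled run order with `runs_per_level` runs of each level."""
    order = [level for level in levels for _ in range(runs_per_level)]
    random.Random(seed).shuffle(order)  # seed only makes the sketch repeatable
    return order

def is_strictly_alternating(order):
    """True when no two consecutive runs share a level (R, G, R, G, ...)."""
    return all(a != b for a, b in zip(order, order[1:]))

run_order = randomized_run_order(["R", "G"], runs_per_level=5, seed=2008)
print(", ".join(run_order))
```

The `is_strictly_alternating` helper flags the second order shown above (R, G, R, G, …), which is a fixed pattern rather than a random one.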
35
Better Example
The randomized run order results:
R: 80.3, 81.2, 82.1, 83.1, 82.2; Ave=81.9
G: 78.2, 82.1, 79.8, 79.6, 81.1; Ave=80.2
R & G averages are different, but did color of paint really have an effect?
Averages are only part of the answer
“Statistically significant” difference depends upon both the averages and the variation.
36
Plot the Data
[Plot of the red and green results; axis marked 79, 81, 83]
• Looks like Red paint increased the strength!
• Will your boss believe this?
• How certain are you that the effect is real?
• How likely is this to be a “fluke”?
37
Need some statistical stuff…
38
Probability Distribution
Assume distribution is “normal”!!!
Measurements are a sample of the total
We can never be 100% certain about experimental results (variation, error).
Can only estimate “likelihood” or “probability”
[Figure: probability density f(x) with limits a and b marked]
$$\Pr(a \le x \le b) = \int_a^b f(x)\,dx$$
39
t-test
Comparing the averages is NOT sufficient!
The best way to answer “are they different” is with the t-test.
The t-test incorporates both the deviation of the data as well as the means.
40
t-test – what does it do?
Consider two sets of sampled data
– Are their true means likely different?
What about these two sets?
t-test will help us decide
[Figure annotation: Both sets have the same averages]
41
Basic Statistics
μ = true mean
X = estimated mean based on finite sample size
σ = true standard deviation
S = estimated standard deviation based on the finite sample size
n = number of samples
xi is the value of the ith sample
$$X = \frac{1}{n}\sum_{i=1}^{n} x_i$$ (Equation 1)
$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - X)^2}$$ (Equation 2)
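Equations 1 and 2 can be checked against the green-bar readings from the randomized paint experiment. This is a minimal sketch; the function names are illustrative, and only the five readings come from the notes.

```python
import math

def sample_mean(values):
    """Equation 1: X = (1/n) * sum(x_i)."""
    return sum(values) / len(values)

def sample_std(values):
    """Equation 2: S = sqrt((1/(n-1)) * sum((x_i - X)^2))."""
    mean = sample_mean(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))

green = [78.2, 82.1, 79.8, 79.6, 81.1]  # green bars, randomized run order (ksi)
x_green = sample_mean(green)
s_green = sample_std(green)
print(f"X = {x_green:.2f}, S = {s_green:.2f}, S^2 = {s_green ** 2:.2f}")
```

Note the n - 1 divisor in Equation 2: the standard library's `statistics.stdev` uses the same convention.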
42
Basic Statistics, Continued
Note: X is an estimate of the actual mean (μ). It becomes closer to μ with increasing sample size, n. X is itself a random variable: each new sample gives a slightly different estimate of the true mean, μ.
For normally distributed data:
– 68.3% of all data will be within +/- 1σ of the mean
– 95.4% of all data will be within +/- 2σ of the mean
– 99.7% of all data will be within +/- 3σ of the mean
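The three percentages follow from the normal distribution itself. As a quick check (a sketch, not part of the original notes), the fraction of a normal population within ±k standard deviations of the mean is erf(k/√2), which the standard library can evaluate directly.

```python
import math

def fraction_within(k):
    """Fraction of a normal population within +/- k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"+/- {k} sigma: {100 * fraction_within(k):.1f}%")
```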
43
Hypothesis Testing
We want to determine if color of paint had an effect on strength (Red vs. Green, prior example).
Hypothesize there is no effect due to paint color (this is the so-called “null hypothesis”, or H0). In other words, we claim that:
μRed = μGreen
We have sample means (XR=81.9, XG=80.2) which are estimates for the true means (μRed, μGreen), but we can never know the true means exactly.
44
Statistics
Assume the deviations are the same (σR = σG)
“Pool” the deviations:
$$S_p^2 = \frac{(n_R - 1)S_R^2 + (n_G - 1)S_G^2}{(n_R - 1) + (n_G - 1)}$$ (Equation 3)
For our Paint Color experiment:
nR = nG = 5, SR^2 = 1.33; SG^2 = 2.23
Sp^2 = {(5-1)*1.33 + (5-1)*2.23} / {(5-1) + (5-1)}
Sp^2 = 1.78
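The pooled variance is a weighted average of the two sample variances, weighted by degrees of freedom. A minimal sketch of Equation 3 follows, using the values quoted for the paint experiment; the function name is an illustrative assumption.

```python
def pooled_variance(n_a, var_a, n_b, var_b):
    """Equation 3: weight each sample variance by its degrees of freedom."""
    return ((n_a - 1) * var_a + (n_b - 1) * var_b) / ((n_a - 1) + (n_b - 1))

sp2 = pooled_variance(5, 1.33, 5, 2.23)
print(f"Sp^2 = {sp2:.2f}")
```

With equal sample sizes the pooled variance is simply the average of the two variances. The weighting only matters when the groups differ in size.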
45
t-test
We now define t0, which is from the t-distribution (don’t worry about what that means):
$$t_0 = \frac{|X_R - X_G|}{\sqrt{S_p^2\left(\frac{1}{n_R} + \frac{1}{n_G}\right)}}$$ (Equation 4)
For our example:
t0 = ABS{81.9 – 80.2} / {1.78 (1/5 + 1/5)}^(1/2)
t0 = 2.06
46
t-test
So what is this “t0” number?
Notice the “effect” (difference between the two samples) is in the numerator, the variation (“noise”) is in the denominator.
The larger t0 is, the greater the probability that the effect (difference) is real. How large is large?
$$t_0 = \frac{|X_R - X_G|}{\sqrt{S_p^2\left(\frac{1}{n_R} + \frac{1}{n_G}\right)}}$$
– Numerator: “Effect”
– Denominator: “Error” or “variance”
47
t-test
To determine the t-distribution value we need to know the degrees of freedom and select a confidence level
Determine the degrees of freedom in our experiment:
DOF = (nR - 1) + (nG - 1) = (5 - 1) + (5 - 1) = 8
We need to compare the calculated t0 with tabulated values from the t-distribution with the corresponding degrees of freedom (8) at some level of confidence
Confidence level is our choice, typically 95% or 99%.
48
t-distribution Table
We select 95% confidence as our criterion
For 95% confidence interval, α = 0.05
There are 8 degrees of freedom in this experiment
From t-distribution: tα/2, DOF = t0.05/2, 8 = 2.31
t-distribution values are obtained from tables in most statistics/experimentation books.
Note: tα/2 means we are using a 2-sided or 2-tailed test, which is appropriate for the hypothesis μR = μG. If we were to ask the question “is μR > μG?”, then we would use the single-sided t-table (tα, DOF).
49
t-test
In our paint example t0 < tα/2, DOF (2.06 < 2.31)
t0 is too small to reject the null hypothesis at 95% confidence.
Therefore, we accept (more precisely, fail to reject) the null hypothesis (μRed = μGreen).
This does not mean we are 95% confident that the bars painted red were equal in strength to the green bars. It means we cannot say with confidence that they are different. Next slide…
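The comparison of t0 against the table value can be written out as a short calculation. This is a sketch under stated assumptions: it uses the rounded summary values quoted in these notes, takes the critical value 2.31 from the t-table rather than computing it, and the names `t_statistic` and `T_CRITICAL` are illustrative. With these rounded inputs Equation 4 evaluates to about 2.01, slightly below the 2.06 quoted above. The difference presumably comes from rounding in the intermediate values, and the conclusion is the same either way.

```python
import math

def t_statistic(mean_a, mean_b, pooled_var, n_a, n_b):
    """Equation 4: |difference in means| divided by its pooled standard error."""
    return abs(mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))

T_CRITICAL = 2.31  # t(alpha/2 = 0.025, DOF = 8), as read from a t-table

t0 = t_statistic(81.9, 80.2, 1.78, 5, 5)
reject_null = t0 > T_CRITICAL
print(f"t0 = {t0:.2f}, critical value = {T_CRITICAL}, reject H0: {reject_null}")
```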
50
95% Confident?
“Cannot say they are different” is not equal to “saying they are the same”
A “well mixed” box of 100 apples: any of the apples can be either red or green.
– Hypothesis: number of red = number of green
New box, pull 30 out: 1 is red, 29 are green – would you reject the null hypothesis?
Null hypothesis: 50 are red, 50 are green
– Pull 30 apples out: 15 red, 15 are green – would you reject the null hypothesis?
– Pull 99 apples: 49 are red, 50 are green…?
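The apple example can be made quantitative. Assuming the apples are drawn without replacement from a box that really holds 50 red and 50 green (the null hypothesis), the number of red apples in the draw follows a hypergeometric distribution. The sketch below, with an illustrative function name, computes how likely it is to see 1 or fewer red apples in 30 draws.

```python
from math import comb

def prob_at_most_red(k, draws, red=50, total=100):
    """P(at most k red apples in `draws` apples taken without replacement)."""
    green = total - red
    ways = sum(comb(red, r) * comb(green, draws - r) for r in range(k + 1))
    return ways / comb(total, draws)

p_extreme = prob_at_most_red(1, draws=30)
print(f"P(1 or fewer red in 30 draws | 50 red, 50 green) = {p_extreme:.1e}")
```

A very small probability is grounds to reject the null hypothesis. The reverse does not hold: drawing 15 red and 15 green is consistent with a 50/50 box but also with many other mixes, so it does not prove the hypothesis.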
51
Exercise – t-test
Task: conduct an experiment to determine if there is a difference in the fatigue life between two brands of paperclips.
Fatigue life is defined in this experiment as number of times the clip can be bent back and forth 90 degrees. One 90 degree bend is one fatigue cycle.
If there are 10 or more people in the class, split the class in half (effectively conducting two identical experiments with about 5 data points per paperclip brand in each experiment).
Each person in the class should break one of each brand and record their results.
Use the worksheets in back of this book.
52
Pairing – a special condition t-test
Determining relative magnitudes of the “effect” and “noise” is foundational for statistical analysis of experimental data.
“Noise” comes from many sources: differing batches of specimens, differing test or measurement apparatus, operator differences, etc.
If we can “filter out” noise, we would be able to perform a more effective analysis.
53
Pairing – a special condition t-test
If there is a single source of noise that we can identify and control, we may be able to “pair” the data.
For example, if we want to test the wear life of a new alloy, we may conduct an experiment to compare the life of the new alloy with a traditional alloy.
Put several of each part on buckets in the field.
Due to differing loading conditions, one would expect there to be large variability in wear from one bucket to another. Therefore, the bucket variation will introduce a large amount of noise.
54
Pairing – a special condition t-test
Pairing: put one sample of each alloy on each bucket (alternating location (left-right) of the two alloys from bucket to bucket). Determine the difference in life on each bucket between the two alloys.
Since each alloy on a given bucket presumably will experience similar loading, pairing will effectively “filter out” the noise contributed by bucket-to-bucket variation.
The null hypothesis now becomes μD = 0, where μD is the difference in true means between the two groups (alloys, in this example).
55
Pairing – a special condition t-test
For paired t-test
$$t_0 = \frac{X_D}{\sqrt{S_D^2 / n_D}}$$
where
$$S_D^2 = \frac{1}{n_D - 1}\sum_{i=1}^{n_D}(d_i - X_D)^2$$ (Equation 5)
XD is the average of the differences, nD is the number of pairs of data, di is the individual data (difference).
56
Exercise - Pairing
What about our paperclip experiment?
A potentially large source of error was operator-to-operator variability.
Each operator tested one of each paperclip, therefore we can analyze using pairing! How lucky!
Using the worksheet in the back of this booklet, re-analyze the paperclip fatigue data using pairing (Exercise 3).
A bit more on pairing…
Situation:
Your company produces optical supplies. The quality of optical mirrors is not satisfactory. You believe that the problem has to do with grinding speed.
Given:
*Your company has 12 grinding machines to produce optical mirrors.
*The machines are numbered 1-12, but are randomly placed throughout the shop.
*You are allowed to use a total of 24 mirrors in your experiment.
Task:
Design an experiment (i.e. fill in the table below) to determine which grinding speed is better, Fast or Slow. At a maximum, you will have 24 runs. You are not trying to evaluate grinding machine performance, so you may use any number of grinding machines (1,2,...12).
57
Run  Grinder  Speed
1    2        Slow
2    3        Slow
3    11       Fast
4    6        Slow
5    1        Slow
6    5        Fast
7    10       Slow
8    8        Fast
9    2        Fast
10   12       Slow
11   12       Fast
12   9        Slow
13   4        Slow
14   7        Fast
15   5        Slow
16   6        Fast
17   11       Slow
18   4        Fast
19   9        Fast
20   10       Fast
21   8        Slow
22   3        Fast
23   1        Fast
24   7        Slow
58
Grinder    Response for Slow  Response for Fast  Difference
1          1.22               1.96               0.74
2          1.63               1.80               0.17
3          2.42               3.01               0.59
4          3.12               3.05               -0.07
5          0.76               1.23               0.47
6          4.23               4.89               0.66
7          1.58               1.30               -0.28
8          2.81               3.17               0.36
9          2.19               2.94               0.75
10         3.75               3.90               0.15
11         1.66               2.28               0.62
12         3.80               4.40               0.60
AVERAGE    XL=2.431           XH=2.828           XD=0.3967
Deviation  sL=1.118           SH=1.171           sD=0.335
Samples    nL=12              nH=12              nD=12
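The grinder data can be analyzed both ways to see what pairing buys. The sketch below is an illustration rather than part of the original notes: it applies Equation 5 to the per-machine differences, then applies Equations 3 and 4 to the same numbers as if they were two unrelated groups. Variable names are assumptions; the data are taken from the table above.

```python
import math
import statistics

slow = [1.22, 1.63, 2.42, 3.12, 0.76, 4.23, 1.58, 2.81, 2.19, 3.75, 1.66, 3.80]
fast = [1.96, 1.80, 3.01, 3.05, 1.23, 4.89, 1.30, 3.17, 2.94, 3.90, 2.28, 4.40]

# Paired analysis (Equation 5): work with the per-machine differences.
diffs = [f - s for f, s in zip(fast, slow)]
n_d = len(diffs)
x_d = statistics.mean(diffs)
s_d = statistics.stdev(diffs)  # n - 1 divisor, as in Equation 5
t0_paired = x_d / math.sqrt(s_d ** 2 / n_d)

# Unpaired analysis (Equations 3 and 4) on the same data, for comparison.
n_s, n_f = len(slow), len(fast)
var_s, var_f = statistics.variance(slow), statistics.variance(fast)
sp2 = ((n_s - 1) * var_s + (n_f - 1) * var_f) / (n_s + n_f - 2)
mean_gap = abs(statistics.mean(fast) - statistics.mean(slow))
t0_unpaired = mean_gap / math.sqrt(sp2 * (1 / n_s + 1 / n_f))

print(f"XD = {x_d:.4f}, sD = {s_d:.3f}")
print(f"paired t0 = {t0_paired:.2f} (DOF = {n_d - 1})")
print(f"unpaired t0 = {t0_unpaired:.2f} (DOF = {n_s + n_f - 2})")
```

The machine-to-machine scatter inflates the denominator of the unpaired statistic, while the paired statistic only sees the scatter of the differences. For reference, standard two-sided 95% t-table values (not given in these notes) are about 2.20 for 11 degrees of freedom and about 2.07 for 22.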
59
60
Figure 1 - data from Fast and Slow grinding speeds.
[Plot: Mirror Quality (axis 0 to 5) for Slow Speed and Fast Speed]
61
Figure 2 - difference between Fast and Slow data on each machine.
[Plot: Mirror Quality Difference (Fast - Slow) for Each Machine, axis -0.4 to 0.8]
62
Figure 3 - data from Fast and Slow grinding speeds, with one additional data point for each.
[Plot: Mirror Quality (axis 0 to 5) for Slow Speed and Fast Speed; the two added points are labeled “New data”]
Add 2 new data points created on one machine…do they appear to be “reasonable?”
63
Figure 4 - difference between Fast and Slow data on each machine with additional point.
[Plot: axis -0.5 to 4; the added point is labeled “Difference between new data points”]
But they are produced on the same machine, so does the difference appear to be “reasonable?”
64
Pairing – conclusion
Pairing is not always an option (need to be able to identify a single source of noise and then expose one of each group to precisely the same noise).
If it is an option – do it!
– There is no cost other than planning for it.
– It will increase the power of the conclusion
– It is not uncommon to fail to reject the null hypothesis using the t-test alone, but to reject it using pairing analysis because of its increased power.
65
Can the t-test mislead us?
Yes! We can only make statements about probability!
Also, for the t-test to be valid, the data must have a normal distribution.
There are two types of errors that can be made with hypothesis tests (next slide, please)…
66
Hypothesis Errors
Type I
– Erroneously rejecting the null hypothesis
– Its probability is also known as the level of significance (equals α)
Type II
– Erroneously accepting the null hypothesis
– The power of the experiment increases with increased number of data points (less likely to make a type II error)
These are independent, not complementary (think about the previous “apple” example.)
67
Confidence Intervals
Rather than asking the question “are they different” we may want to ask the question “how different are they”
Confidence Intervals help us answer that question
68
Confidence Intervals
What is the likely range of differences between the means of two groups (A-B)?
The interval or range is:
$$CI = (X_A - X_B) \pm t_{\alpha/2,\,DOF}\sqrt{S_p^2\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$ (Eq’n 6)
Where tα/2, DOF is based on the level of confidence, and DOF = nA + nB - 2
69
Confidence Intervals
For our example, for 95% confidence intervals, we have:
DOF = 5 + 5 – 2 = 8 (nG = nR = 5)
t0.05/2,8 = 2.31 (from t-distribution tables)
XR=81.9, XG=80.2, Sp^2 = 1.78 (from previous)
$$CI = (X_R - X_G) \pm t_{\alpha/2,\,DOF}\sqrt{S_p^2\left(\frac{1}{n_R} + \frac{1}{n_G}\right)}$$
= (81.9-80.2) +/- (2.31){1.78(1/5+1/5)}^(1/2) = 1.7 +/- 1.95
70
Results
• The 95% confidence interval is:
1.7 - 1.95 < μR - μG < 1.7 + 1.95
Which is: -0.25 < μR – μG < 3.65
• We are 95% confident that the true difference in means of these two groups lies somewhere within this interval (-0.25 to 3.65).
• Since the interval contains zero, we failed to reject the null hypothesis.
• Would the range increase or decrease for higher levels of confidence?
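The interval arithmetic can be reproduced in a few lines. This is a minimal sketch of Equation 6 using the summary values quoted above. The function name is illustrative, and the critical value 2.31 is taken from the t-table rather than computed.

```python
import math

def confidence_interval(mean_a, mean_b, pooled_var, n_a, n_b, t_crit):
    """Equation 6: (X_A - X_B) +/- t * sqrt(Sp^2 * (1/n_A + 1/n_B))."""
    half_width = t_crit * math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    diff = mean_a - mean_b
    return diff - half_width, diff + half_width

lower, upper = confidence_interval(81.9, 80.2, 1.78, 5, 5, t_crit=2.31)
contains_zero = lower < 0.0 < upper
print(f"{lower:.2f} < muR - muG < {upper:.2f}; contains zero: {contains_zero}")
```

To explore other confidence levels, only `t_crit` changes; the corresponding value has to be looked up for the chosen level and 8 degrees of freedom.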
71
So what?
The confidence interval was determined to be: -0.25 < μR – μG < 3.65 (95% level)
What if we consider a difference of 2ksi or greater to have engineering significance, what next?
What if we consider a difference of 5ksi or greater to have engineering significance, what next?
72
Exercise – confidence interval
Using the worksheet in the back (Exercise 4) and the data already obtained, determine the confidence interval for difference in fatigue life (number of bends until fracture) of two paperclip brands (A and B).
73
Summary for Single Variable Experiment
Randomize run order (remember, systematic errors = bad)
t-test is used to evaluate “is there an effect?”
Pairing can be more powerful – use it if possible
Confidence Interval determines likely range of the difference (μR – μG)
74
What’s next?
We’ve looked at a single variable experiment.
What about more complicated conditions: experiments with two or more variables…
75
Multi-Variable Experiments to follow
76
Overview of Multi-Variable Experiments
“One variable at a time” approach
Interactions (what are they?)
Terminology
Balanced design (what’s this?)
Factorial Experiments
Practice (optimize a “manufacturing” process)
77
Purpose of experiments?
Remember, the purpose of experiments is to answer the question “does changing one or more factors have an effect on the response.”
Our job is to answer that question as well as possible using the limited resources available.
78
Multi-Variables Using “One Variable at a Time” Approach
“One variable at a time”
– Very basic experiment
– Seems intuitive and simple
Reality:
– Difficult to draw meaningful conclusions
– Poor use of resources
– Avoid these types of experiments!
– Example to follow illustrates why
79
One Variable at a Time
Example: Determine optimal conditions for the following machining process:
Factors:
– Tool condition (levels: dull or sharp)
– Cutting depth (levels: 0.005” or 0.010”)
– Cutting speed (levels: 500rpm or 1000rpm)
Response: surface finish
80
One Variable at a Time, Design
Test conditions:
Run 1: “baseline” or “control”: sharp, 0.005”, 500rpm
Run 2: vary the tool condition: dull, 0.005”, 500rpm
Run 3: vary the depth: sharp, 0.010”, 500rpm
Run 4: vary the speed: sharp, 0.005”, 1000rpm
81
One Variable at a Time, Results
Run Tool Depth Speed Results (surface finish)
1 Sharp Low Slow 140rms
2 Dull Low Slow 190rms
3 Sharp Deep Slow 120rms
4 Sharp Low Fast 90rms
82
One Variable at a Time, Conclusion?
Run 4 produced the best result, but…
How much random error was present?
Did systematic error influence the results?
Best set of variables may be:
– Sharp tool? Deep cut? Fast?
Run       Result (rms)
1 (base)  140
2 (dull)  190
3 (deep)  120
4 (fast)  90
83
One Variable at a Time, Conclusion
We have only one data point for conditions of “dull”, “deep”, “fast”, but three data points for “sharp”, “low”, “slow”
There is no way to estimate errors (most conditions were tested only once)
– Without estimating the errors, it is difficult to draw valid conclusions.
Did not test the “best conditions” together
– This is “okay” if there are no interactions
Could conduct another experiment to validate – but wouldn’t it have been better to do a complete job the first time?
84
Interactions?
We’ve identified that there are problems with the “one variable at a time” approach.
Before considering better alternatives, we need to understand “interactions.”
85
What are Interactions?
An interaction is when changing one factor influences how a different factor will affect the response. Clear?
Example: Conduct an experiment to determine which is more effective at keeping your shirt dry in the rain: an umbrella or a raincoat.
Experiment: 2 factors:
– Factor 1, “weather”: rain with wind, rain with no wind
– Factor 2, “tool”: umbrella, raincoat
– Response: “wetness”
86
Interaction Example
No wind: raincoat and umbrella were effective.
Wind: the umbrella was not effective but coat was.
There is an “interaction” between “weather” and “tool”
Run Weather Tool Wetness
1 Wind Umbr 80%
2 Wind Coat 20%
3 No wind Umbr 30%
4 No wind Coat 20%
87
What would no interaction “look like?”
Contrast the previous with the following “no interaction” results…
88
No interaction (“tool” had an effect: you’ll get wet if you use an umbrella)
Run Weather Tool Wetness
1 Wind Umbr 90%
2 Wind Coat 30%
3 No wind Umbr 80%
4 No wind Coat 20%
No interaction (“weather” had an effect: you’ll get wet if it’s windy)
Run Weather Tool Wetness
1 Wind Umbr 90%
2 Wind Coat 80%
3 No wind Umbr 30%
4 No wind Coat 20%
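One way to put a number on an interaction is a difference of differences: compare the umbrella-versus-coat gap in wind with the same gap in no wind. The sketch below applies this to the three umbrella/raincoat result tables. The function and variable names are illustrative assumptions, not part of the original notes.

```python
def interaction(wetness):
    """Umbrella-vs-coat gap in wind minus the same gap in no wind.

    `wetness` maps (weather, tool) to the measured wetness in percent.
    Zero means the lines on an interaction plot are parallel.
    """
    gap_wind = wetness[("wind", "umbrella")] - wetness[("wind", "coat")]
    gap_calm = wetness[("no wind", "umbrella")] - wetness[("no wind", "coat")]
    return gap_wind - gap_calm

def table(wind_umbr, wind_coat, calm_umbr, calm_coat):
    """Build the lookup for one 4-run result table."""
    return {("wind", "umbrella"): wind_umbr, ("wind", "coat"): wind_coat,
            ("no wind", "umbrella"): calm_umbr, ("no wind", "coat"): calm_coat}

with_interaction = table(80, 20, 30, 20)  # first example: umbrella fails in wind
tool_only = table(90, 30, 80, 20)         # only "tool" had an effect
weather_only = table(90, 80, 30, 20)      # only "weather" had an effect

for name, data in [("interaction", with_interaction),
                   ("tool only", tool_only),
                   ("weather only", weather_only)]:
    print(f"{name}: {interaction(data)}")
```

With real data the result will rarely be exactly zero, so deciding whether an interaction is real still requires an estimate of the noise.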
89
Interactions are easier to see by plotting results. Our three different scenarios:
[Three plots of Response (wetness) against weather (No Wind, Wind), each with one line for Coat and one for Umbrella]
– Interaction (response lines not parallel)
– No interaction (response lines parallel)
– No interaction (response lines parallel)
90
Alternative Interaction Plot
Run Weather Tool Interaction Wetness
1 Wind Umbr + 90%
2 Wind Coat - 80%
3 No wind Umbr - 30%
4 No wind Coat + 20%
[Plot: Response (wetness) against weather (No Wind, Wind), with one line for Coat and one for Umbrella]
Plot all 4 interaction values and fit a trend line. If the trend line is flat, there is no interaction. See “exercise” in back for more complete discussion.
[Plot: Response (wetness) against Interaction level (-, +)]
91
Exercise - interactions
Determine if interactions exist in the results shown in the exercise in the back of the booklet (Exercise 5).
92
End of Story for “One Variable at a Time”
The “machining” example above did not evaluate interactions…
…We cannot determine what the best set of conditions is.
And not only are we not confident in the results, we have no idea of how “not confident” we are!
93
What’s next?
We’re almost ready for some really fun stuff…
…but first, some terminology…
…and then, explain what “balanced” designs mean.
Then we can have fun with designing experiments “the right way!”
94
Terminology
Repetition - measuring the same response more than once (or taking another data point) without setting up the experimental conditions again.
– Decreases measurement errors to a limited degree.
Replication - requires completely redoing the experimental conditions. In other words, setting up the conditions as identically as possible to produce another measurement.
– Very important to estimate the experimental error.
– It shows the effects of set-up, and other unknown extraneous variables.
Replication is NOT the same as repetition, although they sound similar.
95
Terminology, cont.
Run - a set of experimental test conditions. All factors are set to specific levels. If I want to measure the boiling point at three pressure levels, I need at least three runs - one with the pressure at each of the 3 levels.
Treatment (design point) - a set of experimental conditions. One treatment is conducted each run, but treatments may be replicated in an experiment (may occur more than once).
96
Repetition or Replication?
Consider the experiment shown in the table below.
How would this experiment be conducted differently if it were to have 2 replicates compared to 2 repetitions?
Run Tool Sharpness
1 Sharp
2 Sharp
3 Dull
4 Dull
97
Designed Experiments (Design of Experiments, DOE’s)
Statistically based methodology of conducting and analyzing experiments
Interactions can be evaluated
Systematic error can be mitigated by randomization
Random error (noise) is mitigated by "balanced" designs since each variable is tested at different levels multiple times.
Let’s explain “balanced” design…
98
Balanced Design - What it really means
Each factor is tested an equal number of times at each level
For each factor setting, all of the other factors are set to each of their levels an equal number of times.
The variation of all the other factors does not bias the results.
Balanced designs do not necessarily test all possible conditions.
Need an example to understand “balanced”…
99
Prior Machining Example
The “One Variable at a Time” example was not a balanced design.
One level of each variable was tested 3 times, the other level was tested only once.
Run Tool Depth Speed
1 Sharp Low Slow
2 Dull Low Slow
3 Sharp Deep Slow
4 Sharp Low Fast
100
Example of Balanced Designs
Consider the “machining experiment”:
3 factors, 2 levels each
Tool: sharp, dull
Depth: deep, low
Speed: fast, slow
To run every possible combination would require 2^f runs where f is the number of factors (f=3, 2^3 = 8).
…but we don’t need all 8 conditions for a balanced design…
Balanced means…well, we need an example…
101
Example of Balanced Designs
This is a balanced design:
Run Tool Depth Speed
1 Dull Low Slow
2 Dull Deep Fast
3 Sharp Deep Slow
4 Sharp Low Fast
For each level of one factor, the other factors are tested an equal number of times at each level.
Ex: For dull tool, depth is low once and deep once, speed is slow once and fast once, et cetera.
102
Contrast with not balanced
Run Tool Depth Speed
1 Dull Deep Slow
2 Dull Deep Fast
3 Sharp Low Slow
4 Sharp Low Fast
NOT BALANCED!
All levels are tested the same number of times as in the previous example (twice), but… if the tool is dull, then the depth is always deep
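The balance condition can be checked mechanically. The sketch below is one possible reading of the definition, not the author's method: for every pair of factors, each combination of levels must occur the same number of times. The function name `is_balanced` and the tuple layout (tool, depth, speed) are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations, product

def is_balanced(runs):
    """True when every pair of factors sees each combination of levels equally often."""
    n_factors = len(runs[0])
    levels = [sorted({run[i] for run in runs}) for i in range(n_factors)]
    for a, b in combinations(range(n_factors), 2):
        counts = Counter((run[a], run[b]) for run in runs)
        per_combination = [counts[pair] for pair in product(levels[a], levels[b])]
        if len(set(per_combination)) != 1:  # some combination is over- or under-tested
            return False
    return True

balanced = [("Dull", "Low", "Slow"), ("Dull", "Deep", "Fast"),
            ("Sharp", "Deep", "Slow"), ("Sharp", "Low", "Fast")]
not_balanced = [("Dull", "Deep", "Slow"), ("Dull", "Deep", "Fast"),
                ("Sharp", "Low", "Slow"), ("Sharp", "Low", "Fast")]
one_at_a_time = [("Sharp", "Low", "Slow"), ("Dull", "Low", "Slow"),
                 ("Sharp", "Deep", "Slow"), ("Sharp", "Low", "Fast")]

for name, design in [("balanced", balanced), ("not balanced", not_balanced),
                     ("one variable at a time", one_at_a_time)]:
    print(f"{name}: {is_balanced(design)}")
```

The second design fails because tool and depth always change together, so their effects cannot be separated. The one-variable-at-a-time design fails because the baseline levels are over-represented.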
103
Balanced:
Run Tool Depth Speed
1 Dull Low Slow
2 Dull Deep Fast
3 Sharp Deep Slow
4 Sharp Low Fast
Not balanced:
Run Tool Depth Speed
1 Dull Deep Slow
2 Dull Deep Fast
3 Sharp Low Slow
4 Sharp Low Fast
104
Exercise – balanced experiments
Complete Exercise 6 in the back of this booklet:
– Create a balanced experiment with two factors at two levels each.
– Assume 8 runs (2^2 = 4 conditions)
– Do not randomize (for this practice)
Factor A: levels: + and –
Factor B: levels: + and –
Notice “+” and “-” are often used in DOE’s to signify a “high” and “low” level. These are called “coded” levels
105
Review
We’ve studied Single Variable experiments (t-test, pairing, Confidence Intervals)
We have an understanding of interactions
We have an understanding of “balanced” design
We are ready to study experiments with multiple factors (factorial experiments)
106
Factorial Experiments – “The Right Way”
We will consider only full factorial experiments (experiments where all possible combinations are tested).
We will limit our discussion to experiments with two levels per factor.
– Non-linear results will not be detected
The total number of possible combinations for experiments with multiple factors, all with two levels, is 2^f, where f is the total number of factors (test variables).
107
2-Level Factorial Design Matrix
Design Point  Factor 1  Factor 2  Factor 3  Factor 4
1             +         +         +         +
2             -         +         +         +
3             +         -         +         +
4             -         -         +         +
5             +         +         -         +
6             -         +         -         +
7             +         -         -         +
8             -         -         -         +
9             +         +         +         -
10            -         +         +         -
11            +         -         +         -
12            -         -         +         -
13            +         +         -         -
14            -         +         -         -
15            +         -         -         -
16            -         -         -         -
[Slide annotations mark the 2^1, 2^2, 2^3 and 2^4 blocks of the matrix]
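The design matrix follows a regular pattern: Factor 1 alternates every row, Factor 2 every two rows, Factor 3 every four, and so on. A minimal sketch that generates the same ordering for any number of two-level factors is shown below; the function name is an illustrative assumption.

```python
def full_factorial(n_factors):
    """All 2**n_factors combinations of coded levels, Factor 1 changing fastest."""
    rows = []
    for point in range(2 ** n_factors):
        # Bit j of the point index sets the level of factor j + 1 (0 -> "+", 1 -> "-").
        rows.append(["+" if (point >> j) & 1 == 0 else "-" for j in range(n_factors)])
    return rows

design = full_factorial(4)
for number, row in enumerate(design, start=1):
    print(number, *row)
```

Calling `full_factorial(2)` gives the four design points used for a two-factor experiment. Remember that this listing is the design matrix, not the run order, which should still be randomized.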
108
Effects?
Remember, experiments answer the question “is there an effect caused by changing factor levels?”
t-test answers this by comparing the difference (effect) to the error (noise):
$$t_0 = \frac{\text{“effect”}}{\text{“noise”}}$$
We can do something similar with multi-variable experiments. Example follows…
Example, 2 Factors
We will use an example to develop our understanding of design and analysis
Design an experiment with:
– Factor 1: Paint color; Levels: Red, Green
– Factor 2: Operator; Levels: Chris, Terry
– Response: Yield strength
Use:
– Full factorial (all combinations tested)
– 3 replicates (each condition tested 3 times)
110
Design
To estimate error we need at least 2 replicates (each condition is tested twice)
– More replicates = better estimate
We decide to have 3 replicates (each condition (design point) tested 3 times)
We need a balanced design
111
Design Matrix
Design Point   Factor 1 (coded)   Factor 2 (coded)   Factor 1 (non-coded)   Factor 2 (non-coded)
1                     +                  +                 Red                   Chris
2                     -                  +                 Green                 Chris
3                     +                  -                 Red                   Terry
4                     -                  -                 Green                 Terry
Factor 1, color: (+) = Red; (-) = Green
Factor 2, operator: (+) = Chris; (-) = Terry
112
Interactions
With this DOE we will be able to analyze the effects of interactions.
Interactions are treated as independent factors in the analysis
Two-way interactions (review): The effect of one factor depends upon the level of another
Ex: an umbrella keeps you dry only if there is no wind, but a raincoat keeps you dry regardless of wind.
113
Design Matrix
Design Point   1   2   1X2
1              +   +    +
2              -   +    -
3              +   -    -
4              -   -    +
The level of interaction between Factors 1 and 2 (1X2) is the “product” of coded levels Factor 1 and 2 { i.e. (+)*(+)=(+); (+)*(-)=(-); (-)*(-)=(+) }
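The “product of coded levels” rule can be written out directly. This is a minimal sketch, assuming the coded levels are stored as +1 and -1; the names `design_points` and `interaction_level` are illustrative only.

```python
# Coded levels (Factor 1, Factor 2) for design points 1 through 4
design_points = [(+1, +1), (-1, +1), (+1, -1), (-1, -1)]

def interaction_level(levels):
    """Multiply the coded levels together to get the interaction level."""
    product_of_levels = 1
    for level in levels:
        product_of_levels *= level
    return product_of_levels

interaction_column = [interaction_level(point) for point in design_points]

for number, (point, inter) in enumerate(zip(design_points, interaction_column), start=1):
    signs = ["+" if value > 0 else "-" for value in (*point, inter)]
    print(number, *signs)
```

The same function works for three-way or higher interactions, since it multiplies however many levels it is given.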
114
The randomized run sheet is on next slide
Includes 3 replicates (each design point, or set of conditions, is tested 3 times)
115
Randomized Run Order
Run   Design Point
 1        4
 2        2
 3        3
 4        1
 5        1
 6        3
 7        4
 8        2
 9        3
10        2
11        4
12        1
The design point defines the test conditions for the run (see previous slides)
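A run sheet like this one can be produced by listing every design point once per replicate and shuffling the list. The sketch below assumes complete randomization over all 12 runs; the function name `randomized_run_order` and the `seed` argument are illustrative choices, and the order it produces will not match the run sheet above.

```python
import random

def randomized_run_order(num_design_points, replicates, seed=None):
    """Return a shuffled list of design point numbers, one entry per run."""
    rng = random.Random(seed)  # a fixed seed makes the run sheet reproducible
    runs = [point
            for point in range(1, num_design_points + 1)
            for _ in range(replicates)]
    rng.shuffle(runs)
    return runs

run_sheet = randomized_run_order(num_design_points=4, replicates=3, seed=2008)
for run_number, design_point in enumerate(run_sheet, start=1):
    print(run_number, design_point)
```

Whatever order results, the design stays balanced: each design point still appears exactly 3 times.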
116
DOE Results
The experiment was conducted following the prescribed randomized run order.
The next slide shows the re-organized results and calculates the means
We will plot the results Then we will step through the analysis…
117
Results
Columns 1, 2, 1X2: main factors (1, 2) and interaction (1X2)
Columns xi1, xi2, xi3: response for the 3 replicates
Column Xi (ave): averages for the 3 replicates
Dsgn Pt   1   2   1X2   xi1   xi2   xi3   Xi (ave)
1         +   +    +     82    84    83     83.0
2         -   +    -     81    85    84     83.3
3         +   -    -     89    90    88     89.0
4         -   -    +     88    91    92     90.3
118
We are concerned with the averages, not the individual data points (they vary due to noise)
Let’s plot the data…graphs are a good way to visualize results…
119
Plot Results of Factor 1
[Figure: average response (vertical axis, 80 to 90) plotted against the level of Factor 1 (Color), “-” and “+”]
Plot the response values (averages) against the factor level (“-”, “+”).
The graph shows that the average response when Factor 1 was “-” compared to “+” is not much different. Changing Factor 1 had little effect.
Dsgn Pt   1   Xi (ave)
1         +     83.0
2         -     83.3
3         +     89.0
4         -     90.3
120
Results for Factor 2
The slope of the trend line between the (-) and (+) levels shows that Factor 2 had a large effect on the response.
[Figure: average response (vertical axis, 80 to 90) plotted against the level of Factor 2 (Operator), “-” and “+”]
Dsgn Pt   2   Xi (ave)
1         +     83.0
2         +     83.3
3         -     89.0
4         -     90.3
121
Results for Interaction (1X2)
[Figure: average response (vertical axis, 80 to 90) plotted against the level of Factor 1X2 (interaction between Factors 1 and 2), “-” and “+”]
Again, a nearly level trend line indicates little effect due to this factor (1X2). In other words, there is little interaction between Factors 1 and 2.
Dsgn Pt   1X2   Xi (ave)
1          +      83.0
2          -      83.3
3          -      89.0
4          +      90.3
122
More Rigor (Statistics!)
The graphs are useful in terms of giving us a qualitative sense of effects.
But as we’ve seen, “averages” are not sufficient.
We need a method to quantify our confidence in the effect.
Where are we going? t-test is where!
123
Remember the “t-test”?
The t-test is used to answer the critical question: is there a statistically significant effect or is the change caused by random noise?
In order to answer that question in any experiment, we must compare the “effect” with the “noise.”
We must determine both “effect” and “noise” We’ll start with “noise” (error).
124
Nomenclature
Let xij be the response of the jth replicate of treatment “i”
Let Xi be the average of all responses within treatment “i” (the mean over its replicates “j”)
Let XT be the average of all responses, total
Let k be the total number of test conditions (design points)
“i” goes from 1 to k.
Let ni be the number of replicates for treatment “i”.
Let N be the total number of tests
125
Sum of Squares (SS)
SStotal = SSwithin + SSbetween
$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-X_T)^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-X_i)^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_i-X_T)^2 \qquad \text{(Equation 7)}$$
SSwithin is due to random noise.
SSbetween is variation attributed to changing the factor levels
If SSbetween is large compared to SSwithin then the treatment had an effect
“Sum of Squares” is a measure of variation (dividing it by its degrees of freedom gives a variance)
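Equation 7 can be checked numerically with the results from the paint color / operator example. The sketch below is only an illustration: the dictionary `responses` holds the three replicates for each design point from the results table, and all variable names are assumptions.

```python
# Replicate responses for design points 1 through 4 (from the results table)
responses = {
    1: [82, 84, 83],
    2: [81, 85, 84],
    3: [89, 90, 88],
    4: [88, 91, 92],
}

all_values = [x for values in responses.values() for x in values]
grand_mean = sum(all_values) / len(all_values)                     # X_T
point_means = {p: sum(v) / len(v) for p, v in responses.items()}   # X_i

# Total variation: every response compared to the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Within: every response compared to the mean of its own design point
ss_within = sum((x - point_means[p]) ** 2
                for p, values in responses.items() for x in values)

# Between: each design point mean compared to the grand mean,
# counted once per replicate
ss_between = sum(len(values) * (point_means[p] - grand_mean) ** 2
                 for p, values in responses.items())

print(f"SStotal   = {ss_total:.1f}")
print(f"SSwithin  = {ss_within:.1f}")
print(f"SSbetween = {ss_between:.1f}")
```

Here SSbetween comes out much larger than SSwithin, which hints that at least one factor had an effect; the t-test that follows identifies which one.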
126
Example showing how to determine SSwithin for design point 1 (SSwithin-1):
SSwithin-1 = (82-83)^2 + (84-83)^2 + (83-83)^2 = 2.0
$$SS_{within\text{-}i} = \sum_{j=1}^{n_i}(x_{ij}-X_i)^2$$
Calculate sum of squares within each treatment (design point) and include in the table.
This is the first step to determine the “noise.”
Dsgn Pt   1   2   1X2   xi1   xi2   xi3   Xi (ave)   SS
1         +   +    +     82    84    83     83.0     2.0
2         -   +    -     81    85    84     83.3     8.7
3         +   -    -     89    90    88     89.0     2.0
4         -   -    +     88    91    92     90.3     8.7
127
Sum of Squares
Determine sum of squares for each design point (following example on previous slide):
$$SS_{within\text{-}i} = \sum_{j=1}^{n_i}(x_{ij}-X_i)^2$$
“i” goes from 1 to 4 (design points) and ni = 3 (number of replicates for the ith design point)
Then add them up over all design points:
$$SS_{within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-X_i)^2$$
k = 4 (design pts)
Dsgn Pt   mean   1   2   1X2   Xi (ave)   SS
1          +     +   +    +      83.0     2.0
2          +     -   +    -      83.3     8.7
3          +     +   -    -      89.0     2.0
4          +     -   -    +      90.3     8.7
                                 Total:   21.3
128
Experimental “noise”
SSwithin (calculated above) is related to the experimental “noise”, but it is not what we use in the t-test.
What do we use? Next slide please…
129
“Noise” (error) for the experiment
The mean square error is given as:
mse^2 = SSwithin/(N - 2^f); N = total number of data points (12), f = number of factors (2);
mse^2 = 21.3/(12 - 2^2) = 2.7 (SSwithin = 21.3)
The “standard error” is:
$$\text{standard error} = \sqrt{\frac{4\,mse^2}{N}}$$
For our example:
$$\text{standard error} = \sqrt{\frac{4 \times 2.7}{12}} = 0.9$$
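The “noise” calculation can be scripted from the raw replicate data. This is a sketch under the same assumptions as the slides (balanced design, 2 factors at 2 levels); the variable names are illustrative.

```python
import math

# Replicate responses for design points 1 through 4 (from the results table)
responses = {
    1: [82, 84, 83],
    2: [81, 85, 84],
    3: [89, 90, 88],
    4: [88, 91, 92],
}
num_factors = 2  # f

def sum_of_squares_within(values):
    """Sum of squared deviations of the replicates from their own mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

ss_within = sum(sum_of_squares_within(v) for v in responses.values())
n_total = sum(len(v) for v in responses.values())    # N = 12
degrees_of_freedom = n_total - 2 ** num_factors      # N - 2^f = 8

mse_squared = ss_within / degrees_of_freedom         # mean square error
standard_error = math.sqrt(4 * mse_squared / n_total)

print(f"SSwithin = {ss_within:.1f}")
print(f"mse^2 = {mse_squared:.1f}")
print(f"standard error = {standard_error:.1f}")
```

The script carries full precision through each step, so its intermediate values can differ slightly from hand calculations that round along the way.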
130
“Noise” (error) for the experiment
The standard error (just calculated) is the same for all factors in the experiment – it is the “experimental noise.”
Remember, t0 is the ratio between “effect” and “noise”
t0 = effect / standard error
131
Effect
We’ve determined the standard error
But what was the effect of various factor levels?
Determine the average response for each factor at each level:
Determine the average response when factor 1 was (+) and also when it was (-), then do this for factor 2, etc.
This procedure requires a balanced design
132
Determine the Effect, Step 1
Factor 1 was (+) for design points 1 and 3: 83.0 + 89.0 = 172.0
Factor 1 was (-) for design points 2 and 4: 83.3 + 90.3 = 173.6
Notice the slight rounding differences between the table and the hand calculations
Dsgn Pt   1       2       1X2     Xi (ave)   SS
1         +       +        +        83.0     2.0
2         -       +        -        83.3     8.7
3         +       -        -        89.0     2.0
4         -       -        +        90.3     8.7
                                    Total:   21.3
sum (+)   172.0   166.3   173.3
sum (-)   173.7   179.3   172.3
133
Determining effect: Step 2 and Step 3
The “effect” is the difference in averaged responses for (+) and (-) levels. For Factor 1:
Effect = ABS{sum(+) – sum(-)} / n+ = ABS{172.0 – 173.6} / 2 = 0.8
n+ = number of (+) data points (2 in this example)
Remember, average does not tell the whole story!
t-test to the rescue!
Dsgn Pt      1       2       1X2     Xi (ave)   SS
1            +       +        +        83.0     2.0
2            -       +        -        83.3     8.7
3            +       -        -        89.0     2.0
4            -       -        +        90.3     8.7
                                       Total:   21.3
sum (+)      172.0   166.3   173.3
sum (-)      173.7   179.3   172.3
difference   -1.7    -13.0   1.0
Effect       0.8     6.5     0.5
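The effect calculation (sum the averages at each level, take the difference, divide by n+) can be sketched as follows. The coded columns are written as +1/-1, and the names `columns` and `effects` are assumptions for illustration.

```python
# Replicate responses for design points 1 through 4 (from the results table)
responses = {
    1: [82, 84, 83],
    2: [81, 85, 84],
    3: [89, 90, 88],
    4: [88, 91, 92],
}

# Coded level of each column at each design point
columns = {
    "1":   {1: +1, 2: -1, 3: +1, 4: -1},
    "2":   {1: +1, 2: +1, 3: -1, 4: -1},
    "1X2": {1: +1, 2: -1, 3: -1, 4: +1},
}

point_means = {p: sum(v) / len(v) for p, v in responses.items()}

effects = {}
for name, levels in columns.items():
    sum_plus = sum(point_means[p] for p, level in levels.items() if level > 0)
    sum_minus = sum(point_means[p] for p, level in levels.items() if level < 0)
    n_plus = sum(1 for level in levels.values() if level > 0)
    # Effect = ABS{sum(+) - sum(-)} / n+
    effects[name] = abs(sum_plus - sum_minus) / n_plus
    print(f"{name}: sum(+) = {sum_plus:.1f}, sum(-) = {sum_minus:.1f}, "
          f"effect = {effects[name]:.1f}")
```

Because the design is balanced, every column has the same number of (+) and (-) design points, which is what makes this simple averaging valid.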
134
t-test Review
t0 is the ratio of the “effect” to the “noise”. The larger t0 is, the stronger the evidence that the factor had a real effect.
In the previous table we calculated the effect of all three factors (1, 2, 1X2). The “effect” is the difference in averaged responses for (+) and (-) levels.
The noise has already been determined for our example (“standard error”).
135
t-test
t0 is equal to Effect / Standard error:
$$t_0 = \frac{\left|\mathrm{sum}(+) - \mathrm{sum}(-)\right| / n_+}{\sqrt{4\,mse^2 / N}}$$
Calculate t0 for each factor (including interactions)
136
Example calculations…
$$t_0 = \frac{\left|\mathrm{sum}(+) - \mathrm{sum}(-)\right| / n_+}{\sqrt{4\,mse^2 / N}}$$
For this experiment we’ve calculated the standard error {(4 mse^2/N)^(1/2)} to be 0.9.
For Factor 1, the effect is ABS{sum(+) - sum(-)} / n+ = ABS{172.0 - 173.6} / 2 = 0.8
Also for Factor 1, t0 = 0.8/0.9 = 0.9
Determine t0 for all factors and interactions
137
t-distribution
We also need to determine the value from the t-distribution table:
Degrees of freedom = N – 2^f
N = total number of observations (12)
f = number of factors (2)
DOF = 12 – 2^2 = 8
For 95% confidence, from a t-distribution table: t(α/2, DOF) = t(0.05/2, 8) = 2.31
This is the same for all factors. Enter in the table…
138
Results
$$t_0 = \frac{\text{Effect}}{\text{Std Error}}$$
Dsgn Pt       1       2       1X2     Xi (ave)   SS
1             +       +        +        83.0     2.0
2             -       +        -        83.3     8.7
3             +       -        -        89.0     2.0
4             -       -        +        90.3     8.7
                                        Total:   21.3
sum (+)       172.0   166.3   173.3
sum (-)       173.7   179.3   172.3
difference    -1.7    -13.0   1.0
Effect        0.8     6.5     0.5
Std Error     0.9     0.9     0.9
t0            0.9     6.9     0.5
t0.05/2, 8    2.31    2.31    2.31
Effect?       no      yes     no
139
Experiment Conclusion
The key points from the table:
                                Factor 1   Factor 2   Factor 1X2
t0                                0.9        6.9        0.5
t0.05/2, 8                        2.31       2.31       2.31
Effect? (is t0 > t0.05/2, 8?)     no         yes        no
Only Factor 2 had a statistically significant effect.
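The whole analysis, from raw data to the yes/no decision, fits in a short script. This sketch follows the steps in the slides; the critical value 2.31 is taken from the t-distribution table quoted earlier rather than computed, and all names are illustrative assumptions.

```python
import math

responses = {1: [82, 84, 83], 2: [81, 85, 84], 3: [89, 90, 88], 4: [88, 91, 92]}
columns = {
    "1":   {1: +1, 2: -1, 3: +1, 4: -1},
    "2":   {1: +1, 2: +1, 3: -1, 4: -1},
    "1X2": {1: +1, 2: -1, 3: -1, 4: +1},
}
num_factors = 2
T_CRITICAL = 2.31  # t(0.05/2, 8) from a t-distribution table, 95% confidence

point_means = {p: sum(v) / len(v) for p, v in responses.items()}
n_total = sum(len(v) for v in responses.values())

# Noise: standard error = sqrt(4 * mse^2 / N), mse^2 = SSwithin / (N - 2^f)
ss_within = sum((x - point_means[p]) ** 2
                for p, values in responses.items() for x in values)
mse_squared = ss_within / (n_total - 2 ** num_factors)
standard_error = math.sqrt(4 * mse_squared / n_total)

t0 = {}
significant = {}
for name, levels in columns.items():
    sum_plus = sum(point_means[p] for p, level in levels.items() if level > 0)
    sum_minus = sum(point_means[p] for p, level in levels.items() if level < 0)
    n_plus = sum(1 for level in levels.values() if level > 0)
    effect = abs(sum_plus - sum_minus) / n_plus
    t0[name] = effect / standard_error
    significant[name] = t0[name] > T_CRITICAL
    print(f"Factor {name}: t0 = {t0[name]:.1f}, "
          f"effect? {'yes' if significant[name] else 'no'}")
```

For a different number of observations or confidence level, `T_CRITICAL` would have to be looked up again for the new degrees of freedom.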
140
Factorial Experiment, Conclusion
The above example shows the basics of analyzing a factorial experiment.
DOE software will perform the analysis for you.
Usually, an F-test is performed rather than a t-test, but the concept is the same (they are equivalent: for a two-level factor, F = t0^2).
We will do an experiment for practice. First, let’s talk about other aspects of an experiment…
141
Planning an Experiment
Okay, we now have some idea about DOE
“Design” is only a small part of the picture
To conduct an experiment properly, much more is required. This typically includes most of the following…
142
Experiment Process
Define the problem (write down a problem statement), define the objective (purpose).
Determine available resources
Determine factors, levels, and response(s)
Create the design (number of runs, run order, etc.)
Obtain resources: $$$, measurement and test equipment, test specimens (have spares), personnel, etc.
Create a plan: determine schedule for personnel, equipment, etc.; create a run-sheet.
Save all used specimens, identify them. You may need to take another closer look at them later.
143
Practice Experiment
Problem: we like “good” popcorn – and we currently can’t make good popcorn.
Create an experiment to help solve this problem.
As a class, complete the next slide
144
Practice
As a class:
Write a problem statement
Write a clear objective for the experiment
Determine Factors – only two – think carefully about what you select.
Determine factor levels (do not worry that some combination of factor levels will produce bad popcorn – this is to be expected)
Determine response (may be more than 1).
Next slide please…
145
Practice
Factors you may have considered:
Time
Power setting
Placement of bag within the microwave
Orientation of the bag
Brands of popcorn
Different microwaves
Are “time” and “power setting” independent? Could they be combined and called “energy input”?
Resources are limited – we want the most useful information possible.
146
Exercise – full factorial experiment
Break into smaller groups (about 5 per group) and design and conduct the experiment and analyze the results. Use the worksheet in the back of this booklet (Exercise 7).
After completion, discuss as a class.
147
One last thing
“Outliers” may be an issue in an experiment. Unfortunately, if only 1 or 2 data points are observed for a given set of experimental conditions, it is not possible to determine if an outlier exists. Even more detrimental, with few data points a single outlier can dramatically change the sample mean!
What to do? Always do a “reality check” – do the results seem reasonable? If not, it may be due to an outlier – OR it may not be an error in any form (your judgment may be off – no shame in being surprised by results).
Be careful about dismissing what you think is an outlier – it may not be!
148
Limitations to what we’ve done
We considered only experiments with:
Assumed normal distributions
All factors at 2 levels each
These factors can be discrete or continuous
Response must be continuous, not discrete
At least 2 replicates
We did not look at “censored” data (such as fatigue data that is terminated after so many cycles even if there was no failure)
All Design Points were replicated an equal number of times
Full factorial (all possible combinations were tested)
Next slide please…
149
Advanced Stuff – But Not Here, Not Now
There are more advanced concepts (and surprisingly, these are not necessarily much more complicated to design or analyze.)
150
Life beyond this course
Experiments do not actually require having a second replicate to estimate errors (talk to a statistician)
Fractionated experiments – experiments in which not all possible combinations are tested.
These are very beneficial if there is a large number of factors (2^f gets big fast!). The “cost” is lost knowledge regarding interactions.
Experiments can model non-linearity if more than 2 levels per factor are included
151
CONCLUSIONS
Experiments require planning!
Randomize to mitigate systematic errors
No pain, no gain
Select factors and their levels carefully
May want to “try out” levels (pre-experiment) before beginning a DOE
t-test helps answer “is there an effect”
Pairing is a good thing – if possible
Full factorial designs are effective and efficient for multi-variable experiments
152
Happy Experimenting!