input modeling for simulation
TRANSCRIPT
-
7/28/2019 Input Modeling for Simulation
1/48
1
Input Modeling for Simulation
IEEM 313, Spring 2005
L. Jeff Hong
Dept of IEEM, HKUST
-
7/28/2019 Input Modeling for Simulation
2/48
2
Input Models
Input models represent the uncertainty in a stochastic
simulation. Random variates are generated based on the input models Simulation outputs are determined by the input models
The quality of the output is no better than the quality ofinputs.
Garbage In, Garbage Out! The quality of input models depends on the raw data, which
is often collected by people who dont know simulation
-
7/28/2019 Input Modeling for Simulation
3/48
3
Input Models
The fundamental requirements for an input model
are: It must be capable of representing the physical realities of
the process.
It must be easily tuned to the situation at hand. It must be amenable to random-variate generation.
There is no true model for any stochastic input. The
best that we can hope is to obtain an approximationthat yields useful results.
-
7/28/2019 Input Modeling for Simulation
4/48
4
Input Models
A key distinction in input modeling problems is the
presence or absence of data: When we have data, then we fita model to the data. Good
software is available for this.
When no data are available then we have to creatively usewhat we can get to construct an input model.
-
7/28/2019 Input Modeling for Simulation
5/48
5
Outline
Input modeling with data
Data collection Select candidate distributions
Fitting and checking
Arena Input Analyzer
Input modeling without data
Sources of information Incorporating expert opinion
-
7/28/2019 Input Modeling for Simulation
6/48
6
Input Modeling with Data
1. Collect data.
2. Select one or more candidate distributions, based onphysical characteristics of the process and graphicalexamination of the data.
3. Fit the distribution to the data (determine values for itsunknown parameters).
4. Check the fit to the data via tests and graphical analysis.
5. If the distribution does not fit, select another candidate andgo to 3, or use an empirical distribution.
-
7/28/2019 Input Modeling for Simulation
7/48
7
Suggestions on Data Collection
Plan ahead: begin by a practice or pre-observing session,watch for unusual circumstances
Analyze the data as it is being collected: check adequacy
Combine homogeneous data sets, e.g. successive timeperiods, during the same time period on successive days
Be aware of data censoring: the quantity is not observed in its
entirety, danger of leaving out extreme values
Collect input data, not performance data
-
7/28/2019 Input Modeling for Simulation
8/48
8
Selecting Distributions: Histogram
A frequency distribution or histogram isuseful in determining the shape of adistribution
The number of class intervals depends on:
The number of observations The dispersion of the data
Suggested: the square root of the sample size
If few data points are available: combineadjacent cells to eliminate the raggedappearance of the histogram
Same datawith differentinterval sizes
-
7/28/2019 Input Modeling for Simulation
9/48
9
Selecting Distributions: Histogram
Vehicle Arrival Example: # of vehicles arriving at an intersectionbetween 7-7:05 am was monitored for100random workdays.
There are ample data, so the histogram may have a cell foreach possible value in the data range
Arrivals per
Period Frequency0 12
1 10
2 19
3 17
4 105 8
6 7
7 5
8 5
9 3
10 3
11 1
Histogram of # of Arrivals during 7-7:05am
0
5
10
15
20
0 1 2 3 4 5 6 7 8 9 10 11
# of Arrivals
Frequ
ency
-
7/28/2019 Input Modeling for Simulation
10/48
10
Selecting Distributions: Physical Basis
Most probability distributions were invented to
represent a particular physical situation. If we know the physical basis for a distribution, then
we can match it to the situation we have to model
A number of examples follow
-
7/28/2019 Input Modeling for Simulation
11/48
11
binomial: Models the number of successes in n trials, when thetrials are independent with common success probability,p.Example: the number of defective components found in a lot of ncomponents.
negative binomial: Models the number of trials required toachieve ksuccesses. Example: the number of components thatwe must inspect to find 4 defective components.
Poisson: Models the number of independent events that occur in afixed amount of time or space. Ex: number of customers that arriveto a store during 1 hour, or number of defects found in 30 squaremeters of sheet metal.
normal: Models the distribution of a process that can be thought ofas the sum of a number of component processes. Ex: the time toassemble a product which is the sum of the times required for eachassembly operation.
-
7/28/2019 Input Modeling for Simulation
12/48
12
lognormal: Models the distribution of a process that can be thought ofas theproductof a number of component processes. Example: therate of return on an investment, when interest is compounded, is the
product of the returns for a number of periods. Also widely used to
model stock prices.
exponential: Models the time between independent events, or aprocess time which is memoryless. Example: the time to failure for a
system that has constant failure rate over time. Note: if the timebetween events is exponential, then the number of events is Poisson.
Erlang: The sum ofkidentical exponential random variables. A
special case of the gamma...
gamma: An extremely flexible distribution used to model nonnegativerandom variables.
-
7/28/2019 Input Modeling for Simulation
13/48
13
beta: An extremely flexible distribution used to model bounded
(fixed upper and lower limits) random variables.
Weibull: Models the time to failure (minimum of a number ofpossible causes); can model increasing or decreasing failure rate
hazard. Ex: the time to failure for a disk drive. discrete or continuous uniform: Models complete uncertainty,
since all outcomes are equally likely.
triangular: Models a process when only the minimum, most likelyand maximum values of the distribution are known. Ex: theminimum, most likely and maximum inflation rate we will have this
year.
empirical: Reuses the data themselves by making each observedvalue equally likely. Can be interpolated to obtain a continuousdistribution.
-
7/28/2019 Input Modeling for Simulation
14/48
14
Fitting
Determine the unknown parameters for the distribution. Forexample, if you believe the data is from normal distribution,then you need to determine the mean and variance of thedistribution.
Common methods for fitting distributions are maximumlikelihood, method of moments, and least squares.
While the method matters, the variability in the data often
overwhelms the differences in the estimators (see Section 9.3).
Remember: There is no true distribution just waiting to befound!
-
7/28/2019 Input Modeling for Simulation
15/48
-
7/28/2019 Input Modeling for Simulation
16/48
16
Goodness-of-fit Tests
In the test...
H0: the chosen distribution fits the dataH1: the chosen distribution does not fit the data
Thep-value of a test is the Type I error level
(significance) at which we wouldjust reject H0 for thegiven data. If the level is less thanp-value, we donot reject H0; otherwise, we reject H0.
Thus, a large (> 0.10)p-value supports H0 that thedistribution fits.
-
7/28/2019 Input Modeling for Simulation
17/48
17
Goodness-of-fit Tests
If there are little data, the goodness-of-fit test is likely
to accept any distributions. Why?
If there are lots of data, the goodness-of-fit test islikely to reject any distributions. Why?
-
7/28/2019 Input Modeling for Simulation
18/48
18
Chi-squared Test
A histogram-based of test
X
-
7/28/2019 Input Modeling for Simulation
19/48
19
Chi-square Test
Arrange the n observations into a kcells, the test statistics is:
which approximately follows the chi-square distribution with k-s-1 degrees
of freedom, where s = # of parameters of the hypothesized distributionestimated by the sample statistics.
Valid only for large sample size
Each cell has at least 5 observations for both Oiand Ei
Result of the test depends on grouping of the data
==
k
i i
ii
EEO
1
2
20
)(
-
7/28/2019 Input Modeling for Simulation
20/48
20
Chi-squared Test Vehicle arrival example (page 9), sample mean 3.64
H0: Data are Poisson distributed with mean 3.64
H1: Data are not Poisson distributed with mean 3.64
Degree of freedom is k-s-1 = 7-1-1 = 5and so thep-value is
0.00004. What is your conclusion?
!
)(
x
en
xnpE
x
i
=
=xi Observed Frequency, Oi Expected Frequency, Ei (Oi - Ei)2/Ei0 12 2.61 10 9.62 19 17.4 0.15
3 17 21.1 0.84 19 19.2 4.415 6 14.0 2.576 7 8.5 0.267 5 4.48 5 2.09 3 0.8
10 3 0.3> 11 1 0.1
100 100.0 27.68
7.87
11.62
Combined becauseof min Ei
-
7/28/2019 Input Modeling for Simulation
21/48
21
Kolmogorov-Smirnov Test
X
-
7/28/2019 Input Modeling for Simulation
22/48
22
Kolmogorov-Smirnov Test
Empirical Distribution
If we have n observations x1,x2,,xn, thenSn(x) = (number of x1,x2,,xn that are x) / n
0
1
x1 x2 x3 x4
D = max| F(x) - Sn(x)|
-
7/28/2019 Input Modeling for Simulation
23/48
23
Graphic Analysis
Graphic analysis includes: histogram with fitted
distribution and q-q plot. Goodness-of-fit tests represent lack of fit by a
summary statistic, while plots show where the lack offit occurs and whether it is important.
Goodness-of-fit tests may accept the fit, but the plots
may suggest the opposite, especially when thenumber of observations is small.
-
7/28/2019 Input Modeling for Simulation
24/48
24
Graphic Analysis
A data set of 30 observations isbelieved to be from a normaldistribution. The following are thep-values from chi-square test andK-S test:
Chi-square test: 0.166K-S test: >0.15
What is your conclusion?
-
7/28/2019 Input Modeling for Simulation
25/48
25
Arena Input Analyzer
A tool in Arena for fitting distributions to data.
Fits beta, empirical, Erlang, exponential, gamma,Johnson, lognormal, normal, Poisson, triangular,uniform and Weibull.
Reportsp-values for2
and K-S tests and provides ahistogram plot.
-
7/28/2019 Input Modeling for Simulation
26/48
-
7/28/2019 Input Modeling for Simulation
27/48
27
Text Report
Distribution Summary
Distribution: Lognormal
Expression: 3 + LOGN(9.22, 7.13)
Square Error: 0.004997
Chi Square Test
Number of intervals = 8
Degrees of freedom = 5
Test Statistic = 9.13
Corresponding p-value = 0.106
Kolmogorov-Smirnov Test
Test Statistic = 0.0551
Corresponding p-value > 0.15
Data Summary
Number of Data Points = 200
Min Data Value = 3.38
Max Data Value = 36.1
Sample Mean = 12
Sample Std Dev = 5.91
Histogram Summary
Histogram Range = 3 to 37
Number of Intervals = 14
-
7/28/2019 Input Modeling for Simulation
28/48
28
Usage Notes
The Fit All option tries all relevant distributions and picks theone with the smallest squared error. It can easily be fooled!
Be sure to try different numbers of histogram cells; it affectsthep-value of the 2 test, and your perception of the fit.
Since EXPO is a special case of ERLA which is a specialcase of GAMMA, Fit All rarely selects EXPO or ERLA.Similarly, EXPO is a special case of WEIB.
Raw data can be read in from text files (looks for .dst), onevalue per line.
-
7/28/2019 Input Modeling for Simulation
29/48
29
Vehicle Arrival Data (Fit All)
Distribution Summary
Distribution: WeibullExpression: -0.5 + WEIB(4.59, 1.51)
Square Error: 0.006067
Chi Square TestNumber of intervals = 6Degrees of freedom = 3
Test Statistic = 2.97Corresponding p-value = 0.414
Data Summary
Number of Data Points = 100Min Data Value = 0
Max Data Value = 11Sample Mean = 3.64Sample Std Dev = 2.76
Histogram Summary
Histogram Range = -0.5 to 11.5Number of Intervals = 12
-
7/28/2019 Input Modeling for Simulation
30/48
30
Vehicle Arrival Data (Fit Poisson)
Distribution Summary
Distribution: PoissonExpression: POIS(3.64)Square Error: 0.025236
Chi Square TestNumber of intervals = 6Degrees of freedom = 4Test Statistic = 19.8
Corresponding p-value < 0.005
Data Summary
Number of Data Points = 100Min Data Value = 0Max Data Value =11
Sample Mean = 3.64Sample Std Dev = 2.76
Histogram Summary
Histogram Range = -0.001 to 11Number of Intervals = 12
-
7/28/2019 Input Modeling for Simulation
31/48
31
q-q Plot
Recall that one way to generate data from cdfFis via
The q-q plot displays the sorteddata
vs.
)(1
RFY
=
nYYY L21
nnF
nF
nF 2/1,,2/3,2/1 111 K
-
7/28/2019 Input Modeling for Simulation
32/48
32
q-q Plot Intuition
We have a sample Y1, Y2,,Yn and we fit a distributionFthat we hope is good.
If we now generate a random sample of size n from Fit should look about like Y
1, Y
2,,Y
n.
The q-q plot generates aperfectrandom sample forcomparison.
-
7/28/2019 Input Modeling for Simulation
33/48
33
Features of the q-q Plot
It does not depend on how the data are grouped.
It is much better than a histogram when the numberof data points is small.
Deviations from a straight line show where thedistribution does not match.
A straight line implies the familyof distributions iscorrect; a 45o line implies correctparameters.
-
7/28/2019 Input Modeling for Simulation
34/48
34
Examples of q-q plot
10
15
20
25
30
10 15 20 25 30
The fitting is
pretty good!
5
10
15
20
25
30
35
40
45
50
5 10 15 20 25 30 35 40 45 50
The distributionfamily seems OKbut the parameters
are not
-
7/28/2019 Input Modeling for Simulation
35/48
35
Example of q-q plot
Poor fit! Miss badly
in the left tail
A data set of 30 observations
is believed to be from anormal distribution. Thefollowing are thep-valuesfrom chi-square test and K-S
test:
Chi-square test: 0.166
K-S test: >0.15
10
15
20
25
30
35
10 15 20 25 30 35
-
7/28/2019 Input Modeling for Simulation
36/48
36
Using the Data Itself
When might we want to use the data itself?
When no standard distribution fits well
When we have no justification for a standard distribution
When there is too little data to distinguish between
standard distributions
Empirical distribution
Each data point is equally likely to be resampled.
In Arena: DISCRETE(1/n, X1, 2/n, X2,, 1, Xn)
Example: Fit data 2.1, 5.7, 3.4, 8.1 (min) DISCRETE(.25, 2.1, .5, 3.4, .75, 5.7, 1, 8.1)
-
7/28/2019 Input Modeling for Simulation
37/48
37
Pros and Cons
As the sample size n goes to infinity, the empiricaldistribution converges to the truth.
No assumed distribution need be selected.
Only the values we saw can appear again andnothing in the gaps.
Empirical distribution has no tail. It is easy to missextreme values
-
7/28/2019 Input Modeling for Simulation
38/48
38
Interpolated Empirical To fill in gaps, we interpolate between the sorteddata points.
In Arena:
CONT(0, 2.1, .33, 3.4, .67, 5.7, 1, 8.1) CONT(0, X1, 1/(n-1), X2, 2/(n-1), X3,, 1, Xn)
Interpolated Empirical cdf
0
0.33
0.67
1
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10
X
cumulativeprobability
-
7/28/2019 Input Modeling for Simulation
39/48
39
Why not always use empirical?
Tail (extreme) behavior is often not well represented(especially when we have a small sample).
It is difficult to change the data to represent a shifted
mean or reduced variance.
A fitted distribution smoothes out random sample
oddities.
I M d li i h D
-
7/28/2019 Input Modeling for Simulation
40/48
40
Input Modeling without Data
We have to use anything we can find...
Engineering data, standards and ratings can providecentral values
Expert opinion
Physical or conventional limitations can provide bounds Physical basis of the process can suggest appropriate
distribution families
B k i t M th d
-
7/28/2019 Input Modeling for Simulation
41/48
41
Breakpoints Method
Useful for modeling quantities with a large number of possibleoutcomes, like quarterly sales volume or aggregate number of
overtime hours. Minimum data needed: smallest and largest possible values.
Ex: sales of XYZ-123 will be no less than 1000 units, but no more than5000 units.
UNIF(1000,5000)
Comments:
This is typically a poor model since the probability is spread outevenly from low to high. Thus, the extremes are just as likely asthe middle values.
However, if you want to be conservative (maximum uncertainty) oryou cannot justify any additional information, then this model maybe reasonable.
B k i t M th d
-
7/28/2019 Input Modeling for Simulation
42/48
42
Breakpoints Method
Better data: smallest and largest possible values,and most likely value.
Ex: sales of XYZ-123 will be no less than 1000 units, nomore than 5000 units, and is most likely to be 3500 units.
Triangular distribution, TRIA(1000,3500,5000)Triang(1000, 3500, 5000)
Valuesx
10^-4
Values in Thousands
0
1
2
3
4
5
6
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
5.5
100.0%
0.5000 5.5000
@RISK Student VersionFor Academic Use Only
Uniform(1000, 5000)
Valuesx
10^-4
Values in Thousands
0.0
0.5
1.0
1.5
2.0
2.5
3.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
5.5
100.0%
0.5000 5.5000
@RISK Student VersionFor Academic Use Only
B k i t M th d
-
7/28/2019 Input Modeling for Simulation
43/48
43
Breakpoints Method
Best data: smallest and largest possible values plus 1-3 breakpoints(values and a percentage chance of being less than that value). Ex: sales of XYZ-123 will be between 1000 and 5000, and
sales chance of being sales2000 25%3500 75%4500 99%
In Arena: CONT(0, 1000, .25, 2000, .75, 3500, .99, 4500, 1, 5000)
Comments: Use only as many breakpoints as you can confidently get. Three is usually the
maximum if no data are available. Try to get breakpoints near the extremes if possible, since the extremes are often
not realistic.
Sometimes it is easier to get people to give the chance of exceeding a value,
rather than being less than a value.
M & V i bilit M th d
-
7/28/2019 Input Modeling for Simulation
44/48
44
Mean & Variability Method
Useful for modeling quantities with a large number ofpossible outcomes, like quarterly sales volume oraggregate number of overtime hours.
Also useful for modeling the variability in percentage
changes.
M & V i bilit M th d
-
7/28/2019 Input Modeling for Simulation
45/48
45
Mean & Variability Method
Minimum data: mean value and an averagepercentage variation around that mean.
Example: Last year we sold 10,000 units of ABC-000.This year we expect a 15% increase, with a typical swing
of 5% above or below that value. NORM(11500, 0.05*11500)
Comments:
May need to round this value.
If the % swing > 33%, then negative values are possible.
Mean & Variabilit Method
-
7/28/2019 Input Modeling for Simulation
46/48
46
Mean & Variability Method
Better data: mean value, an average
percentage variation around that mean andupper and lower limits.
Ex: Last year we sold 10,000 units of ABC-000. This year
we expect a 15% increase, with a typical swing of 5%above or below that value. But we wont sell less than8000 units, or more than 15,000 under any conditions
MN(15000, MX(NORM(11500, 0.05*11500),8000))
Discrete Outcomes
-
7/28/2019 Input Modeling for Simulation
47/48
47
Discrete Outcomes
Used to model discrete events, like making or notmaking a sale, or whether a new product is ready toship in the first, second or third quarter.
Data: A percentage chance of each possible
outcome so that the total is 100%.
Discrete Outcomes
-
7/28/2019 Input Modeling for Simulation
48/48
48
Discrete Outcomes
Example: We have an 80% chance of landing the contractwith Big Corp. If we land the contract, then there is only a
10% chance that the initial orders will arrive first quarter; a50% chance they will start the second quarter; and a 40%chance they will start in quarter 3. Sales will be $250K in
each quarter we produce. How to model the sales for the yearrelated to Big Corp?