input modeling for simulation

7/28/2019 Input Modeling for Simulation

1/48

1

Input Modeling for Simulation

IEEM 313, Spring 2005

L. Jeff Hong

Dept of IEEM, HKUST


2/48

2

Input Models

Input models represent the uncertainty in a stochastic

simulation. Random variates are generated based on the input models Simulation outputs are determined by the input models

The quality of the output is no better than the quality ofinputs.

Garbage In, Garbage Out! The quality of input models depends on the raw data, which

is often collected by people who dont know simulation


3/48

3

Input Models

The fundamental requirements for an input model

are: It must be capable of representing the physical realities of

the process.

It must be easily tuned to the situation at hand. It must be amenable to random-variate generation.

There is no true model for any stochastic input. The

best that we can hope is to obtain an approximationthat yields useful results.


4/48

4

Input Models

A key distinction in input modeling problems is the

presence or absence of data: When we have data, then we fita model to the data. Good

software is available for this.

When no data are available then we have to creatively usewhat we can get to construct an input model.


5/48

5

Outline

Input modeling with data

Data collection Select candidate distributions

Fitting and checking

Arena Input Analyzer

Input modeling without data

Sources of information Incorporating expert opinion


6/48

6

Input Modeling with Data

1. Collect data.

2. Select one or more candidate distributions, based onphysical characteristics of the process and graphicalexamination of the data.

3. Fit the distribution to the data (determine values for itsunknown parameters).

4. Check the fit to the data via tests and graphical analysis.

5. If the distribution does not fit, select another candidate andgo to 3, or use an empirical distribution.


7/48

7

Suggestions on Data Collection

Plan ahead: begin by a practice or pre-observing session,watch for unusual circumstances

Analyze the data as it is being collected: check adequacy

Combine homogeneous data sets, e.g. successive timeperiods, during the same time period on successive days

Be aware of data censoring: the quantity is not observed in its

entirety, danger of leaving out extreme values

Collect input data, not performance data


8/48

8

Selecting Distributions: Histogram

A frequency distribution or histogram isuseful in determining the shape of adistribution

The number of class intervals depends on:

The number of observations The dispersion of the data

Suggested: the square root of the sample size

If few data points are available: combineadjacent cells to eliminate the raggedappearance of the histogram

Same datawith differentinterval sizes


9/48

9

Selecting Distributions: Histogram

Vehicle Arrival Example: # of vehicles arriving at an intersectionbetween 7-7:05 am was monitored for100random workdays.

There are ample data, so the histogram may have a cell foreach possible value in the data range

Arrivals per

Period Frequency0 12

1 10

2 19

3 17

4 105 8

6 7

7 5

8 5

9 3

10 3

11 1

Histogram of # of Arrivals during 7-7:05am

0

5

10

15

20

0 1 2 3 4 5 6 7 8 9 10 11

# of Arrivals

Frequ

ency


10/48

10

Selecting Distributions: Physical Basis

Most probability distributions were invented to

represent a particular physical situation. If we know the physical basis for a distribution, then

we can match it to the situation we have to model

A number of examples follow


11/48

11

binomial: Models the number of successes in n trials, when thetrials are independent with common success probability,p.Example: the number of defective components found in a lot of ncomponents.

negative binomial: Models the number of trials required toachieve ksuccesses. Example: the number of components thatwe must inspect to find 4 defective components.

Poisson: Models the number of independent events that occur in afixed amount of time or space. Ex: number of customers that arriveto a store during 1 hour, or number of defects found in 30 squaremeters of sheet metal.

normal: Models the distribution of a process that can be thought ofas the sum of a number of component processes. Ex: the time toassemble a product which is the sum of the times required for eachassembly operation.


12/48

12

lognormal: Models the distribution of a process that can be thought ofas theproductof a number of component processes. Example: therate of return on an investment, when interest is compounded, is the

product of the returns for a number of periods. Also widely used to

model stock prices.

exponential: Models the time between independent events, or aprocess time which is memoryless. Example: the time to failure for a

system that has constant failure rate over time. Note: if the timebetween events is exponential, then the number of events is Poisson.

Erlang: The sum ofkidentical exponential random variables. A

special case of the gamma...

gamma: An extremely flexible distribution used to model nonnegativerandom variables.


13/48

13

beta: An extremely flexible distribution used to model bounded

(fixed upper and lower limits) random variables.

Weibull: Models the time to failure (minimum of a number ofpossible causes); can model increasing or decreasing failure rate

hazard. Ex: the time to failure for a disk drive. discrete or continuous uniform: Models complete uncertainty,

since all outcomes are equally likely.

triangular: Models a process when only the minimum, most likelyand maximum values of the distribution are known. Ex: theminimum, most likely and maximum inflation rate we will have this

year.

empirical: Reuses the data themselves by making each observedvalue equally likely. Can be interpolated to obtain a continuousdistribution.


14/48

14

Fitting

Determine the unknown parameters for the distribution. Forexample, if you believe the data is from normal distribution,then you need to determine the mean and variance of thedistribution.

Common methods for fitting distributions are maximumlikelihood, method of moments, and least squares.

While the method matters, the variability in the data often

overwhelms the differences in the estimators (see Section 9.3).

Remember: There is no true distribution just waiting to befound!


15/48


16/48

16

Goodness-of-fit Tests

In the test...

H0: the chosen distribution fits the dataH1: the chosen distribution does not fit the data

Thep-value of a test is the Type I error level

(significance) at which we wouldjust reject H0 for thegiven data. If the level is less thanp-value, we donot reject H0; otherwise, we reject H0.

Thus, a large (> 0.10)p-value supports H0 that thedistribution fits.


17/48

17

Goodness-of-fit Tests

If there are little data, the goodness-of-fit test is likely

to accept any distributions. Why?

If there are lots of data, the goodness-of-fit test islikely to reject any distributions. Why?


18/48

18

Chi-squared Test

A histogram-based of test

X


19/48

19

Chi-square Test

Arrange the n observations into a kcells, the test statistics is:

which approximately follows the chi-square distribution with k-s-1 degrees

of freedom, where s = # of parameters of the hypothesized distributionestimated by the sample statistics.

Valid only for large sample size

Each cell has at least 5 observations for both Oiand Ei

Result of the test depends on grouping of the data

==

k

i i

ii

EEO

1

2

20

)(


20/48

20

Chi-squared Test Vehicle arrival example (page 9), sample mean 3.64

H0: Data are Poisson distributed with mean 3.64

H1: Data are not Poisson distributed with mean 3.64

Degree of freedom is k-s-1 = 7-1-1 = 5and so thep-value is

0.00004. What is your conclusion?

!

)(

x

en

xnpE

x

i

=

=xi Observed Frequency, Oi Expected Frequency, Ei (Oi - Ei)2/Ei0 12 2.61 10 9.62 19 17.4 0.15

3 17 21.1 0.84 19 19.2 4.415 6 14.0 2.576 7 8.5 0.267 5 4.48 5 2.09 3 0.8

10 3 0.3> 11 1 0.1

100 100.0 27.68

7.87

11.62

Combined becauseof min Ei


21/48

21

Kolmogorov-Smirnov Test

X


22/48

22


Empirical Distribution

If we have n observations x1,x2,,xn, thenSn(x) = (number of x1,x2,,xn that are x) / n

0

1

x1 x2 x3 x4

D = max| F(x) - Sn(x)|


23/48

23

Graphic Analysis

Graphic analysis includes: histogram with fitted

distribution and q-q plot. Goodness-of-fit tests represent lack of fit by a

summary statistic, while plots show where the lack offit occurs and whether it is important.

Goodness-of-fit tests may accept the fit, but the plots

may suggest the opposite, especially when thenumber of observations is small.


24/48

24

Graphic Analysis

A data set of 30 observations isbelieved to be from a normaldistribution. The following are thep-values from chi-square test andK-S test:

Chi-square test: 0.166K-S test: >0.15

What is your conclusion?


25/48

25

Arena Input Analyzer

A tool in Arena for fitting distributions to data.

Fits beta, empirical, Erlang, exponential, gamma,Johnson, lognormal, normal, Poisson, triangular,uniform and Weibull.

Reportsp-values for2

and K-S tests and provides ahistogram plot.


26/48


27/48

27

Text Report

Distribution Summary

Distribution: Lognormal

Expression: 3 + LOGN(9.22, 7.13)

Square Error: 0.004997

Chi Square Test

Number of intervals = 8

Degrees of freedom = 5

Test Statistic = 9.13

Corresponding p-value = 0.106


Test Statistic = 0.0551

Corresponding p-value > 0.15

Data Summary

Number of Data Points = 200

Min Data Value = 3.38

Max Data Value = 36.1

Sample Mean = 12

Sample Std Dev = 5.91

Histogram Summary

Histogram Range = 3 to 37

Number of Intervals = 14


28/48

28

Usage Notes

The Fit All option tries all relevant distributions and picks theone with the smallest squared error. It can easily be fooled!

Be sure to try different numbers of histogram cells; it affectsthep-value of the 2 test, and your perception of the fit.

Since EXPO is a special case of ERLA which is a specialcase of GAMMA, Fit All rarely selects EXPO or ERLA.Similarly, EXPO is a special case of WEIB.

Raw data can be read in from text files (looks for .dst), onevalue per line.


29/48

29

Vehicle Arrival Data (Fit All)


Distribution: WeibullExpression: -0.5 + WEIB(4.59, 1.51)

Square Error: 0.006067

Chi Square TestNumber of intervals = 6Degrees of freedom = 3

Test Statistic = 2.97Corresponding p-value = 0.414

Data Summary

Number of Data Points = 100Min Data Value = 0

Max Data Value = 11Sample Mean = 3.64Sample Std Dev = 2.76

Histogram Summary

Histogram Range = -0.5 to 11.5Number of Intervals = 12


30/48

30

Vehicle Arrival Data (Fit Poisson)


Distribution: PoissonExpression: POIS(3.64)Square Error: 0.025236

Chi Square TestNumber of intervals = 6Degrees of freedom = 4Test Statistic = 19.8

Corresponding p-value < 0.005

Data Summary

Number of Data Points = 100Min Data Value = 0Max Data Value =11

Sample Mean = 3.64Sample Std Dev = 2.76

Histogram Summary

Histogram Range = -0.001 to 11Number of Intervals = 12


31/48

31

q-q Plot

Recall that one way to generate data from cdfFis via

The q-q plot displays the sorteddata

vs.

)(1

RFY

=

nYYY L21

nnF

nF

nF 2/1,,2/3,2/1 111 K


32/48

32

q-q Plot Intuition

We have a sample Y1, Y2,,Yn and we fit a distributionFthat we hope is good.

If we now generate a random sample of size n from Fit should look about like Y

1, Y

2,,Y

n.

The q-q plot generates aperfectrandom sample forcomparison.


33/48

33

Features of the q-q Plot

It does not depend on how the data are grouped.

It is much better than a histogram when the numberof data points is small.

Deviations from a straight line show where thedistribution does not match.

A straight line implies the familyof distributions iscorrect; a 45o line implies correctparameters.


34/48

34

Examples of q-q plot

10

15

20

25

30

10 15 20 25 30

The fitting is

pretty good!

5

10

15

20

25

30

35

40

45

50

5 10 15 20 25 30 35 40 45 50

The distributionfamily seems OKbut the parameters

are not


35/48

35

Example of q-q plot

Poor fit! Miss badly

in the left tail

A data set of 30 observations

is believed to be from anormal distribution. Thefollowing are thep-valuesfrom chi-square test and K-S

test:

Chi-square test: 0.166

K-S test: >0.15

10

15

20

25

30

35

10 15 20 25 30 35


36/48

36

Using the Data Itself

When might we want to use the data itself?

When no standard distribution fits well

When we have no justification for a standard distribution

When there is too little data to distinguish between

standard distributions

Empirical distribution

Each data point is equally likely to be resampled.

In Arena: DISCRETE(1/n, X1, 2/n, X2,, 1, Xn)

Example: Fit data 2.1, 5.7, 3.4, 8.1 (min) DISCRETE(.25, 2.1, .5, 3.4, .75, 5.7, 1, 8.1)


37/48

37

Pros and Cons

As the sample size n goes to infinity, the empiricaldistribution converges to the truth.

No assumed distribution need be selected.

Only the values we saw can appear again andnothing in the gaps.

Empirical distribution has no tail. It is easy to missextreme values


38/48

38

Interpolated Empirical To fill in gaps, we interpolate between the sorteddata points.

In Arena:

CONT(0, 2.1, .33, 3.4, .67, 5.7, 1, 8.1) CONT(0, X1, 1/(n-1), X2, 2/(n-1), X3,, 1, Xn)

Interpolated Empirical cdf

0

0.33

0.67

1

0

0.2

0.4

0.6

0.8

1

0 2 4 6 8 10

X

cumulativeprobability


39/48

39

Why not always use empirical?

Tail (extreme) behavior is often not well represented(especially when we have a small sample).

It is difficult to change the data to represent a shifted

mean or reduced variance.

A fitted distribution smoothes out random sample

oddities.

I M d li i h D


40/48

40

Input Modeling without Data

We have to use anything we can find...

Engineering data, standards and ratings can providecentral values

Expert opinion

Physical or conventional limitations can provide bounds Physical basis of the process can suggest appropriate

distribution families

B k i t M th d


41/48

41

Breakpoints Method

Useful for modeling quantities with a large number of possibleoutcomes, like quarterly sales volume or aggregate number of

overtime hours. Minimum data needed: smallest and largest possible values.

Ex: sales of XYZ-123 will be no less than 1000 units, but no more than5000 units.

UNIF(1000,5000)

Comments:

This is typically a poor model since the probability is spread outevenly from low to high. Thus, the extremes are just as likely asthe middle values.

However, if you want to be conservative (maximum uncertainty) oryou cannot justify any additional information, then this model maybe reasonable.

B k i t M th d


42/48

42

Breakpoints Method

Better data: smallest and largest possible values,and most likely value.

Ex: sales of XYZ-123 will be no less than 1000 units, nomore than 5000 units, and is most likely to be 3500 units.

Triangular distribution, TRIA(1000,3500,5000)Triang(1000, 3500, 5000)

Valuesx

10^-4

Values in Thousands

0

1

2

3

4

5

6

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

5.0

5.5

100.0%

0.5000 5.5000

@RISK Student VersionFor Academic Use Only

Uniform(1000, 5000)

Valuesx

10^-4

Values in Thousands

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

5.0

5.5

100.0%

0.5000 5.5000

@RISK Student VersionFor Academic Use Only

B k i t M th d


43/48

43

Breakpoints Method

Best data: smallest and largest possible values plus 1-3 breakpoints(values and a percentage chance of being less than that value). Ex: sales of XYZ-123 will be between 1000 and 5000, and

sales chance of being sales2000 25%3500 75%4500 99%

In Arena: CONT(0, 1000, .25, 2000, .75, 3500, .99, 4500, 1, 5000)

Comments: Use only as many breakpoints as you can confidently get. Three is usually the

maximum if no data are available. Try to get breakpoints near the extremes if possible, since the extremes are often

not realistic.

Sometimes it is easier to get people to give the chance of exceeding a value,

rather than being less than a value.

M & V i bilit M th d


44/48

44

Mean & Variability Method

Useful for modeling quantities with a large number ofpossible outcomes, like quarterly sales volume oraggregate number of overtime hours.

Also useful for modeling the variability in percentage

changes.

M & V i bilit M th d


45/48

45


Minimum data: mean value and an averagepercentage variation around that mean.

Example: Last year we sold 10,000 units of ABC-000.This year we expect a 15% increase, with a typical swing

of 5% above or below that value. NORM(11500, 0.05*11500)

Comments:

May need to round this value.

If the % swing > 33%, then negative values are possible.

Mean & Variabilit Method


46/48

46


Better data: mean value, an average

percentage variation around that mean andupper and lower limits.

Ex: Last year we sold 10,000 units of ABC-000. This year

we expect a 15% increase, with a typical swing of 5%above or below that value. But we wont sell less than8000 units, or more than 15,000 under any conditions

MN(15000, MX(NORM(11500, 0.05*11500),8000))

Discrete Outcomes


47/48

47

Discrete Outcomes

Used to model discrete events, like making or notmaking a sale, or whether a new product is ready toship in the first, second or third quarter.

Data: A percentage chance of each possible

outcome so that the total is 100%.

Discrete Outcomes


48/48

48

Discrete Outcomes

Example: We have an 80% chance of landing the contractwith Big Corp. If we land the contract, then there is only a

10% chance that the initial orders will arrive first quarter; a50% chance they will start the second quarter; and a 40%chance they will start in quarter 3. Sales will be $250K in

each quarter we produce. How to model the sales for the yearrelated to Big Corp?

input modeling for simulation

Documents