input modeling for simulation

Post on 03-Apr-2018

  • 7/28/2019 Input Modeling for Simulation


    Input Modeling for Simulation

    IEEM 313, Spring 2005

    L. Jeff Hong

    Dept of IEEM, HKUST


    Input Models

    Input models represent the uncertainty in a stochastic simulation.

    Random variates are generated based on the input models.

    Simulation outputs are determined by the input models.

    The quality of the output is no better than the quality of the inputs: Garbage In, Garbage Out!

    The quality of input models depends on the raw data, which is often collected by people who don't know simulation.


    Input Models

    The fundamental requirements for an input model are:

    It must be capable of representing the physical realities of the process.

    It must be easily tuned to the situation at hand.

    It must be amenable to random-variate generation.

    There is no true model for any stochastic input. The best that we can hope for is to obtain an approximation that yields useful results.


    Input Models

    A key distinction in input modeling problems is the presence or absence of data:

    When we have data, we fit a model to the data. Good software is available for this.

    When no data are available, we have to creatively use what we can get to construct an input model.


    Outline

    Input modeling with data

      Data collection

      Select candidate distributions

      Fitting and checking

      Arena Input Analyzer

    Input modeling without data

      Sources of information

      Incorporating expert opinion


    Input Modeling with Data

    1. Collect data.

    2. Select one or more candidate distributions, based on physical characteristics of the process and graphical examination of the data.

    3. Fit the distribution to the data (determine values for its unknown parameters).

    4. Check the fit to the data via tests and graphical analysis.

    5. If the distribution does not fit, select another candidate and go to 3, or use an empirical distribution.


    Suggestions on Data Collection

    Plan ahead: begin with a practice or pre-observing session, and watch for unusual circumstances.

    Analyze the data as it is being collected: check its adequacy.

    Combine homogeneous data sets, e.g. successive time periods, or the same time period on successive days.

    Be aware of data censoring: when a quantity is not observed in its entirety, there is a danger of leaving out extreme values.

    Collect input data, not performance data.


    Selecting Distributions: Histogram

    A frequency distribution or histogram is useful in determining the shape of a distribution.

    The number of class intervals depends on:

      The number of observations

      The dispersion of the data

    Suggested: the square root of the sample size.

    If few data points are available: combine adjacent cells to eliminate the ragged appearance of the histogram.

    [Figure: the same data with different interval sizes]
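    The square-root rule can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_counts` and the sample values are made up for this example, and equal-width cells over the data range are an assumption, not something the slide prescribes.

```python
import math

def histogram_counts(data, num_cells=None):
    """Count observations in equal-width cells spanning [min, max]."""
    n = len(data)
    if num_cells is None:
        # Rule of thumb from the slide: about sqrt(n) cells.
        num_cells = max(1, round(math.sqrt(n)))
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_cells or 1.0  # avoid a zero width if all values are equal
    counts = [0] * num_cells
    for x in data:
        # The maximum value is placed in the last cell.
        idx = min(int((x - lo) / width), num_cells - 1)
        counts[idx] += 1
    return counts

# Illustrative values only (not data from the course).
data = [2.1, 5.7, 3.4, 8.1, 4.4, 6.0, 3.9, 7.2, 5.1]
print(histogram_counts(data))      # sqrt(9) = 3 cells
print(histogram_counts(data, 6))   # same data, finer cells
```

    Running it with different values of `num_cells` shows how the apparent shape changes with the interval size.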


    Selecting Distributions: Histogram

    Vehicle Arrival Example: the number of vehicles arriving at an intersection between 7:00 and 7:05 am was monitored for 100 random workdays.

    There are ample data, so the histogram may have a cell for each possible value in the data range.

    Arrivals per Period    Frequency
    0                      12
    1                      10
    2                      19
    3                      17
    4                      10
    5                      8
    6                      7
    7                      5
    8                      5
    9                      3
    10                     3
    11                     1

    [Figure: Histogram of # of Arrivals during 7-7:05 am; x-axis: # of Arrivals (0 to 11), y-axis: Frequency (0 to 20)]


    Selecting Distributions: Physical Basis

    Most probability distributions were invented to represent a particular physical situation.

    If we know the physical basis for a distribution, then we can match it to the situation we have to model.

    A number of examples follow.


    binomial: Models the number of successes in n trials, when the trials are independent with common success probability p. Example: the number of defective components found in a lot of n components.

    negative binomial: Models the number of trials required to achieve k successes. Example: the number of components that we must inspect to find 4 defective components.

    Poisson: Models the number of independent events that occur in a fixed amount of time or space. Ex: number of customers that arrive to a store during 1 hour, or number of defects found in 30 square meters of sheet metal.

    normal: Models the distribution of a process that can be thought of as the sum of a number of component processes. Ex: the time to assemble a product, which is the sum of the times required for each assembly operation.


    lognormal: Models the distribution of a process that can be thought of as the product of a number of component processes. Example: the rate of return on an investment, when interest is compounded, is the product of the returns for a number of periods. Also widely used to model stock prices.

    exponential: Models the time between independent events, or a process time which is memoryless. Example: the time to failure for a system that has a constant failure rate over time. Note: if the time between events is exponential, then the number of events in a fixed time interval is Poisson.

    Erlang: The sum of k independent, identically distributed exponential random variables. A special case of the gamma.

    gamma: An extremely flexible distribution used to model nonnegative random variables.
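    The link between exponential gaps and Poisson counts can be checked with a small simulation. The sketch below is illustrative: the function name, the seed and the rate of 3.64 events per period are choices made for this example. For Poisson counts the sample mean and the sample variance should both be close to the rate.

```python
import random

def count_events_per_period(rate, num_periods, seed=0):
    """Simulate exponential interarrival times and count events in each unit-length period."""
    rng = random.Random(seed)
    counts = [0] * num_periods
    t = rng.expovariate(rate)          # time of the first event
    while t < num_periods:
        counts[int(t)] += 1            # the period index is the integer part of the event time
        t += rng.expovariate(rate)     # add the next exponential gap
    return counts

counts = count_events_per_period(rate=3.64, num_periods=2000)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
print(mean, var)  # both should be close to the rate for Poisson counts
```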


    beta: An extremely flexible distribution used to model bounded (fixed upper and lower limits) random variables.

    Weibull: Models the time to failure (minimum of a number of possible causes); can model an increasing or decreasing failure rate (hazard). Ex: the time to failure for a disk drive.

    discrete or continuous uniform: Models complete uncertainty, since all outcomes are equally likely.

    triangular: Models a process when only the minimum, most likely and maximum values of the distribution are known. Ex: the minimum, most likely and maximum inflation rate we will have this year.

    empirical: Reuses the data themselves by making each observed value equally likely. Can be interpolated to obtain a continuous distribution.


    Fitting

    Determine the unknown parameters for the distribution. For example, if you believe the data are from a normal distribution, then you need to determine the mean and variance of the distribution.

    Common methods for fitting distributions are maximum likelihood, method of moments, and least squares.

    While the method matters, the variability in the data often overwhelms the differences in the estimators (see Section 9.3).

    Remember: There is no true distribution just waiting to be found!
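    As a minimal sketch of what fitting means in practice, the maximum-likelihood estimates for two common families have closed forms. The function names and the four sample values are illustrative assumptions; other families (gamma, Weibull) generally need numerical optimization, which is not shown here.

```python
import math

def fit_normal(data):
    """Maximum-likelihood estimates (mean, standard deviation) for a normal distribution."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)  # the MLE divides by n, not n - 1
    return mu, sigma

def fit_exponential(data):
    """Maximum-likelihood estimate of the exponential rate: 1 / sample mean."""
    return len(data) / sum(data)

sample = [2.1, 5.7, 3.4, 8.1]  # illustrative values
print(fit_normal(sample))
print(fit_exponential(sample))
```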


    Goodness-of-fit Tests

    In the test...

    H0: the chosen distribution fits the data

    H1: the chosen distribution does not fit the data

    The p-value of a test is the Type I error level (significance) at which we would just reject H0 for the given data. If the level is less than the p-value, we do not reject H0; otherwise, we reject H0.

    Thus, a large (> 0.10) p-value supports H0 that the distribution fits.


    Goodness-of-fit Tests

    If there is little data, the goodness-of-fit test is likely to accept any distribution. Why?

    If there is a lot of data, the goodness-of-fit test is likely to reject any distribution. Why?


    Chi-squared Test

    A histogram-based test


    Chi-square Test

    Arrange the n observations into k cells. The test statistic is

    $$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

    which approximately follows the chi-square distribution with k-s-1 degrees of freedom, where s = # of parameters of the hypothesized distribution estimated by the sample statistics, O_i is the observed frequency in cell i and E_i is the expected frequency.

    Valid only for large sample sizes.

    Each cell should have at least 5 observations for both O_i and E_i.

    The result of the test depends on the grouping of the data.


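    Chi-squared Test

    Vehicle arrival example (see the vehicle arrival histogram earlier), sample mean 3.64.

    H0: Data are Poisson distributed with mean 3.64

    H1: Data are not Poisson distributed with mean 3.64

    Expected frequencies: $E_i = n\,p(x_i) = n\,\dfrac{e^{-\alpha}\alpha^{x_i}}{x_i!}$, with n = 100 and estimated mean $\alpha = 3.64$.

    xi      Observed Frequency, Oi   Expected Frequency, Ei   (Oi - Ei)^2 / Ei
    0       12                       2.6                      7.87 (cells 0 and 1 combined)
    1       10                       9.6
    2       19                       17.4                     0.15
    3       17                       21.1                     0.8
    4       10                       19.2                     4.41
    5       8                        14.0                     2.57
    6       7                        8.5                      0.26
    7       5                        4.4                      11.62 (cells 7 through >= 11 combined)
    8       5                        2.0
    9       3                        0.8
    10      3                        0.3
    >= 11   1                        0.1
    Total   100                      100.0                    27.68

    Cells are combined because of the minimum E_i requirement.

    The degrees of freedom are k-s-1 = 7-1-1 = 5, and so the p-value is 0.00004. What is your conclusion?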
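    The calculation can be reproduced with a short script. This is a sketch under stated assumptions: the observed frequencies are those of the vehicle arrival frequency table (10 for four arrivals, 8 for five), the cell grouping is the one used on the slide, and `chi2_sf` is a helper written for this example that evaluates the chi-square upper tail with a standard series for the incomplete gamma function. Because the expected frequencies are not rounded to one decimal here, the statistic comes out close to, but not exactly, the value on the slide.

```python
import math

observed = [12, 10, 19, 17, 10, 8, 7, 5, 5, 3, 3, 1]     # frequencies for 0..11 arrivals
n = sum(observed)
alpha = sum(x * o for x, o in enumerate(observed)) / n   # sample mean

def poisson_pmf(x, mean):
    return math.exp(-mean) * mean ** x / math.factorial(x)

def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution (series for the lower incomplete gamma)."""
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower

# Cells combined as on the slide so that every expected count is at least 5.
cells = [[0, 1], [2], [3], [4], [5], [6], [7, 8, 9, 10, 11]]
chi2 = 0.0
for i, cell in enumerate(cells):
    o = sum(observed[x] for x in cell)
    p = sum(poisson_pmf(x, alpha) for x in cell)
    if i == len(cells) - 1:
        # The last cell also absorbs the Poisson tail beyond 11 arrivals.
        p += 1.0 - sum(poisson_pmf(x, alpha) for x in range(12))
    e = n * p
    chi2 += (o - e) ** 2 / e

df = len(cells) - 1 - 1      # k - s - 1 with s = 1 estimated parameter
p_value = chi2_sf(chi2, df)
print(round(chi2, 2), df, p_value)
```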


    Kolmogorov-Smirnov Test

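    Empirical Distribution

    If we have n observations x1, x2, ..., xn, then Sn(x) = (number of x1, x2, ..., xn that are ≤ x) / n

    [Figure: step-function empirical cdf Sn(x), rising from 0 to 1 with jumps at x1, x2, x3, x4]

    D = max |F(x) - Sn(x)|, where F is the cdf of the hypothesized distribution and the maximum is taken over all x.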
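    A minimal sketch of computing D, assuming no tied observations. Since Sn(x) is a step function, the largest gap occurs at a data point, either just before or just after the jump, so only those points need checking. The function name and the sample values are illustrative; `NormalDist` is from the Python standard library.

```python
from statistics import NormalDist

def ks_statistic(data, cdf):
    """D = max over x of |F(x) - S_n(x)|, evaluated at the sorted data points."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # S_n jumps from (i-1)/n to i/n at x, so compare F(x) with both levels.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

sample = [2.1, 5.7, 3.4, 8.1]                # illustrative values
fitted = NormalDist.from_samples(sample)     # normal with the sample mean and standard deviation
print(ks_statistic(sample, fitted.cdf))
```

    Converting D into a p-value requires the K-S distribution (and an adjustment when parameters are estimated from the same data), which this sketch does not attempt.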


    Graphic Analysis

    Graphic analysis includes: histogram with fitted distribution and q-q plot.

    Goodness-of-fit tests represent lack of fit by a summary statistic, while plots show where the lack of fit occurs and whether it is important.

    Goodness-of-fit tests may accept the fit, but the plots may suggest the opposite, especially when the number of observations is small.


    Graphic Analysis

    A data set of 30 observations is believed to be from a normal distribution. The following are the p-values from the chi-square test and K-S test:

    Chi-square test: 0.166

    K-S test: >0.15

    What is your conclusion?


    Arena Input Analyzer

    A tool in Arena for fitting distributions to data.

    Fits beta, empirical, Erlang, exponential, gamma, Johnson, lognormal, normal, Poisson, triangular, uniform and Weibull.

    Reports p-values for χ² and K-S tests and provides a histogram plot.


    Text Report

    Distribution Summary

    Distribution: Lognormal

    Expression: 3 + LOGN(9.22, 7.13)

    Square Error: 0.004997

    Chi Square Test

    Number of intervals = 8

    Degrees of freedom = 5

    Test Statistic = 9.13

    Corresponding p-value = 0.106

    Kolmogorov-Smirnov Test

    Test Statistic = 0.0551

    Corresponding p-value > 0.15

    Data Summary

    Number of Data Points = 200

    Min Data Value = 3.38

    Max Data Value = 36.1

    Sample Mean = 12

    Sample Std Dev = 5.91

    Histogram Summary

    Histogram Range = 3 to 37

    Number of Intervals = 14


    Usage Notes

    The Fit All option tries all relevant distributions and picks the one with the smallest squared error. It can easily be fooled!

    Be sure to try different numbers of histogram cells; it affects the p-value of the χ² test, and your perception of the fit.

    Since EXPO is a special case of ERLA, which is a special case of GAMMA, Fit All rarely selects EXPO or ERLA. Similarly, EXPO is a special case of WEIB.

    Raw data can be read in from text files (looks for .dst), one value per line.


    Vehicle Arrival Data (Fit All)

    Distribution Summary

    Distribution: Weibull

    Expression: -0.5 + WEIB(4.59, 1.51)

    Square Error: 0.006067

    Chi Square Test

    Number of intervals = 6

    Degrees of freedom = 3

    Test Statistic = 2.97

    Corresponding p-value = 0.414

    Data Summary

    Number of Data Points = 100

    Min Data Value = 0

    Max Data Value = 11

    Sample Mean = 3.64

    Sample Std Dev = 2.76

    Histogram Summary

    Histogram Range = -0.5 to 11.5

    Number of Intervals = 12


    Vehicle Arrival Data (Fit Poisson)

    Distribution Summary

    Distribution: Poisson

    Expression: POIS(3.64)

    Square Error: 0.025236

    Chi Square Test

    Number of intervals = 6

    Degrees of freedom = 4

    Test Statistic = 19.8

    Corresponding p-value < 0.005

    Data Summary

    Number of Data Points = 100

    Min Data Value = 0

    Max Data Value = 11

    Sample Mean = 3.64

    Sample Std Dev = 2.76

    Histogram Summary

    Histogram Range = -0.001 to 11

    Number of Intervals = 12


    q-q Plot

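    Recall that one way to generate data from cdf F is via $Y = F^{-1}(R)$, where R is uniform on (0, 1).

    The q-q plot displays the sorted data

    $$Y_1 \le Y_2 \le \cdots \le Y_n$$

    vs.

    $$F^{-1}\left(\frac{1 - 1/2}{n}\right),\; F^{-1}\left(\frac{2 - 1/2}{n}\right),\; \ldots,\; F^{-1}\left(\frac{n - 1/2}{n}\right)$$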
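    The points of a q-q plot can be computed directly from this definition. The sketch below is illustrative: the function name and the sample values are made up, and a fitted normal from the standard library stands in for F. Plotting the resulting pairs with any plotting tool gives the q-q plot.

```python
from statistics import NormalDist

def qq_points(data, inverse_cdf):
    """Pair the j-th smallest observation with F^{-1}((j - 1/2) / n)."""
    ys = sorted(data)
    n = len(ys)
    return [(inverse_cdf((j - 0.5) / n), y) for j, y in enumerate(ys, start=1)]

sample = [2.1, 5.7, 3.4, 8.1]                 # illustrative values
fitted = NormalDist.from_samples(sample)      # stands in for the fitted cdf F
for q, y in qq_points(sample, fitted.inv_cdf):
    print(f"fitted quantile {q:6.2f}   data {y:6.2f}")
```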


    q-q Plot Intuition

    We have a sample Y1, Y2, ..., Yn and we fit a distribution F that we hope is good.

    If we now generate a random sample of size n from F, it should look about like Y1, Y2, ..., Yn.

    The q-q plot generates a perfect random sample for comparison.


    Features of the q-q Plot

    It does not depend on how the data are grouped.

    It is much better than a histogram when the number of data points is small.

    Deviations from a straight line show where the distribution does not match.

    A straight line implies the family of distributions is correct; a 45° line implies correct parameters.


    Examples of q-q plot

    [Figure: q-q plot, both axes from 10 to 30. The fitting is pretty good!]

    [Figure: q-q plot, both axes from 5 to 50. The distribution family seems OK but the parameters are not.]


    Example of q-q plot

    A data set of 30 observations is believed to be from a normal distribution. The following are the p-values from the chi-square test and K-S test:

    Chi-square test: 0.166

    K-S test: >0.15

    [Figure: q-q plot, both axes from 10 to 35. Poor fit! Misses badly in the left tail.]


    Using the Data Itself

    When might we want to use the data itself?

    When no standard distribution fits well

    When we have no justification for a standard distribution

    When there is too little data to distinguish between

    standard distributions

    Empirical distribution

    Each data point is equally likely to be resampled.

    In Arena: DISCRETE(1/n, X1, 2/n, X2, ..., 1, Xn)

    Example: Fit data 2.1, 5.7, 3.4, 8.1 (min)

    DISCRETE(.25, 2.1, .5, 3.4, .75, 5.7, 1, 8.1)
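    Outside Arena, the same empirical distribution can be sketched in Python. The helper names are hypothetical: `discrete_pairs` lists the cumulative-probability and value pairs in the order shown in the DISCRETE expression, and `sample_empirical` resamples one observation with equal probability.

```python
import random

def discrete_pairs(data):
    """Return (cumulative probability, value) pairs for the sorted data."""
    xs = sorted(data)
    n = len(xs)
    return [((i + 1) / n, x) for i, x in enumerate(xs)]

def sample_empirical(data, rng=random):
    """Resample one observation, each with probability 1/n."""
    return rng.choice(list(data))

times = [2.1, 5.7, 3.4, 8.1]
pairs = discrete_pairs(times)
print(pairs)
print(sample_empirical(times))
```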


    Pros and Cons

    As the sample size n goes to infinity, the empirical distribution converges to the truth.

    No assumed distribution need be selected.

    Only the values we saw can appear again, and nothing in the gaps.

    The empirical distribution has no tail. It is easy to miss extreme values.


    Interpolated Empirical

    To fill in gaps, we interpolate between the sorted data points.

    In Arena:

    CONT(0, 2.1, .33, 3.4, .67, 5.7, 1, 8.1)

    CONT(0, X1, 1/(n-1), X2, 2/(n-1), X3, ..., 1, Xn)

    [Figure: Interpolated Empirical cdf; cumulative probability (0, 0.33, 0.67, 1) vs. X (0 to 10)]
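    A sketch of sampling from the interpolated empirical cdf by inverse transform, assuming at least two observations. The function name is hypothetical; it inverts the piecewise-linear cdf whose breakpoints are the sorted data at cumulative probabilities 0, 1/(n-1), ..., 1.

```python
import random

def interpolated_empirical_inverse(u, data):
    """Inverse of the piecewise-linear cdf through the sorted data, for 0 <= u <= 1."""
    xs = sorted(data)
    n = len(xs)
    pos = u * (n - 1)            # position along the n - 1 segments
    i = min(int(pos), n - 2)     # index of the segment's left endpoint
    frac = pos - i               # fraction of the way through that segment
    return xs[i] + frac * (xs[i + 1] - xs[i])

times = [2.1, 5.7, 3.4, 8.1]
print(interpolated_empirical_inverse(0.5, times))              # halfway between 3.4 and 5.7
print(interpolated_empirical_inverse(random.random(), times))  # one random variate
```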


    Why not always use empirical?

    Tail (extreme) behavior is often not well represented (especially when we have a small sample).

    It is difficult to change the data to represent a shifted mean or reduced variance.

    A fitted distribution smooths out random sample oddities.


    Input Modeling without Data

    We have to use anything we can find...

    Engineering data, standards and ratings can provide central values

    Expert opinion

    Physical or conventional limitations can provide bounds

    Physical basis of the process can suggest appropriate distribution families


    Breakpoints Method

    Useful for modeling quantities with a large number of possible outcomes, like quarterly sales volume or aggregate number of overtime hours.

    Minimum data needed: smallest and largest possible values.

    Ex: sales of XYZ-123 will be no less than 1000 units, but no more than 5000 units.

    UNIF(1000, 5000)

    Comments:

    This is typically a poor model since the probability is spread out evenly from low to high. Thus, the extremes are just as likely as the middle values.

    However, if you want to be conservative (maximum uncertainty) or you cannot justify any additional information, then this model may be reasonable.


    Breakpoints Method

    Better data: smallest and largest possible values, and most likely value.

    Ex: sales of XYZ-123 will be no less than 1000 units, no more than 5000 units, and is most likely to be 3500 units.

    Triangular distribution, TRIA(1000, 3500, 5000)

    [Figure: @RISK density plots of Triang(1000, 3500, 5000) and Uniform(1000, 5000); x-axis: values in thousands (0.5 to 5.5), y-axis: density (values x 10^-4)]
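    To see how the most likely value changes the model, the sketch below draws from both distributions with the Python standard library and compares the share of draws within 500 units of 3500. The seed, the sample size and the 500-unit window are arbitrary choices for this illustration.

```python
import random

low, mode, high = 1000, 3500, 5000
rng = random.Random(1)

uniform_draws = [rng.uniform(low, high) for _ in range(10000)]
# Note the argument order of random.triangular: (low, high, mode).
triangular_draws = [rng.triangular(low, high, mode) for _ in range(10000)]

def share_near_mode(draws, half_width=500):
    """Fraction of draws within half_width of the most likely value."""
    return sum(abs(x - mode) <= half_width for x in draws) / len(draws)

print(share_near_mode(uniform_draws))      # theoretical value is 0.25
print(share_near_mode(triangular_draws))   # larger, since the mass concentrates near 3500
```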


    Breakpoints Method

    Best data: smallest and largest possible values plus 1-3 breakpoints (values and a percentage chance of being less than that value).

    Ex: sales of XYZ-123 will be between 1000 and 5000, and

    sales    chance of being < sales
    2000     25%
    3500     75%
    4500     99%

    In Arena: CONT(0, 1000, .25, 2000, .75, 3500, .99, 4500, 1, 5000)

    Comments:

    Use only as many breakpoints as you can confidently get. Three is usually the maximum if no data are available.

    Try to get breakpoints near the extremes if possible, since the extremes are often not realistic.

    Sometimes it is easier to get people to give the chance of exceeding a value, rather than being less than a value.
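    One quick sanity check on elicited breakpoints is to compute the mean they imply and ask the expert whether it looks reasonable. The sketch below assumes the cdf is linear between breakpoints, as in the CONT expression; the function name is hypothetical.

```python
def cont_mean(breakpoints):
    """Mean of a piecewise-linear cdf given as (cumulative probability, value) pairs."""
    mean = 0.0
    for (p0, v0), (p1, v1) in zip(breakpoints, breakpoints[1:]):
        if p1 < p0 or v1 < v0:
            raise ValueError("breakpoints must be nondecreasing")
        # Within a segment the value is uniform, so its average is the midpoint.
        mean += (p1 - p0) * (v0 + v1) / 2
    return mean

sales = [(0, 1000), (0.25, 2000), (0.75, 3500), (0.99, 4500), (1, 5000)]
print(cont_mean(sales))
```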


    Mean & Variability Method

    Useful for modeling quantities with a large number of possible outcomes, like quarterly sales volume or aggregate number of overtime hours.

    Also useful for modeling the variability in percentage changes.


    Mean & Variability Method

    Minimum data: mean value and an average percentage variation around that mean.

    Example: Last year we sold 10,000 units of ABC-000. This year we expect a 15% increase, with a typical swing of 5% above or below that value.

    NORM(11500, 0.05*11500)

    Comments:

    May need to round this value.

    If the % swing exceeds 33%, zero lies within three standard deviations of the mean, so negative values become a practical possibility.
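    The 33% guideline can be checked numerically. For a normal model with standard deviation equal to swing x mean (and a positive mean), the chance of a negative value is Φ(-1/swing), which does not depend on the mean. The function name below is hypothetical; `NormalDist` is from the Python standard library.

```python
from statistics import NormalDist

def prob_negative(swing):
    """P(X < 0) when X ~ NORM(mean, swing * mean) with mean > 0; equals Phi(-1 / swing)."""
    return NormalDist().cdf(-1.0 / swing)

for swing in (0.05, 0.33, 0.50):
    print(swing, prob_negative(swing))
```

    At a 5% swing the probability is negligible; around 33% it is roughly one in a thousand, and it grows quickly beyond that.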


    Mean & Variability Method

    Better data: mean value, an average percentage variation around that mean, and upper and lower limits.

    Ex: Last year we sold 10,000 units of ABC-000. This year we expect a 15% increase, with a typical swing of 5% above or below that value. But we won't sell less than 8000 units, or more than 15,000, under any conditions.

    MN(15000, MX(NORM(11500, 0.05*11500), 8000))
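    The same min/max construction can be sketched in Python. Note that this clamps the draw to the limits, which places a small probability mass exactly at 8000 and 15,000, rather than resampling until the value falls inside the range. The function name and the seed are choices made for this illustration.

```python
import random

def clamped_normal(mean, std, lower, upper, rng=random):
    """Draw from a normal and clamp to [lower, upper], like MN(upper, MX(NORM(mean, std), lower))."""
    return min(upper, max(rng.gauss(mean, std), lower))

rng = random.Random(7)
draws = [clamped_normal(11500, 0.05 * 11500, 8000, 15000, rng) for _ in range(1000)]
print(min(draws), max(draws))
```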


    Discrete Outcomes

    Used to model discrete events, like making or not making a sale, or whether a new product is ready to ship in the first, second or third quarter.

    Data: A percentage chance of each possible outcome so that the total is 100%.


    Discrete Outcomes

    Example: We have an 80% chance of landing the contract with Big Corp. If we land the contract, then there is only a 10% chance that the initial orders will arrive in the first quarter; a 50% chance they will start in the second quarter; and a 40% chance they will start in quarter 3. Sales will be $250K in each quarter we produce. How to model the sales for the year related to Big Corp?
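    One possible way to set this up, offered as a sketch rather than the intended answer: assume that once orders start, production continues in every remaining quarter of the year, so a first-quarter start yields four quarters of sales, a second-quarter start three, and a third-quarter start two. Under that assumption, annual sales take one of four values, and the outcome table below plays the role of a discrete distribution.

```python
import random

# (probability, annual sales in $K), under the assumption stated above.
outcomes = [
    (0.20, 0),             # contract not landed
    (0.80 * 0.10, 1000),   # orders start in Q1: 4 quarters x $250K
    (0.80 * 0.50, 750),    # orders start in Q2: 3 quarters
    (0.80 * 0.40, 500),    # orders start in Q3: 2 quarters
]

expected_sales = sum(p * s for p, s in outcomes)

def sample_annual_sales(rng=random):
    """Draw one annual sales figure by inverting the cumulative probabilities."""
    u = rng.random()
    cumulative = 0.0
    for p, s in outcomes:
        cumulative += p
        if u < cumulative:
            return s
    return outcomes[-1][1]  # guard against floating-point round-off

print(expected_sales)
print(sample_annual_sales())
```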