Title: Selecting Input Probability Distributions
1Selecting Input Probability Distributions
2Outline
- Sources of randomness
- Pitfalls in modeling simulation input data
- Choosing a distribution when data are available
- Hypothesizing families of distributions
- Estimation of parameters
- Determining how representative the fitted
distributions are
- Chi-square test
- Choosing a distribution in the absence of data
3Sources of Randomness
- Almost all real-world systems contain one or more
sources of randomness. The following are common
sources of randomness in manufacturing systems:
- Interarrival times of parts or raw materials,
- Processing or assembly times of parts,
- Times to failure of a machine,
- Repair times for a machine, and
- Setup times for a machine.
- Failure to choose the correct probability
distributions may drastically affect a model's
results.
4Pitfalls in Modeling Simulation Input Data
- 1) Replacing a distribution by its mean
- Example: Assume an insurance company has a claims
department of 3 employees, and each claim is
processed by all three employees in turn.
- Insurance claims arrive at the claims department
every 10 minutes (inter-arrival time) for
processing.
- When a claim arrives, it takes 1 min to transfer
the claim to the first employee. If the first
employee is not free, the claim waits on his
desk. When the first employee becomes free, it
takes 10 min to process the claim. When the first
employee finishes working on the claim, the claim
is transferred to the second employee for further
processing. This transfer takes 1 min.
- Once the second employee is available, it takes
10 min to complete his portion of the process.
When the second employee finishes, the claim is
transferred to the third and final employee. This
transfer takes 1 min.
- Once the third employee is available, it takes 10
min to perform his portion of the process. When
the third employee finishes, the claim is
complete and is transferred to the mailroom, where
it is sent to the customer with the approval or
disapproval decision.
5Pitfalls in Modeling Simulation Input Data
Simple graphical representation
Model Input data
Simulation model (Averages)
6Pitfalls in Modeling Simulation Input Data
Simulation Output (Averages)
- From the animation and the output:
- Queues are not building,
- Cycle time is not fluctuating,
- No problems in the system.
- Note: This output is similar to what a static
tool such as a spreadsheet or a process map would
give.
7Pitfalls in Modeling Simulation Input Data
Reality
- In reality, the arrival of claims and the
department's operations would never work in perfect
rhythm; there is variability.
- In reality, variability occurs in everyday
situations and in any business. This is where
the power of simulation over other methods
arises.
- Variability and its effect on business operations
and decision making will be demonstrated in the
claims department simulation model.
8Pitfalls in Modeling Simulation Input Data
The inter-arrival times, processing times, and
transfer times used previously in the example
were averages. Let us go back to reality:
variability.
The Real Model Input (Variability - Distributions)
9Distributions Used in the Model
- Mean = 1
- Mean = 10
- Mean = 10, s = 2 (s = standard deviation)
- Min = 8, Mode = 10, Max = 12
- Min = 8, Max = 12
10Pitfalls in Modeling Simulation Input Data
Simulation model (Distribution)
Simulation Output (Distribution)
- From the animation and the output:
- Queues are building,
- Cycle time is fluctuating,
- There are significant problems in the system.
- The output is not similar to the output based on
averages.
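The contrast between the two runs can be illustrated with a minimal sketch. It is not the model shown in the slides. It assumes three single-server stations that serve claims in arrival order, a constant 1 min transfer, exponential inter-arrival times with mean 10, and triangular processing times (min 8, mode 10, max 12). The pairing of these distributions with the inputs is an assumption, and the names `simulate_claims` and `mean` are hypothetical.

```python
import random

def simulate_claims(n_claims, interarrival, service, transfer=1.0, n_stations=3):
    """Return the cycle time of each claim in a serial line of single-server stations."""
    free_at = [0.0] * n_stations      # time at which each employee next becomes free
    arrival = 0.0
    cycle_times = []
    for _ in range(n_claims):
        arrival += interarrival()
        t = arrival
        for k in range(n_stations):
            t += transfer                 # constant transfer keeps claims in order
            start = max(t, free_at[k])    # claim waits on the desk if the employee is busy
            t = start + service()
            free_at[k] = t
        cycle_times.append(t - arrival)
    return cycle_times

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Case 1: every input replaced by its average.
avg_run = simulate_claims(2000, lambda: 10.0, lambda: 10.0)

# Case 2: same averages, but with variability (assumed distributions).
var_run = simulate_claims(
    2000,
    lambda: rng.expovariate(1 / 10),      # exponential, mean 10
    lambda: rng.triangular(8, 12, 10),    # low 8, high 12, mode 10
)

print("averages only:    mean cycle time =", mean(avg_run))
print("with variability: mean cycle time =", round(mean(var_run), 1))
```

With averages only, no claim ever waits, so every cycle time equals the sum of the transfer and processing times. With variability, claims queue at the desks and the cycle time grows, even though the input means are unchanged.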
11Pitfalls in Modeling Simulation Input Data
12Pitfalls in Modeling Simulation Input Data
Averages
Distributions (Variability)
It is evident that using only averages can
have a large impact on simulation output and on
the quality of decisions made with the simulation
results.
13Pitfalls in Modeling Simulation Input Data
(Contd)
- 1) Replacing a distribution by its mean
- 2) Selecting the wrong distribution
- In the example: Suppose that 200 claim
processing times are available for the first
process, but their underlying probability
distribution is unknown. Using some methods
(described later), the following distributions
are fit to the observed data: Normal, Triangular,
Lognormal, Beta, and Weibull.
14Distributions Used for Process 1
15Pitfalls in Modeling Simulation Input Data
(Contd)
- Then, a simulation run of length 1600 hours is
made using each of the five distributions. If
the normal distribution is the best fit for the
data, the following errors in cycle time are
observed when using the other distributions.
It is evident that the choice of probability
distribution can have a large impact on
simulation output and on the quality of decisions
made with the simulation results.
16Choosing a Distribution When Data are Available
- There are three steps in determining what
probability distribution best represents a set of
data:
- 1. Hypothesize families of distributions,
- 2. Estimate parameters, and
- 3. Determine how representative the fitted
distributions are.
17Hypothesizing Families of Distributions
- The first step in selecting a particular input
distribution is to decide what general families
(e.g., exponential, normal) appear to be
appropriate on the basis of their shapes.
- Some general techniques used in hypothesizing
families of distributions include using:
- Prior knowledge
- Summary statistics
- Histograms
18Use of Prior Knowledge
- In some situations, prior knowledge about a
certain random variable's role in the system can
be used to select a distribution or at least to
rule out some distributions. For example:
- If customers arrive one at a time, at a constant
rate, so that the numbers of customers arriving
in disjoint time intervals are independent, the
interarrival times are probably exponentially
distributed.
- Service times should (at least in principle) not
be generated directly from a normal distribution,
since a normal random variable can take negative
values.
- The proportion of defective items in a large
batch should not be assumed to have a gamma
distribution, since proportions must be between 0
and 1 and gamma random variables have no upper
bound.
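As an illustration of why a normal distribution is questionable for service times, the following sketch computes the probability that a normal random variable is negative. The function name `prob_negative` and the mean and standard deviation pairs are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

def prob_negative(mean, sd):
    """P(X < 0) for X ~ Normal(mean, sd)."""
    return NormalDist(mu=mean, sigma=sd).cdf(0.0)

# Hypothetical service-time parameters: the larger the standard deviation
# relative to the mean, the more often a negative "time" would be generated.
for mean, sd in [(10, 2), (10, 5), (10, 10)]:
    print(f"mean={mean}, sd={sd}: P(X < 0) = {prob_negative(mean, sd):.2e}")
```

When the standard deviation is small relative to the mean, the probability is negligible in practice, which is why the guideline is stated "in principle".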
19Use of Summary Statistics
- Summary statistics may be used in some situations
to suggest an appropriate distribution. Some
guidelines are:
- For a symmetric continuous distribution (e.g.,
normal), the mean is equal to the median.
- If the coefficient of variation, cv (the ratio of
the standard deviation to the mean), is close to
one, it suggests an exponential distribution.
- Skewness is a measure of the symmetry of a
distribution:
- for symmetric distributions (e.g., normal),
skewness = 0
- if the distribution is skewed to the right,
skewness > 0
- if the distribution is skewed to the left,
skewness < 0
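These guidelines can be checked on a data set with a few lines of code. The sketch below is a minimal illustration that assumes the usual sample estimators (sample standard deviation over sample mean for cv, and the third central moment over the 1.5 power of the second for skewness). The function names and the small data sets are hypothetical.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(data) / statistics.mean(data)

def skewness(data):
    """Third central moment divided by the 1.5 power of the second central moment."""
    n = len(data)
    m = statistics.mean(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical data sets for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 12]
left_skewed = [13 - x for x in right_skewed]

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "cv:", round(coefficient_of_variation(data), 3),
          "skewness:", round(skewness(data), 3))
```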
20Use of Summary Statistics (Contd)
- For a discrete distribution, the lexis ratio
(the ratio of the variance to the mean) plays an
important role:
- for Poisson, lexis ratio = 1
- for binomial, lexis ratio < 1
- for negative binomial, lexis ratio > 1
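A minimal sketch of this guideline follows, assuming the lexis ratio is estimated as the sample variance divided by the sample mean. The helper `suggest_family` and its tolerance `tol` are hypothetical; in practice, how close to 1 counts as "equal" is a judgment call.

```python
import statistics

def lexis_ratio(counts):
    """Sample variance divided by the sample mean of discrete (count) data."""
    return statistics.variance(counts) / statistics.mean(counts)

def suggest_family(counts, tol=0.1):
    """Suggest a discrete family from the lexis ratio (tol is an assumed threshold)."""
    tau = lexis_ratio(counts)
    if abs(tau - 1) <= tol:
        return "Poisson"
    return "binomial" if tau < 1 else "negative binomial"

# Hypothetical count data for illustration.
for counts in ([0, 1, 2, 2, 3, 4], [2, 3, 2, 3, 2, 3], [0, 1, 2, 3, 4]):
    print(counts, round(lexis_ratio(counts), 3), suggest_family(counts))
```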
21Estimation of Parameters
- Once one or more candidate families of
distributions have been hypothesized, the values
of their parameters (i.e., shape, scale, or
location) must be specified.
- The most popular method for estimation of
parameters is the method of maximum likelihood.
- For a particular distribution, the method of
maximum likelihood selects those values for the
parameters that maximize the likelihood (or
probability) of having obtained the observed data
from the distribution.
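As a small illustration of maximum likelihood, consider the exponential distribution with mean beta, for which the maximum likelihood estimate of beta is the sample mean. The sketch below assumes that parameterization and uses hypothetical observations. It evaluates the log-likelihood at the estimate and at two other values to show that the estimate gives the largest value.

```python
import math

def exponential_log_likelihood(data, beta):
    """Log-likelihood of an exponential distribution with mean beta."""
    return sum(-math.log(beta) - x / beta for x in data)

def exponential_mle(data):
    """For the exponential distribution, the MLE of the mean is the sample mean."""
    return sum(data) / len(data)

data = [3.1, 0.4, 7.9, 2.2, 12.5, 5.0, 1.3, 9.6]  # hypothetical observations
beta_hat = exponential_mle(data)

for beta in (0.5 * beta_hat, beta_hat, 2.0 * beta_hat):
    print(f"beta = {beta:6.3f}  log-likelihood = "
          f"{exponential_log_likelihood(data, beta):.4f}")
```

For families without a closed-form estimate, the same log-likelihood function would be maximized numerically.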
22Determining How Representative the Fitted
Distributions Are
- After determining one or more probability
distributions that might fit the observed data,
the quality of the fitted distributions must be
evaluated using one or more heuristics.
- Two heuristics used in determining the goodness
of fit are:
- The Chi-square test
- The Kolmogorov-Smirnov test
23Chi-square Test
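- The chi-square test measures the error between a
candidate distribution's density function and the
histogram.
- The test statistic is
$$\chi^2 = \sum_{j=1}^{k} \frac{(N_j - n p_j)^2}{n p_j}$$
- where
- k = Number of intervals
- n = Total number of observations
- N_j = Number of observations in the jth interval
[a_{j-1}, a_j)
- n p_j = Expected number of observations that would
fall in the jth interval if we were sampling from
the fitted distribution.
- If $\chi^2 > \chi^2_{k-1,\,1-\alpha}$ (the upper
1 - \alpha critical point of a chi-square
distribution with k - 1 degrees of freedom), the
hypothesized distribution is rejected.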
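The test statistic can be computed directly from its definition. The sketch below is a minimal illustration. It assumes a fitted exponential distribution, k = 5 equal-probability intervals, and simulated data standing in for real observations. The function name `chi_square_statistic` is hypothetical, and the critical value 9.488 is the tabulated 0.95 point of a chi-square distribution with 4 degrees of freedom.

```python
import math
import random

def chi_square_statistic(data, cdf, edges):
    """Sum over intervals [a_{j-1}, a_j) of (N_j - n*p_j)^2 / (n*p_j)."""
    n = len(data)
    k = len(edges) - 1
    counts = [0] * k
    for x in data:
        for j in range(k):
            if edges[j] <= x < edges[j + 1]:
                counts[j] += 1            # N_j: observations in interval j
                break
    stat = 0.0
    for j in range(k):
        p_j = cdf(edges[j + 1]) - cdf(edges[j])
        expected = n * p_j                # n * p_j under the fitted distribution
        stat += (counts[j] - expected) ** 2 / expected
    return stat

# Hypothetical data: 100 simulated observations standing in for real ones.
rng = random.Random(7)
data = [rng.expovariate(1 / 10) for _ in range(100)]

beta_hat = sum(data) / len(data)          # fitted exponential mean

def fitted_cdf(x):
    return 1.0 - math.exp(-x / beta_hat)

k = 5
# Equal-probability interval endpoints: each interval has p_j = 1/k.
edges = [-beta_hat * math.log(1 - j / k) for j in range(k)] + [math.inf]

stat = chi_square_statistic(data, fitted_cdf, edges)
critical = 9.488                          # chi-square, 4 degrees of freedom, alpha = 0.05
print("chi-square statistic:", round(stat, 3))
print("reject fitted distribution:", stat > critical)
```

Equal-probability intervals are one common choice. Any set of intervals covering the range of the data could be passed as `edges`.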
24Choosing a Distribution in the Absence of Data
- In some situations it is not possible to collect
data on the random variables of interest.
Examples include:
- A manufacturing system under study that does not
currently exist, or
- An existing system where the number of required
probability distributions is large and the time
available prohibits the necessary data collection
and analysis.
25Choosing a Distribution in the Absence of Data
(Contd)
- Two heuristic approaches for choosing a
distribution in the absence of data involve:
- Using a triangular distribution
- Using a beta distribution
- The first step in using either heuristic is to
identify an interval [a, b] in which it is felt
that X (for example, the time to perform a task)
will lie with probability close to 1.
- In order to obtain subjective estimates of a and
b, experts are asked for their most optimistic
and pessimistic estimates of the time to perform
the task.
26Choosing a Distribution in the Absence of Data
(Contd)
- Using a triangular distribution: In addition to
a and b (minimum and maximum values for the time to
perform a task), the experts are asked to specify
the most likely time to perform the task, denoted
by m.
- The advantage of this approach is that it is
simple and it is usually possible to obtain
estimates for a, b, and m.
- The disadvantage of this approach is that it is
not flexible and may lead to large errors.
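Once a, b, and m have been elicited, sampling from the triangular distribution is straightforward. The sketch below is a minimal illustration using Python's standard library. The values of a, b, and m are hypothetical expert estimates, and the helper names are assumptions.

```python
import random

def sample_task_time(a, b, m, rng=random):
    """Draw one task time from a triangular distribution on [a, b] with mode m."""
    return rng.triangular(a, b, m)   # argument order is low, high, mode

def triangular_mean(a, b, m):
    """Mean of a triangular distribution with minimum a, maximum b, and mode m."""
    return (a + b + m) / 3

a, b, m = 8.0, 12.0, 9.0             # hypothetical expert estimates (minutes)
rng = random.Random(42)              # fixed seed so the run is repeatable
samples = [sample_task_time(a, b, m, rng) for _ in range(10_000)]

print("theoretical mean:", round(triangular_mean(a, b, m), 3))
print("sample mean:     ", round(sum(samples) / len(samples), 3))
```

The mean depends on all three estimates, so an inaccurate pessimistic or optimistic value shifts the whole distribution, which is one way the large errors mentioned above can arise.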