AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS - PowerPoint PPT Presentation

About This Presentation
Title:

AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS

Description:

Of average number of rented campers. A maximum simulation time of 100 s ... In campers example, sequence of RNs used to generate arrivals is affected by the ... – PowerPoint PPT presentation

Number of Views:45
Avg rating:3.0/5.0
Slides: 82
Provided by: www249
Learn more at: https://www2.cs.uh.edu
Category:

less

Transcript and Presenter's Notes

Title: AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS


1
AN INTRODUCTION TO STATISTICAL ANALYSISOF
SIMULATION OUTPUTS
2
Sample Statistics
  • If x1, x2, , xn are n observations of the value
    of an unknown quantity X, they constitute a
    sample of size n for the population on which X is
    defined.
  • Sample mean
  • Sample variance

3
Central Limit Theorem
  • If the n mutually independent random variables
    x1, x2, , xn have the same distribution, and if
    their mean m and their variance s2 exist then

4
Central Limit Theorem
  • The random variable
  • is distributed according to the standard normal
    distribution (zero mean and unit variance).

5
Estimating a mean
  • Assume that we have a sample x1, x2, , xn
    consisting of n independent observations of a
    given population
  • The sample mean xbar is an unbiased estimator of
    the mean m of the population
  • This is the critical assumptionWithout it, we
    cannot apply the formula

6
Large populations
  • For large values of n, the (1-?) confidence
    interval for m is given by
  • with for the standard normal distribution

7
Explanations
  • 1 a expressed in percent is the level of the
    confidence interval
  • 90 means a 0.10
  • 95 means a 0.05
  • 99 means a 0.01
  • a is the error probability
  • Probability that the true mean falls outside the
    confidence interval

8
95 Confidence Intervals
  • For ?.05, za/2 1.96
  • Example
  • If 35, s 4 and n 100
  • The 95 confidence interval for m is
  • 35 1.96x4/?10 35 7.84

9
When s is not known
  • We can replace s in the preceding formula by the
    standard-deviation s of the sample
  • When n lt 30, we must read the value of za/2
    from a table of Student's t-distribution with n -
    1 degrees of freedom
  • When n ? 30, we can use the standard values

10
Confidence Intervals in CSIM
  • CSIM can automatically compute confidence
    intervals for the mean value of any table, qtable
    and so on.
  • For everything but boxes
  • xyx-gtconfidence()
  • For the elapsed times in a box
  • bd-gttime_confidence()
  • For the population of a box
  • bd-gt_number_confidence()
  • We get 90, 95 and 99 percent confidence intervals

11
Confidence Intervals in CSIM
  • We get 90, 95 and 98 percent confidence intervals
  • Computed using batch means method
  • See next section
  • But only for the mean

12
Estimating a proportion
  • A proportion p represents the probabilityP(X??)
    for some fixed threshold ?
  • 97 of our customers have to wait less than one
    minute
  • Confidence intervals for proportions are much
    easier to compute than confidence intervals for
    quantiles

13
Assumptions
  • Assume that we have n independent observations
    x1, x2, , xn of a given population variable X
    and that this variable has a continuous
    distribution
  • Let p represent the proportion we want to
    estimate, say P(X??)
  • Let k represent the number of observations that
    are ??.

14
Basic property
  • If p represent the proportion we want to
    estimate, say P(X??), and k represent the number
    of observations that are ??
  • The rv k is distributed according to a binomial
    distribution with mean np and variance p(1 p)
  • The rv k/n is distributed according to a
    binomial distribution with mean p and variance
    p(1 p)/n

15
The formula
  • When n gt 29. the distribution of k/n is
    approximately normal
  • We then have

16
AUTOCORRELATION
17
The problem
  • All statistical analysis techniques we have
    discussed assume that sample values are mutually
    independent
  • Generally false for quantities such as
  • Waiting times, response times,
  • Tend instead to be autocorrelated
  • When the waiting lines are long, everybody wait a
    long time

18
Traditional solution
  • Keep the measurements sufficiently apart
  • Sample them every T minutes apart
  • Not practical
  • Would require very long simulations

19
Three good solutions
  • Batch means
  • Regenerative method
  • Time series analysis

20
Batch means
  • We group consecutive observations into batches
  • We compute the means of these batches
  • We observe that autocorrelation among batch means
    decreases with size of batches
  • When size increases, each batch includes more
    observations that are far apart

21
Example
  • We collected the following values
  • 4, 3, 3, 4, 5, 5, 3, 2, 2, 3
  • We group them into two batches of five
    observations
  • 4, 3, 3, 4, 5 and 5, 3, 2, 2, 3
  • The batch means are
  • 3.8 and 3

22
Batch means in CSIM
  • CSIM uses fixed-size batches
  • To compute confidence intervals
  • To control the duration of a simulation(run-lengt
    h control)

23
Regenerative method
  • Most systems with queues go through states that
    return it to an state identical to its original
    state
  • The system regenerates itself
  • Examples
  • Whenever a disk array is brought back to its
    original state
  • Whenever a camper rental agency has all its
    campers available

24
Key idea
  • Define a regeneration interval as an interval
    between two consecutive regeneration points
  • Observations collected during the same
    regeneration interval can be correlated
  • Observations collected during different
    regeneration intervals are independent

25
Application
  • We group together observations that occur within
    the same regeneration interval
  • We compute the means of these groups of
    observations
  • These group means are independent from each other

26
Limitations of the approach
  • Not general
  • System must go through regeneration points
  • System must be idle
  • Leads to complex computations
  • We rarely have exactly the same number of
    observations in two different regenerations
    intervals

27
Time series analysis
  • Treats consecutive observations as elements of a
    time series
  • Estimates autocorrelation among the elements of a
    time series
  • Includes this autocorrelation in the computation
    of all confidence intervals

28
RUN LENGTH CONTROL
29
Objective
  • Accuracy of confidence intervals increases with
    duration of simulation
  • The 1/?n factor
  • We would like to be able to stop the simulation
    once a given accuracy level has been reached for
    the confidence interval of a specific measurement

30
The CSIM solution (I)
  • Specify
  • Quantity of interest
  • Mean value from a table, a qtable,
  • A relative accuracy
  • Maximum relative error
  • A confidence level (say, 0.90 to 0.99)
  • A maximum simulation duration
  • In seconds of CPU time

31
The CSIM solution (II)
  • void tablerun_length(double accuracy,double
    conf_level, double max_time)
  • void qtablerun_length(double accuracy,double
    conf_level, double max_time)
  • void meterrun_length(double accuracy,double
    conf_level, double max_time)
  • void boxtime_run_length(double accuracy,double
    conf_level, double max_time)
  • void boxnumber_run_length(double
    accuracy,double conf_level, double max_time)

32
The CSIM solution (III)
  • Example
  • thistable-gtrun_length(.01, .95, 500)
  • Specifies than we want an error less than 1
    percent for 95 confidence interval of mean of
    thistable
  • Stops simulation after 500 seconds
  • thisbox-gttime_run_length(.01, .99, 500)
  • Same for time average of a box

33
The CSIM solution (IV)
  • Replace termination test by
  • converged.wait()

34
Example (I)
  • The campers
  • We want
  • A maximum error of 1 percent (0.01)
  • For the 95 percent confidence interval
  • Of average number of rented campers
  • A maximum simulation time of 100 s
  • Will use number aspect of agency box

35
Example (II)
  • Add
  • agency-gtnumber_confidence()
  • after activation of box agency in csim process

36
Example (III)
  • Put main loop
  • while(simtime() lt DURATION)
    hold(exponential(MIART) customer()
  • in a separate arrivals process
  • Make it an infinite loop

37
Example (IV)
  • Add
  • converged.wait()
  • after call to arrivals process
  • arrivals()converged.wait()
  • Best way to let sim process generate customers
    and wait for terminationin parallel

38
Example (V)
  • The new arrivals process
  • void arrivals() process(arrivals) //
    REQUIRED for() // forever loop
    hold(exponential(MIART)) customer()
    // forever // arrivals

39
Warnings
  • Confidence intervals do not take into account
    model inaccuracies
  • While the batch means method eliminates most
    effects of measurement autocorrelation, it is not
    always 100 effective
  • The max_time parameter of the run_length() will
    not necessarily stop the simulation just after
    the specified CPU time
  • Like the emergency brake of a train

40
Objective
  • Partition processes into different classes
  • Low priority
  • High priority
  • Obtain separate statistics for each process
    priority class

41
Declaring and initializing process classes
  • To declare a dynamic process class
  • process_class c
  • To initialize a process class before it can be
    used in any other statement.
  • c new process_class("low priority")

42
To assign a priority class
  • Add inside the process
  • c-gtset_process_class()
  • Processes that have not been assigned a process
    class belong to the default process class

43
Reporting
  • Can use
  • report()
  • report_classes()

44
Other options
  • Can change the name of a process class
  • c-gtset_name("high priority")
  • Can reset statistics associated with a process
  • c -gtreset()
  • Can do the same for all process classes
  • reset_process_classes()
  • Can delete a dynamic process class
  • delete c

45
RANDOM NUMBERS
46
Background
  • We need to distinguish between
  • Truly random numbers
  • Obtained through observations of a physical
    random process
  • Rolling dices, roulette
  • Atomic decay
  • Pseudo-random numbers
  • Obtained through arithmetic operations

47
Example
  • Linear Congruential Generators (LCG)
  • Easy to implement and fast
  • Defined by the recurrence relation
  • rn1 (a rn c ) mod m
  • where
  • r1, r2, are the random values
  • m is the "modulus
  • r0 is the seed

48
Two realizations
  • GCC family of compilers
  • m 232 a 69069 c 5
  • Microsoft Visual/Quick C/C
  • m 232 a 214013 c 2531011

49
Problems with pseudorandom number generators
  • Much shorter periods for some seed states
  • Lack of uniformity of distribution
  • Correlation of successive values

50
Better RNGs
  • Use the Mersenne twister
  • Period is 219937 - 1
  • Blum-Blum-Schub

51
A quote
  • "Any one who considers arithmetical methods of
    producing random digits is, of course, in a state
    of sin. For, as has been pointed out several
    times, there is no such thing as a random number
    there are only methods to produce random numbers,
    and a strict arithmetic procedure of course is
    not such a method. John von Neumann

52
CSIM RNGs
  • BY default, CSIM uses a single stream of random
    numbers
  • Can reset the seed using
  • void reseed(stream s, long n)
  • as in
  • reseed(NIL, 13579)

53
Continuous distributions supported by CSIM (I)
  • double uniform(double min, double max)
  • double triangular(double min, double max,double
    mode)
  • double beta(double min, double max,
    doubleshape1, double shape2)
  • double exponential(double mean)
  • double gamma(double mean, double stddev)
  • double erlang(double mean, double var)

54
Continuous distributions supported by CSIM (II)
  • double hyperx(double mean, double var)
  • double weibull(double shape, double scale)
  • double normal(double mean, double stddev)
  • double lognormal(double mean, double stddev)
  • double cauchy(double alpha, double beta)
  • double hypoexponential(double mn, double var)

55
Continuous distributions supported by CSIM (III)
  • double pareto(double a)
  • double zipf(long n)
  • double zipf_sum(long n, double sum)

56
Discrete distributions supported by CSIM
  • long uniform_int(long min, long max)
  • long bernoulli(double prob_success)
  • long binomial(double prob_success,
    longnum_trials)
  • long geometric(double prob_success)
  • long negative_binomial(long success_num,double
    prob_success)
  • long poisson(double mean)

57
Empirical distributions supported by CSIM
  • Not clear enough

58
Using multiple streams
  • In campers example, sequence of RNs used to
    generate arrivals is affected by the numbers of
    campers
  • If agency has less campers
  • More customers will be lost
  • Lost customers do no generate any RN
  • Better to have separate random number streams for
    arrivals and service times

59
Declaring and initialing random number streams
  • Can use
  • stream ss new stream()
  • By default, streams are created with seeds that
    are spaced 100,000 values apart

60
Reseeding a stream
  • Use
  • s-gtreseed(24680)
  • where the new seed is a long integer

61
Other functions
  • Inspect the current state of a stream
  • i s-gtstate()
  • Delete a stream
  • delete s

62
Using a specific seed
  • Prefix RNG function with name of seed
  • s-gtuniform (3.0, 7.0)

63
A CASE STUDY
64
RAID array revisited
  • Reliability of a RAID array
  • Reliability R(t) of a system is the probability
    that will remain operational over a time interval
    0. t given that it was operational at time t
    0
  • Not the same as availability
  • Our focus is evaluating the risk of a data loss
    during array lifetime

65
Executing multiple runs
  • Multiple runs provide statistically independent
    repetitions of original simulation
  • Useful for
  • Collecting more accurate results
  • Constructing confidence intervals
  • Use rerun() function within a loop
  • create("sim") call must be inside that loop

66
Overview of sim function
  • extern "C" void sim() // sim process
  • runcount 0
  • while(runcount lt NRUNS)
  • create("sim") // make it a process
  • // usual contents of sim process
  • rerun()
  • runcount
  • // while
  • report_hdr() // produce statistics
    report
  • // sim

67
Global includes and defines
  • include ltcpp.hgt // CSIM C header file
  • define NDISKS 5 // number of disks in array
  • define NYEARS 5 // useful lifetime of array
  • define MTTF 300000.0 // disk MTTF
  • define MTTR 24.0 // disk MTTR
  • define NRUNS 100000 // no of runs
  • void disk(int i)

68
Global declarations
  • int nfailed // number of failed disks
  • int runcount 0 // number of runs
  • int ndatalosses 0 //counter
  • double lifetime NYEARS 365 24
  • // simulation duration

69
Sim function (I)
  • extern "C" void sim() // sim process
  • int i
  • runcount 0
  • while(runcount lt NRUNS)
  • create("sim") // make this a process
  • dataloss new event("dataloss")
  • dataloss-gtclear()
  • nfailed 0

70
Sim function (II)
  • // create NDISKS disk processes
  • for (i0 i lt NDISKS i)
  • disk(i)
  • // for
  • dataloss-gttimed_wait(lifetime)
  • if (simtime() lt lifetime)
  • ndatalosses
  • // if
  • rerun()
  • runcount
  • // while

71
Sim function (III)
  • report_hdr() // produce statistics report
  • printf("Array lifetime fd years",
  • NYEARS)
  • printf(or f hours\n", lifetime)
  • printf("Sim time f\n", simtime())
  • printf("Disk MTTF f hours\n", MTTF)
  • printf("Disk MTTR ff hours\n", MTTR)
  • printf(Completed runs d\n", runcount)
  • printf(Data lossesed\n", ndatalosses)
  • // sim

72
Disk process (I)
  • void disk(int i)
  • create("disk")
  • while(simtime() lt lifetime dataloss)
  • hold(exponential(MTTF))
    // disk failed
  • nfailed

73
Disk process (II)
  • if (nfailed 2)
  • dataloss 1
  • failtime simtime()
  • finish-gtset()
  • terminate()
  • // if
  • hold(MTTR) // repair process
  • nfailed-- // disk is replaced
  • // while
  • // disk

74
Simulation Outcome
  • C/CSIM Simulation Report (Version
    19.0 for Linux x86)
  • Tue Apr 22 101314
    2008
  • Ending simulation time
    0.000
  • Elapsed simulation time
    0.000
  • CPU time used (seconds)
    0.000
  • Array lifetime 5 years
  • which corresponds to 43800 hours
  • Simulated time 0
  • Disk MTTF 300000 hours
  • Disk MTTR 24 hours
  • Number of runs completed 100000
  • Number in runs ending in data loss 17

75
Discussion
  • Whole program took less than 98 seconds on
    linux03 server
  • Simulated time should be equal to43,800 hours,
    not zero.
  • rerun() artifact?
  • We observe 17 failures out of 100,000 runs
  • Data survival rate is 99.983 percent
  • Three nines

76
Confidence intervals
  • Data loss rate and survival rate are distributed
    according to a binomial law
  • Since n 100,000, the distributions of both
    proportions are approximately normal
  • Should use

77
CI for data survival rate s
  • s 99.983 or 0.99983
  • We compute
  • ?(s(1 s)/n)
  • ?(0.00017 x 0.99987)/100000) 0.0000412
  • 95 CI is
  • 0.99983 1.96 x 0.0000412
  • 0.99983 0.000081
  • 0.999749, 0.99911

78
CI for data loss rate
  • 0.000017 0.000081
  • 0.000089, 0.000251

79
Extensions (I)
  • Other repair time distributions
  • Exponential ( to compare with results of
    stochastic analysis)
  • Ad hoc (80 of repairs within one day, 20 within
    two days)
  • Taking accounts of differences between day and
    night, weekdays and weekends
  • Will map simtime() into a calendar

80
Extensions (II)
  • Other disk arrays
  • Mirrored disks NDISKS 2
  • RAID level 6 tolerate two disk failures
  • Triplicate disks tolerate failure of two out of
    three disks
  • Two-dimensional RAID arrays
  • See next slide
  • Will require more work

81
Two-dimensional RAID arrays
Write a Comment
User Comments (0)
About PowerShow.com