1
Engineering Statistics
The University of Reading
School of Systems Engineering
  • C. G. Guy

2
Introduction
  • course should really be called probability and
    statistics
  • both are about DATA
  • facts given, from which others may be inferred
  • statistics is about analysing measured data
  • the issues are how to collect the data and how
    much to collect so that valid conclusions can be
    drawn
  • probability is about predicting what future
    values are likely to be
  • given a model for the possible values which can
    occur
  • model may come from previously measured values or
    from theoretical considerations

3
Relationship between probability and statistics
4
How can we measure data?
  • measures fall into 2 categories
  • average value
  • dispersion (or spread)
  • both give us clues to the model for the data but
    both can hide key features of data set
  • may not be enough samples to be truly
    representative
  • can be distorted by rogue or outlying values
  • measures used for the average value are the
    mean, median and mode
  • measures used for the dispersion are the range,
    variance and standard deviation

5
Definitions of average
  • Mean
  • the arithmetic average of all the data values
  • Median
  • the middle value of the data set, when the values
    are placed in order
  • if there is an even number of values, then the
    median is half-way between the two middle values
  • Mode
  • the most commonly occurring value
  • some data sets may have more than one mode

6
A sample set of data
  • the speed of cars passing a certain point within
    a 30 mph zone
  • 31,29,28,24,29,35,37,27,29,30,32,30,29,20,45,27,35
    , 34,33,29
  • what is the mean?
  • what is the median?
  • what is the mode?
  • do these figures change if the outliers are removed?
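
The three averages for this data set can be checked with a short standard-library Python sketch (the list below is the data from this slide); delete the outlying values and re-run to see how each measure changes.

```python
# A minimal sketch of the three averages for the car-speed data.
import statistics

speeds = [31, 29, 28, 24, 29, 35, 37, 27, 29, 30,
          32, 30, 29, 20, 45, 27, 35, 34, 33, 29]

print("mean  :", statistics.mean(speeds))
print("median:", statistics.median(speeds))
print("mode  :", statistics.mode(speeds))
```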

7
Data set
  • it helps if the data is presented in another form
  • this is a histogram and gives a much clearer
    picture of what is going on
  • the graphing tool in a spreadsheet can give lots
    of different ways of looking at the same data

8
Questions to ask about this data set
  • if we took another sample set, of the same number
    of cars passing the same point, would the same
    data be found?
  • would the average values and spread be the same?
  • are there enough samples to give a true picture
    of the data?
  • could we use this data set to predict the speed
    of the next car?
  • can we go from statistics to probability and back
    again?

9
Measures of spread
  • Range
  • the difference between the smallest and largest
    data values
  • x_n − x_1 (for data sorted into ascending order)
  • Inter-quartile range
  • each quartile is defined by dividing the number
    of data points into 4 equal parts
  • the inter-quartile range is the difference
    between the 1st and 3rd quartile values
  • Variance
  • first need to calculate the residuals, x_i − x̄
  • then the variance is given by s² = Σ(x_i − x̄)² / (n − 1)

10
Measures of spread (continued)
  • because the variance is measured in (units)², it is more common to use the
    standard deviation as a measure of spread
  • the standard deviation is given the abbreviation s or σ depending on which
    definition you use

11
Measures of spread (cont)
  • s (the divide-by-(n−1) definition) is preferred as the definition of the
    standard deviation, although it is not intuitively the most sensible
  • the reasons are beyond the scope of an introductory course
  • σ is used when the standard deviation of the whole population is being
    calculated
  • there are also other definitions of variance
  • divide by n or n−1
  • if the number of samples is big enough, the calculated values of s and σ
    will be very similar
  • a good reason for taking a large number of
    samples
  • what are the inter-quartile range, variance and
    standard deviation of the data given before?
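
As a rough check on that question, the standard-library sketch below computes the range, inter-quartile range and both variance/standard-deviation definitions for the car-speed data; note that statistics.quantiles uses one particular quartile convention, so the IQR may differ slightly from a hand calculation.

```python
# Measures of spread for the car-speed data, standard library only.
import statistics

speeds = [31, 29, 28, 24, 29, 35, 37, 27, 29, 30,
          32, 30, 29, 20, 45, 27, 35, 34, 33, 29]

data_range = max(speeds) - min(speeds)
q1, q2, q3 = statistics.quantiles(speeds, n=4)   # quartile cut points
iqr = q3 - q1

# "divide by n" (population) versus "divide by n-1" (sample) definitions
print("range               :", data_range)
print("inter-quartile range:", iqr)
print("variance (n)        :", statistics.pvariance(speeds))
print("variance (n-1)      :", statistics.variance(speeds))
print("std dev  (n)        :", statistics.pstdev(speeds))
print("std dev  (n-1)      :", statistics.stdev(speeds))
```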

12
Box plots
  • A good way of providing visual information about
    a data set is to use a box plot
  • it can help you identify the centre, spread,
    departure from symmetry and any significant
    outliers
  • the box encloses the inter-quartile range, with a line extending from each
    end of the box to the most extreme data point within 1.5 inter-quartile
    ranges of the first (or third) quartile
  • data outside this range are plotted as individual
    points

[Box plot diagram: box from the 1st to the 3rd quartile, line at the 2nd quartile (median), whiskers, and outliers plotted as individual points]
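
A box plot of the car-speed data can be drawn in a couple of lines, assuming matplotlib is available; the whis=1.5 argument reproduces the 1.5 inter-quartile-range rule described above.

```python
# Sketch of a box plot for the car-speed data (assumes matplotlib installed).
# Points beyond 1.5 IQR of the box are drawn individually as outliers.
import matplotlib.pyplot as plt

speeds = [31, 29, 28, 24, 29, 35, 37, 27, 29, 30,
          32, 30, 29, 20, 45, 27, 35, 34, 33, 29]

plt.boxplot(speeds, whis=1.5)   # whis=1.5 is the rule described on the slide
plt.ylabel("speed (mph)")
plt.title("Car speeds in a 30 mph zone")
plt.show()
```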
13
Models for data
  • data values which we measure can fall into a
    variety of categories
  • they can be discrete or continuous
  • they can be random or dependent
  • which categories do the following fit into?
  • tossing a coin
  • drawing cards from a pack
  • height of men living in England
  • width of leaves on an oak tree
  • votes at a general election
  • can you give an example of a data set which could be classified as
    continuous and dependent?

14
Samples and Populations
  • in many situations it is impossible (or
    undesirable) to measure the full set of data
  • opinion polls before a general election
  • width of leaves on a tree
  • a sample must be taken from the whole population
  • much of the science of statistics and probability
    is concerned with determining what an appropriate
    sample is
  • how many samples and how accurate do they need to be?
  • what confidence do we have that our sample
    represents the true situation?
  • can we predict future values given the sampled
    values?

15
Probability
  • probability is always taken as a number lying
    between 0 and 1 and is denoted by p(x)
  • p(x) = 1 means that an event is certain to happen
  • p(x) = 0 means that an event is certain NOT to happen
  • so a toss of a coin could be represented as
  • P(heads) = 0.5 and P(tails) = 0.5
  • or, more formally
  • p(x) = 0.5, x ∈ {heads, tails}
  • NOTE the use of P for an individual event and p for a function

16
Probability (cont)
  • if we want to be pedantic then a small
    possibility of the coin falling on its edge
    exists, so
  • p(x) = 0.4999, x ∈ {heads, tails}
  • p(x) = 0.0002, x = edge
  • the sum of all the p(x) values MUST be 1 as
    something must happen when a coin is tossed
  • it must land as heads, tails or on its edge
  • how would we write the probabilities associated
    with rolling a dice?

17
Rules of probability
  • there are some rules for combining probabilities when the events are not
    dependent on each other
  • P(A or B) = P(A ∪ B) = P(A) + P(B) (for mutually exclusive events)
  • P(A and B) = P(A ∩ B) = P(A)P(B) (for independent events)
  • for example, rolling a dice multiple times, or rolling two dice
  • if the events are not dependent but they are not mutually exclusive then
  • P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
  • how can this be extended to more than two
    possible events?
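
A small sketch of these rules for a single fair die, using exact fractions; the events A (an even roll) and B (a roll of at least 5) are illustrative choices, not from the slides.

```python
# Addition and multiplication rules for one fair die.
# A and B below are not mutually exclusive (6 belongs to both), so the
# general addition rule P(A or B) = P(A) + P(B) - P(A and B) is needed.
from fractions import Fraction

outcomes = range(1, 7)
A = {x for x in outcomes if x % 2 == 0}          # {2, 4, 6}
B = {x for x in outcomes if x >= 5}              # {5, 6}

def P(event):
    return Fraction(len(event), 6)

print("P(A or B)  =", P(A | B))
print("check      =", P(A) + P(B) - P(A & B))

# independent events: two separate rolls both showing a six
print("P(6 then 6) =", Fraction(1, 6) * Fraction(1, 6))
```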

18
2-Dice
19
Conditional probability
  • if the events are dependent then these rules do
    not apply - the rules of conditional probability
    must be invoked
  • P(A|B) = P(A ∩ B) / P(B)
  • for example, drawing a card from a pack
  • generally, if events (or instances) are
    conditional on past events or other instances,
    then probability theory becomes much more complex
  • we must revise the rules of permutations and
    combinations
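
A minimal sketch of the conditional-probability rule for the card example, with illustrative events A (the ace of spades) and B (any spade):

```python
# P(A|B) = P(A and B) / P(B) for drawing one card from a 52-card pack.
from fractions import Fraction

P_B = Fraction(13, 52)        # 13 spades in the pack
P_A_and_B = Fraction(1, 52)   # only one ace of spades

P_A_given_B = P_A_and_B / P_B
print("P(ace of spades | spade) =", P_A_given_B)   # 1/13
```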

20
Permutations and Combinations
  • multiplication rule
  • if you are drawing one element from each of k sets, with the sizes of the
    sets n1, n2, n3, …, nk, then the number of different possible outcomes is
    n1 × n2 × n3 × … × nk
  • permutations rule
  • if you are drawing k elements from a set of n elements and arranging the k
    elements in a distinct order, the number of different possible results is
    n! / (n − k)!

21
Permutations and combinations (cont)
  • partitions rule
  • if you are partitioning the elements of a set of n elements into k groups,
    consisting of n1, n2, n3, …, nk elements, the number of different results
    is n! / (n1! n2! … nk!)
  • combinations rule
  • if you are drawing k elements from a set of n elements, without regard to
    the order of the k elements, the number of different possible results is
    n! / (k! (n − k)!)
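
The four counting rules can be evaluated directly with the Python standard library (Python 3.8+); the set sizes used below are illustrative.

```python
# Counting rules with the standard library.
import math

# multiplication rule: one element from each of k sets of sizes n1..nk
print("multiplication:", 4 * 6 * 3)              # sets of size 4, 6 and 3

# permutations: draw k from n, order matters -> n! / (n-k)!
print("permutations  :", math.perm(10, 3))

# combinations: draw k from n, order ignored -> n! / (k! (n-k)!)
print("combinations  :", math.comb(10, 3))

# partitions: split n items into groups of n1..nk -> n! / (n1! n2! ... nk!)
n, groups = 10, [5, 3, 2]
partitions = math.factorial(n) // math.prod(math.factorial(g) for g in groups)
print("partitions    :", partitions)
```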

22
Probability mass function (p.m.f)
  • if a random variable is discrete, its possible
    states and their associated probabilities can be
    modelled by a p.m.f
  • this is a mathematical expression (often shown in
    diagrammatic form) which covers all possibilities
  • for example, a test of light-bulbs after 800
    hours use could show
  • p(x) = 0.8, x = working
  • p(x) = 0.2, x = failed

23
Mean of a discrete random variable
  • for a discrete random variable with p.m.f P(X = x), the mean of X
  • also called the expected value of X, or E(X), is given by
    E(X) = Σ x P(X = x)
  • for example, rolling a dice or tossing a coin
    multiple times
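
A minimal sketch of E(X) = Σ x P(X = x) for one fair die:

```python
# Expected value of a fair die roll, using exact fractions.
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}
expected = sum(x * p for x, p in pmf.items())
print("E(X) =", expected)    # 7/2
```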

24
The Bernoulli probability model
  • if a random variable can only take one of two
    values (which could be denoted by 0 and 1) then
    the values are said to be Bernoulli random
    variables and any observation of the variable is
    said to be a Bernoulli trial
  • examples might be tossing a coin, whether a component has failed, whether
    a road is open, etc.
  • clearly if P(1) = p then P(0) = 1 − p
  • this is usually written in the form p(x) = p^x (1 − p)^(1−x), x ∈ {0, 1}

25
The binomial probability model
  • this is used when there are a set of Bernoulli
    trials which are independent of each other
  • for example drug trials
  • if a set of n independent Bernoulli trials each
    has an identical probability of success, p, then
    the random variable, Y , defined as the total
    number of successes over all the trials is said
    to follow a binomial distribution with parameters
    n and p.
  • this is written as Y ~ B(n, p)
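
A sketch of the binomial p.m.f., P(Y = y) = C(n, y) p^y (1 − p)^(n−y); the values n = 10 and p = 0.2 are illustrative.

```python
# Binomial p.m.f. for n independent Bernoulli trials with success prob p.
import math

def binomial_pmf(y, n, p):
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

n, p = 10, 0.2
for y in range(n + 1):
    print(f"P(Y = {y:2d}) = {binomial_pmf(y, n, p):.4f}")
```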

26
Cumulative distribution function of the binomial
model
  • it is often necessary to determine the
    probability associated with a random variable
    being less than or greater than a given value
  • for example, predicting the number of faulty
    components in an individual batch
  • this can be determined using a cumulative
    distribution function (c.d.f)
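
A sketch of the binomial c.d.f. obtained by summing the p.m.f.; the batch size and fault probability below are illustrative, not taken from the slides.

```python
# Binomial c.d.f. by summing the p.m.f.: e.g. at most 2 faulty components
# in a batch of 10 when each is faulty with probability 0.05.
import math

def binomial_cdf(y, n, p):
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(y + 1))

print("P(Y <= 2) =", round(binomial_cdf(2, 10, 0.05), 4))
print("P(Y >  2) =", round(1 - binomial_cdf(2, 10, 0.05), 4))
```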

27
Mean of a binomial distribution
  • using the definition given before for the mean (or expected value) of a
    discrete random variable, the mean of a binomial distribution is
    E(Y) = Σ y C(n, y) p^y (1 − p)^(n−y), summed over y = 0 to n
  • this looks complex, but can be shown to reduce to E(Y) = np
28
Probability density functions
  • what happens when the variable is continuous as
    well as random
  • a probability model must be constructed by taking
    a sufficient number of samples to model the whole
    population
  • this leads to the construction of a probability
    density function
  • there are many possible forms for the p.d.f
  • a common model for random, continuous variables is called the normal p.d.f
  • this has the form of a bell-shaped curve and is scaled so that the total
    area underneath it is 1
  • i.e. ∫ p(x) dx = 1 over the whole range of x

29
The normal p.d.f
  • is defined by p(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
  • where the parameter μ is the population mean and the parameter σ is the
    population standard deviation
  • note the use of σ for the standard deviation - the definition implies that
    we know the characteristics of the whole population

30
The normal p.d.f
31
Probabilities associated with a normal p.d.f
[Graph of the normal p.d.f p(x) against x, with the area under the curve between x1 and x2 shaded to represent P(x1 < X < x2)]
32
The standard normal distribution
  • as the integrals associated with the normal distribution are difficult to
    solve in closed form, it is usual to tabulate values associated with a
    standard normal distribution (μ = 0, σ = 1)
  • and to use the transform z = (x − μ) / σ
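
Instead of printed tables, statistics.NormalDist (Python 3.8+) can play the role of the standard normal table; the population parameters and the limits x1, x2 below are assumed for illustration.

```python
# The z-transform plus the standard normal c.d.f. in place of printed tables.
from statistics import NormalDist

mu, sigma = 30.0, 5.0          # assumed (illustrative) population parameters
x1, x2 = 25.0, 35.0

z1 = (x1 - mu) / sigma
z2 = (x2 - mu) / sigma

std_normal = NormalDist(0, 1)
prob = std_normal.cdf(z2) - std_normal.cdf(z1)   # area between z1 and z2
print(f"P({x1} < X < {x2}) = {prob:.4f}")
```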

33
The mean of a continuous random variable
  • for a continuous random variable X with p.d.f. p(x) over a specified
    range, the mean or expected value of X is given by E(X) = ∫ x p(x) dx
  • for the normal p.d.f, this can be shown to reduce to μ, which is what we
    defined it to be in the first place!

34
The variance of a random variable
  • if the variable X is random and discrete then the variance, Var(X) = σ²,
    is given by Var(X) = Σ (x − μ)² P(X = x)
  • for a binomial distribution, this reduces to Var(X) = npq
  • if the variable X is random and continuous then the variance is given by
    Var(X) = ∫ (x − μ)² p(x) dx
  • in both cases, the standard deviation is σ - the square root of the
    variance

35
The Poisson distribution
  • there are many situations where the individual
    probabilities of events occurring are unknown
    (assuming each event is a discrete, random
    variable)
  • there are many situations when an event is rare -
    there are a large number of samples (n) and the
    probability of occurrence of the event (p) is low
  • failures of electronic components
  • defects in manufacturing
  • arrival of calls for a particular number at a telephone exchange
  • number of accidents in a factory
  • these situations can be dealt with by using the Poisson distribution

36
The Poisson distribution
  • the Poisson distribution is given by P(X = x) = e^(−m) m^x / x!,
    x = 0, 1, 2, …
  • where m is the mean or expected value of X
  • as well as its other uses, an approximation to the true binomial
    distribution can be found from the Poisson distribution
  • remember that, for the binomial distribution, E(X) = np
  • hence B(n, p) ≈ Poisson(np) and B(n, m/n) ≈ Poisson(m)
  • a feature of the Poisson distribution is that its variance is equal to its
    mean, i.e. V(X) = E(X) = m
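
A sketch comparing the Poisson p.m.f. with the binomial it approximates, B(n, m/n) for large n; the values m = 2 and n = 1000 are illustrative.

```python
# Poisson p.m.f. and the binomial approximation B(n, m/n) ~ Poisson(m).
import math

def poisson_pmf(x, m):
    return math.exp(-m) * m**x / math.factorial(x)

def binomial_pmf(y, n, p):
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

m, n = 2.0, 1000               # e.g. an average of 2 rare events per batch
for x in range(5):
    print(x, round(poisson_pmf(x, m), 4), round(binomial_pmf(x, n, m / n), 4))
```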

37
Taking random samples
  • all our work on distributions so far has
    concerned entire populations.
  • Most of the time we will only have a sample (or
    possibly, several independent samples) of the
    whole population.
  • How do the parameters of the population (μ, σ²) relate to the parameters
    of a random sample (x̄, s²)?
  • How much belief can we have that the results of a
    sample are a true reflection of the population as
    a whole?

38
Random samples
  • If random samples (X1, X2, …, Xn) are taken from a given population
    (mean μ, variance σ²), then the sample mean X̄ satisfies E(X̄) = μ
  • the variance of the sample mean will follow the rule Var(X̄) = σ²/n

39
Central Limit Theorem
  • hence, if a random sample of size n is taken from a normal population,
    with mean μ and variance σ², then the sample mean X̄ is normally
    distributed with mean μ and variance σ²/n
  • this can be generalised to the central limit theorem (c.l.t), which states
    that however the original population is distributed, the sample mean X̄ is
    approximately N(μ, σ²/n) for large n
  • a consequence of the c.l.t is that the total of the samples,
    Tn = X1 + X2 + … + Xn, is also normally distributed
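
A small simulation sketch of the central limit theorem: the mean of n die rolls (a far-from-normal population) has mean close to μ and variance close to σ²/n.

```python
# Simulated sample means of n die rolls cluster around the population mean
# with variance roughly sigma^2 / n, as the c.l.t predicts.
import random
import statistics

random.seed(1)
n, trials = 30, 2000
sample_means = [statistics.mean(random.randint(1, 6) for _ in range(n))
                for _ in range(trials)]

print("mean of sample means    :", round(statistics.mean(sample_means), 3))
print("variance of sample means:", round(statistics.variance(sample_means), 3))
print("sigma^2 / n prediction  :", round(statistics.pvariance(range(1, 7)) / n, 3))
```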

40
Central Limit Theorem (cont)
  • if X is binomial B(n, p) then the distribution of X can be approximated by
    a normal model, N(np, npq)
  • where q = 1 − p. The approximation is useful when both np and nq are
    over 5
  • if X is Poisson(m) then the distribution of X can be approximated by a
    normal model, N(m, m)
  • provided m is at least 30
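
A sketch checking the normal approximation to the binomial, using N(np, npq) with a 0.5 continuity correction; n = 50 and p = 0.3 are illustrative values.

```python
# Exact binomial probability versus its normal approximation N(np, npq).
import math
from statistics import NormalDist

n, p = 50, 0.3
q = 1 - p
exact = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(13))  # P(X <= 12)
approx = NormalDist(n * p, math.sqrt(n * p * q)).cdf(12.5)           # continuity corr.

print("exact  P(X <= 12) =", round(exact, 4))
print("normal approx     =", round(approx, 4))
```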

41
Consequence of the Central Limit Theorem
  • suppose we have two independent populations which we need to compare
  • we can say that X̄1 ~ N(μ1, σ1²/n1)
  • and X̄2 ~ N(μ2, σ2²/n2)
  • consequently X̄1 − X̄2 ~ N(μ1 − μ2, σ1²/n1 + σ2²/n2)

42
Confidence testing
  • whenever a sample is taken we want to know
    whether it provides a reasonable estimate of the
    population as a whole
  • this can be thought of as a subjective test
  • do you think that a 1% or a 5% chance of being wrong is acceptable?
  • a consequence of the central-limit-theorem is
    that we can use the standard normal tabulations
    to give a good idea of the confidence we can have
    in samples

43
Significance Tests
  • rules are
  • if the population is N(μ, σ²) or can be approximated by it, then the
    sampling distribution of the mean is N(μ, σ²/n)
  • null hypothesis H0: μ = μ0
  • alternative hypothesis H1: μ > μ0 or μ < μ0 or μ ≠ μ0
  • critical test parameter α
  • α = 5%: reasonable evidence
  • α = 1%: strong evidence
  • α = 0.5%: very strong evidence
  • the hypotheses and the value of α are to be decided before the test

44
Confidence intervals
  • e.g. μ ≠ μ0
  • the values of z which correspond to the
    particular confidence level we require can be
    found from the table
  • then we can determine if our sample values fall
    within these limits

45
Significant values of zα
  • these values (and any others, if needed) can be
    determined from the tabulated values of the
    standard normal distribution

46
Significance testing (cont)
  • need to compute the critical value of z corresponding to the chosen level
    of significance, α ⇒ zα
  • then need to compute the test statistic, e.g. for the mean,
    z = (x̄ − μ0) / (σ/√n)
  • compare z and zα
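
A sketch of the whole procedure as a one-sample z-test, assuming σ is known; the hypothesised mean μ0 = 30 and σ = 5 are illustrative, and the sample is the car-speed data from earlier.

```python
# One-sample z-test sketch with a known population standard deviation.
import math
from statistics import NormalDist

mu0, sigma = 30.0, 5.0          # illustrative null mean and known sigma
sample = [31, 29, 28, 24, 29, 35, 37, 27, 29, 30,
          32, 30, 29, 20, 45, 27, 35, 34, 33, 29]
n = len(sample)
x_bar = sum(sample) / n

z = (x_bar - mu0) / (sigma / math.sqrt(n))
alpha = 0.05
z_alpha = NormalDist(0, 1).inv_cdf(1 - alpha / 2)   # two-sided critical value

print(f"z = {z:.3f}, critical value = +/-{z_alpha:.3f}")
print("reject H0" if abs(z) > z_alpha else "do not reject H0")
```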

47
The Student's t-distribution
  • in many real cases the value of the population variance (σ²) is unknown,
    so the sample variance (s²) must be used as an estimate
  • this means that we can no longer use the standard normal variate z (as
    this needs σ in its transform) but must define a new variate
    t = (x̄ − μ) / (s/√n)
  • this now contains two random variables (x̄ and s) and the distribution of
    t will depend on the number of samples
  • these are known as Student's t-distributions and they are indexed by a
    parameter, called the degrees of freedom, ν, where ν = n − 1
  • tables of t for various values of ν and significance levels can be found
    in textbooks or by using a spreadsheet
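
A sketch of computing the t statistic when σ is unknown; the critical value would then be read from a t-table with ν = n − 1 degrees of freedom (the hypothesised mean below is illustrative).

```python
# t statistic using the sample standard deviation s (n-1 definition).
import math
import statistics

mu0 = 30.0                                # illustrative hypothesised mean
sample = [31, 29, 28, 24, 29, 35, 37, 27, 29, 30,
          32, 30, 29, 20, 45, 27, 35, 34, 33, 29]
n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)              # divide-by-(n-1) definition

t = (x_bar - mu0) / (s / math.sqrt(n))
print(f"t = {t:.3f} with nu = {n - 1} degrees of freedom")
```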

48
The chi-squared distribution
  • it is useful to be able to predict the
    distribution of the sample variance (s2)
  • just as the central-limit-theorem allowed us to
    predict the distribution of a number of sample
    means
  • this is stated here (without proof) in the
    following way
  • the distribution of s² (the sample variance) follows the chi-squared (χ²)
    distribution, such that (n − 1)s²/σ² has a χ² distribution with n − 1
    degrees of freedom
  • the chi-squared distributions are NOT bell-shaped and are best calculated
    using a spreadsheet
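
A sketch of the corresponding test statistic for a variance, (n − 1)s²/σ0², which would be compared against tabulated χ² values; the hypothesised variance σ0² = 25 is illustrative.

```python
# Chi-squared statistic for a hypothesised population variance sigma0^2.
import statistics

sigma0_sq = 25.0                          # illustrative hypothesised variance
sample = [31, 29, 28, 24, 29, 35, 37, 27, 29, 30,
          32, 30, 29, 20, 45, 27, 35, 34, 33, 29]
n = len(sample)
s_sq = statistics.variance(sample)        # sample variance, divide by n-1

chi_sq = (n - 1) * s_sq / sigma0_sq
print(f"chi-squared statistic = {chi_sq:.3f} with {n - 1} degrees of freedom")
```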

49
Statistics and Probability
  • there is a lot more to this subject than we have
    time to cover, for example
  • how are outlying values dealt with?
  • how do you decide which model fits your data
    best?
  • sum up by returning to the relationship between probability and statistics
    (slide 3)