The Normal Distribution - PowerPoint PPT Presentation

About This Presentation
Title:

The Normal Distribution


Slides: 39
Provided by: osirisSun

Transcript and Presenter's Notes

Title: The Normal Distribution


1
The Normal Distribution
  • Forensic Statistics

2
The Normal Distribution
  • A theoretically appealing model to explain many
    forms of natural continuous variation.
  • Many discrete distributions may be approximated
    by the normal distribution for large samples.
  • Many continuous variables, especially from the
    biological sciences, are approximately normally
    distributed, e.g. the length of the femur (thigh
    bone) and the length of the tibia (shin bone) in
    adult humans.

3
Probability density functions (PDF) for adult
human femurs and tibias. The mean and standard
deviation of the femur PDF are arrowed
4
Comparison between the PDFs for femurs and tibias
  • The tibia is usually shorter than the femur, but
    some people have tibias which are longer than
    other people's femurs.
  • The mean of the tibia lengths is shorter than the
    mean of the femur lengths.
  • Femur lengths are more spread out than tibia
    lengths, so the PDF for tibias is more peaked.
  • Each normal distribution has two parameters which
    control what it looks like: (1) the mean, the
    location about which the distribution is
    centred; (2) the standard deviation, the
    dispersion parameter.

5
The standard deviation is the square root of the
sample variance
6
The standard error of the mean
  • Another measure is the standard error of the mean
    (or simply standard error), which arises from the
    fact that x̄ is only an estimate of the population
    mean. Were we to sample repeatedly from the same
    population, the means of the samples would be
    unlikely to be the same in each case, so our
    estimates of the population mean would also
    vary. The size of this variability is measured by
    the standard error of the mean (SEM).
  • SEM(x̄) = SD / √n
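As a minimal sketch of the two measures above, the following Python snippet computes the sample standard deviation and the SEM. The five measurements are made-up illustrative values, not data from the presentation.

```python
import math
import statistics

# Hypothetical measurements (illustrative only, not taken from the slides).
sample = [8.1, 9.4, 7.6, 8.9, 10.0]

n = len(sample)
mean = statistics.mean(sample)

# Sample SD: square root of the sample variance (n - 1 divisor).
sd = statistics.stdev(sample)

# Standard error of the mean: SD divided by the square root of n.
sem = sd / math.sqrt(n)

print(f"mean = {mean:.2f}, SD = {sd:.3f}, SEM = {sem:.3f}")
```

The SEM shrinks as n grows, while the SD does not, which is why the two measures answer different questions.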

7
Empirical Probability Density for the Δ9-THC
content of marijuana samples from 1986
8
Probability Density Function superimposed on the
histogram of Δ9-THC in marijuana seizures from
1986.
9
Percentage Points of the Normal Distribution
  • Use the table on slide 7 to determine the
    probability that the Δ9-THC content of marijuana
    from 1986 is less than 8%.
  • This is done by summing the probabilities of
    finding 6.0 to 6.5%, 6.5 to 7.0%, 7.0 to 7.5%, and
    7.5 to 8.0% Δ9-THC.
  • Very much the same thing can be done with the
    normal distribution, only the summation has to be
    calculated using a mathematical process called
    integration.
  • As integration is a difficult process,
    statisticians have calculated tables for a
    standardised normal distribution which can be
    rescaled to fit any particular normal
    distribution (see appx. B).
  • The standard normal distribution has a mean of 0
    and a standard deviation of 1.
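To illustrate the link between summing histogram-style bins and integration, here is a rough Python sketch. It adds up the areas of many narrow bins under the standard normal curve and compares the total with the cumulative distribution function from the standard library. The helper name `area_by_summation`, the bin count, and the use of -8 standard deviations as a finite stand-in for minus infinity are all assumptions made for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def area_by_summation(dist, lower, upper, bins=10_000):
    """Approximate the area under dist's PDF by summing narrow bins
    (midpoint rule); a hypothetical helper for illustration."""
    width = (upper - lower) / bins
    heights = (dist.pdf(lower + (i + 0.5) * width) for i in range(bins))
    return sum(heights) * width

# -8 standard deviations is used as a practical stand-in for minus infinity.
approx = area_by_summation(std_normal, -8.0, 2.0)
exact = std_normal.cdf(2.0) - std_normal.cdf(-8.0)

print(f"summed bins: {approx:.6f}, cdf: {exact:.6f}")
```

The two figures agree closely, which is what the printed tables exploit: the integration has been done once for the standard normal and can be reused.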

10
Standard normal distribution with mean 0 and
standard deviation 1. The shaded area covers
the area from −∞ (minus infinity) to 2 standard
deviations.
11
Area under the curve (1)
  • The shaded area under the standard normal curve
    extends from −∞ to 2 standard deviations.
  • ∞ is notation for infinity. The normal
    distribution is asymptotic in that it extends
    from −∞ to ∞, so the density at very large
    positive or negative numbers of standard
    deviations is very small but never zero.
  • If we wish to find the area under this portion of
    the curve we simply look down the rows of appx B
    until we find the row labelled z = 2.0. The
    column is then selected according to the second
    decimal place of z.
  • Standardised variables obtained by subtracting
    the mean and dividing the result by the standard
    deviation are sometimes called z-scores, or z for
    short.
  • For z = 2.00, the shaded area is 0.9772. The
    total area under the curve is 1, so we can see
    that 97.72% of the total area lies between −∞ and
    2 standard deviations.
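The table lookup described above can be checked in software. This is a small sketch using Python's standard library rather than the appendix referred to in the slides; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Area under the standard normal curve from minus infinity up to z = 2.00.
z = 2.00
area = standard_normal.cdf(z)

print(f"area below z = {z:.2f}: {area:.4f}")
```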

12
The distribution of slide 10 rescaled to the
normal distribution underlying the Δ9-THC content
sample from 1986.
13
Area under the curve (2)
  • On the previous slide, the mean is 8.59% and the
    standard deviation is 1.09%.
  • The upper limit, at the mean + 2 standard
    deviations, is at 8.59 + (2 × 1.09) = 10.77%.
  • As we know from slide 10 that 97.72% of the total
    area lies in the shaded zone, we can say that
    97.72% of marijuana consignments seized in 1986
    will have a Δ9-THC content of less than 10.77%.
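The rescaling step can be sketched in Python as follows. It uses the rounded mean and standard deviation quoted on the slide, so the computed upper limit may differ in the last digit from a figure worked out on unrounded data.

```python
from statistics import NormalDist

# Rounded values quoted on the slide (percent THC content).
mean, sd = 8.59, 1.09
thc_1986 = NormalDist(mu=mean, sigma=sd)

# Upper limit at the mean plus 2 standard deviations.
upper = mean + 2 * sd

# Converting back to a z-score recovers the standard normal case.
z = (upper - mean) / sd

# Proportion of the rescaled distribution lying below the upper limit.
area_below = thc_1986.cdf(upper)

print(f"upper = {upper:.2f}, z = {z:.2f}, area below = {area_below:.4f}")
```

The area is the same as for z = 2 on the standard normal, because rescaling changes the axis but not the proportion of area.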

14
The normal distribution for Δ9-THC content of
marijuana seized in 1986, showing the mean and
the 95% symmetric area about the mean
15
Area under the curve (3)
  • Find the shortest interval in which 95% of the
    samples fall.
  • By symmetry, this is centred about the mean.
  • The shaded area contains 95% of the area, leaving
    5% shared between the two tails, which means 2.5%
    in each tail. Thus the appropriate percentage
    point in appx B is 100% − 2.5% = 97.5%.
  • We see that 97.5% lies at 1.96 standard
    deviations from the mean.
  • Therefore the interval which contains 95% of the
    area runs from the mean minus 1.96 standard
    deviations to the mean plus 1.96 standard
    deviations, which is 8.59 ± (1.96 × 1.09) = 8.59
    ± 2.14 = 6.45 to 10.73%.
  • This means that there is a 95% probability that
    the Δ9-THC content of marijuana seized in 1986 is
    between 6.45% and 10.73%.
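A short Python sketch of the same calculation, using the inverse of the cumulative distribution function in place of the printed table. The mean and standard deviation are the rounded values from the slides; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 8.59, 1.09  # rounded values quoted on the slides

# 2.5% in each tail means we need the 97.5% point of the standard normal.
z = NormalDist().inv_cdf(0.975)

half_width = z * sd
lower, upper = mean - half_width, mean + half_width

# Check: the area between the two limits should be 95%.
thc_1986 = NormalDist(mu=mean, sigma=sd)
coverage = thc_1986.cdf(upper) - thc_1986.cdf(lower)

print(f"z = {z:.2f}, interval = {lower:.2f} to {upper:.2f}, coverage = {coverage:.3f}")
```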

16
The t-distribution and the standard error of the
mean (SEM)
  • We previously defined the SEM as the standard
    deviation divided by the square root of the
    sample size.
  • It is a measure of the uncertainty in the
    population mean when it has been estimated from
    a sample taken from the population.
  • For example, the Δ9-THC concentration in
    marijuana from 1986 has a standard deviation of
    1.09%, so the standard error of the estimate of
    the mean for a sample of 20 is 1.09 / √20 =
    1.09 / 4.47 = 0.244.
  • A sample mean standardised by its estimated
    standard error, however, does not follow a
    normal distribution. If we took repeated
    samples, the means we found would not always be
    the same, and the standardised means would be
    distributed with a t-distribution.
  • The t-distribution has only one parameter
    defining its shape. This is called the degrees of
    freedom (df), and is based on the sample size.
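The worked SEM figure can be reproduced with a couple of lines of Python. The degrees of freedom shown are n - 1, the usual choice for a single sample mean; that choice is stated here as an assumption, since this slide only says that df is based on the sample size.

```python
import math

sd = 1.09   # standard deviation quoted on the slide
n = 20      # sample size quoted on the slide

root_n = math.sqrt(n)
sem = sd / root_n

# Degrees of freedom for a single sample mean (assumed to be n - 1).
df = n - 1

print(f"sqrt(n) = {root_n:.2f}, SEM = {sem:.3f}, df = {df}")
```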

17
t-distribution (df = 4) with a standard normal
superimposed. The tails of the t-distribution are
shaded and contain 2.5% of the area in each.
18
t-testing between two independent samples
  • A widespread use for the t-distribution is
    testing between the means of two samples to
    examine the hypothesis that the means of the
    populations from which the two samples were drawn
    are equal (the null hypothesis).

19
Normal models for two sub-samples of n = 10 for the
Δ9-THC from seizures during 1986
20
Summary statistics for the two sub-samples of
slide 19
21
The object of a t-test
  • Slide 19 shows that even random sampling from
    the same set of data has led to a difference of
    0.76 between the means of the two sub-samples.
  • The object of a t-test is to look at the
    difference in means and see whether it is due to
    chance selection, or to some real difference
    between the populations.

22
The null hypothesis and the alternative hypothesis
  • Conventionally we start by erecting two
    hypotheses.
  • The null hypothesis, H0, is one of no difference,
    or that the means of the two sub-samples can be
    regarded as belonging to the same population.
  • The second hypothesis is complementary to the
    null hypothesis and is called the alternative
    hypothesis, H1. It states that the means from the
    two sub-samples are not equal, and that there are
    grounds for treating the two sub-samples as being
    drawn from different populations.

23
Calculating t (1)
  • The difference in the mean values of sub-sample 1
    and sub-sample 2 is 0.76, and this difference
    will have a distribution centred around 0.76
    with a dispersion se(x̄1 − x̄2) given by
  • Where x̄1 and x̄2 refer to the means of the two
    sub-samples, n1 is n for sub-sample 1, and n2 is
    n for sub-sample 2.

24
Calculating t (2)
  • The term s in the equation is an estimate for
    the pooled variance. The reason that we need a
    pooled variance is that we are trying to estimate
    a distribution for the difference between two
    means.
  • We cannot have a single unimodal distribution
    which has two variances, so we need a single
    estimate of variance.
  • This is done by a form of weighted average of the
    two component variances given by the formula on
    the next slide
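The equations shown on these slides did not survive in this transcript. A standard form of the pooled two-sample calculation that matches the description (a weighted average of the two sub-sample variances, with n1 + n2 − 2 degrees of freedom) is sketched below. The symbol ŝ for the pooled standard deviation is an assumption; the original slides may have used different notation.

```latex
\[
\operatorname{se}(\bar{x}_1 - \bar{x}_2)
  = \hat{s}\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},
\qquad
\hat{s}^{2}
  = \frac{(n_1 - 1)\,s_1^{2} + (n_2 - 1)\,s_2^{2}}{n_1 + n_2 - 2}
\]
```

Each variance is weighted by its own degrees of freedom, so with equal sub-sample sizes the pooled variance is simply the average of the two.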

25
Calculating t (3)
26
Calculating t (4)
  • s1² and s2² are the variances of the two
    sub-samples. Substituting the information from
    the table on slide 20 into the equation on the
    previous slide

27
Calculating t (6)
  • Taking s and substituting into the equation of
    slide 23

28
Calculating t (7)
  • The estimate for the standard error for the
    difference between the two sample means is 0.50.
  • From appx C, 95% of the probability for the
    difference between the two sample means will lie
    within t = 2.101 standard errors of the mean (for
    the t-test use df = n1 + n2 − 2), so a confidence
    interval for the difference of 0.76 will be
    0.76 ± (0.50 × 2.101) = 0.76 ± 1.05 = −0.29 to
    1.81.
  • The 95% confidence interval contains 0 as a
    possible value for the difference in means
    between the two samples.
  • Hence one would retain the hypothesis H0, and
    conclude that there is no significant difference
    in the means of the sub-samples at 95% confidence
    (or 5% significance), and these samples could
    have been taken from consignments with the same
    Δ9-THC content.
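The interval above can be reproduced with a few lines of Python. This sketch takes the difference, the standard error and the tabulated t value directly from the slides rather than recomputing them; the variable names are illustrative.

```python
# Figures quoted on the slides.
diff = 0.76        # difference between the two sub-sample means
se_diff = 0.50     # standard error of the difference
n1 = n2 = 10       # sub-sample sizes

df = n1 + n2 - 2   # degrees of freedom for the two-sample t-test
t_crit = 2.101     # tabulated two-tailed 5% value of t for df = 18

half_width = t_crit * se_diff
lower, upper = diff - half_width, diff + half_width

# H0 is retained when 0 lies inside the confidence interval.
contains_zero = lower <= 0.0 <= upper

print(f"df = {df}, 95% CI = {lower:.2f} to {upper:.2f}, contains 0: {contains_zero}")
```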

29
Comparing marijuana seizures from 1986 and 1987
30
Summary statistics for the Δ9-THC concentrations
in marijuana from 1986 and 1987 seizures
31
Calculating t
32
Interpreting t
  • The difference between the two means (8.58 and
    7.79) is quoted as 0.76, with standard error
    0.31.
  • df = 20 + 15 − 2 = 33
  • There is no specific row in appx C for df = 33,
    so interpolate three tenths of the way between
    df = 30 and df = 40.
  • The value of t is 2.036 standard errors.
  • A 95% confidence interval for the difference in
    the two means is therefore 0.76 ± (2.036 × 0.31)
    = 0.13 to 1.39.
  • This interval does not include 0, and so we may
    act as if the alternative hypothesis H1 is true.
    It has been shown with 95% confidence that the
    Δ9-THC concentrations in the two groups are
    different.
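The interpolation step can be sketched as below. The two table entries (2.042 for df = 30 and 2.021 for df = 40) are the usual two-tailed 5% values from standard t tables; they are not quoted in the slides and are included here as an assumption about what appx C contains. The difference and standard error are the figures given on the slide.

```python
# Standard two-tailed 5% critical values of t (assumed to match appx C).
t_30, t_40 = 2.042, 2.021

df = 20 + 15 - 2                      # n1 + n2 - 2
fraction = (df - 30) / (40 - 30)      # three tenths of the way from 30 to 40

# Linear interpolation between the two tabulated rows.
t_crit = t_30 + fraction * (t_40 - t_30)

diff, se_diff = 0.76, 0.31            # figures quoted on the slide
lower = diff - t_crit * se_diff
upper = diff + t_crit * se_diff

print(f"df = {df}, t = {t_crit:.3f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

Linear interpolation is only an approximation, but between neighbouring rows of a t table the error is far smaller than the rounding in the other figures.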

33
Testing between paired observations
  • Sometimes a question may arise concerning the
    differences between two means which may not be
    considered independent, e.g. where two treatments
    are applied to the same group of individuals such
    that each person receives treatment A then B,
    with some observations of interest being taken
    for each person under each treatment.
  • If the effects of treatment A are equivalent to
    the effects of treatment B then the mean of the
    differences should be 0.
  • Because the observations are subject to
    uncertainty, it would be unusual for the
    differences to be exactly zero.
  • So how large must the mean of differences be
    before we think that one treatment is better than
    the other?

34
Number of cells recovered from swabs from men
under two different extraction regimes, water and
phosphate buffered solution.
35
Means and differences of numbers of cells from
three men on each of three occasions
36
Calculating t for matched pairs
  • The sum of the squared deviations is S =
    18371500, so the variance is 18371500 / (n − 1)
    = 18371500 / 8 = 2296438.
  • The standard deviation is the square root of the
    variance, 1515.4.
  • The standard error of the mean (SEM) is the
    standard deviation divided by the square root of
    n: 1515.4 / √9 = 505.13.
  • A 95% confidence interval for the mean of −714
    cells can be calculated from the SEM and the
    tabulated value for t in appx C for n − 1 = 8
    degrees of freedom, which is 2.306.
  • The 95% confidence interval is −714 ± (2.306 ×
    505.13) = −1878.84 to 450.84.
  • This confidence interval includes zero, so at 95%
    confidence 0 is a possible value for the mean
    difference.
  • This gives us grounds for retaining H0, and
    acting as if there were no differences in the
    incidence of cell recovery between water and PBS.
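The chain of calculations for the matched pairs can be followed step by step in Python. This sketch starts from the summary figures quoted on the slide (the sum of squared deviations, n = 9 and the mean difference of -714) because the underlying table of differences is not reproduced in this transcript.

```python
import math

# Summary figures quoted on the slide.
sum_sq_dev = 18_371_500   # sum of squared deviations of the differences
n = 9                     # number of paired differences
mean_diff = -714          # mean of the differences (cells)
t_crit = 2.306            # tabulated two-tailed 5% value of t for df = 8

variance = sum_sq_dev / (n - 1)
sd = math.sqrt(variance)
sem = sd / math.sqrt(n)

half_width = t_crit * sem
lower, upper = mean_diff - half_width, mean_diff + half_width

print(f"variance = {variance:.1f}, SD = {sd:.1f}, SEM = {sem:.2f}")
print(f"95% CI = {lower:.2f} to {upper:.2f}")
```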

37
Confidence, significance, p-values (1)
  • H0 is a hypothesis of no difference
  • A Type I error is the rejection of H0 when H0 is
    true.
  • α is a pre-defined probability, called a
    significance level, at which making a Type I
    error is acceptable.
  • A p-value is the probability of finding the
    observed values, or any values more extreme,
    given the truth of the null hypothesis.
  • Confidence is the complement of significance,
    that is 1 − α.

38
Confidence, significance, p-values (2)
  • H1 is a hypothesis of difference.
  • A Type II error is the error of not rejecting H0
    when H0 is false.
  • β is the probability of making a Type II error.
  • 1 − β is called the power of a test, and can be
    interpreted as the probability of detecting a
    difference if one exists.
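The meaning of α and of power can be illustrated with a small simulation. The sketch below is not the t-test used in the slides: to stay within the Python standard library it uses a simpler z-test with a known population standard deviation. The function name `rejection_rate`, the sample size, the effect size of one standard deviation and the random seed are all assumptions chosen for the example.

```python
import math
import random
from statistics import NormalDist

def rejection_rate(true_mean, null_mean=0.0, sigma=1.0, n=10,
                   alpha=0.05, trials=5000, seed=1):
    """Fraction of simulated samples in which a two-sided z-test
    (known sigma) rejects H0: population mean == null_mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (sum(sample) / n - null_mean) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# H0 true: every rejection is a Type I error, so the rate should be near alpha.
type_i_rate = rejection_rate(true_mean=0.0)

# H0 false (true mean is 1 SD away): the rejection rate estimates the power, 1 - beta.
power = rejection_rate(true_mean=1.0)
beta = 1 - power

print(f"Type I error rate ~ {type_i_rate:.3f}, power ~ {power:.3f}, beta ~ {beta:.3f}")
```

Raising the sample size or the true difference increases the power, while the Type I error rate stays near the chosen α.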