Title: The Normal Distribution
2The Normal Distribution
- A theoretically appealing model to explain many
forms of natural continuous variation.
- Many discrete distributions may be approximated
by the normal distribution for large samples.
- Many continuous variables, especially from the
biological sciences, are approximately normally
distributed, e.g. the length of the femur (thigh
bone) and the length of the tibia (shin bone) in
adult humans.
3Probability density functions (PDF) for adult
human femurs and tibias. The mean and standard
deviation of the femur PDF are arrowed
4Comparison between the PDFs for femurs and tibias
- The tibia is usually shorter than the femur, but
some people have tibias which are longer than other
people's femurs.
- The mean of the tibia lengths is shorter than the
mean of the femur lengths.
- Femur lengths are more spread out than tibia
lengths, so the PDF for tibias is more peaked.
- Each normal distribution has two parameters which
control what it looks like: (1) the mean, the
location about which the distribution is
centred; (2) the standard deviation, the
dispersion parameter.
5The standard deviation is the square root of the
sample variance
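For reference, the relationship in this slide's title is conventionally written as below. The symbols (x_i for the observations, x̄ for the sample mean, n for the sample size) are assumed notation rather than taken from the slide; the n − 1 divisor is the one used in the paired example on slide 36.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```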
6The standard error of the mean
- Another measure is the standard error of the mean
(or simply standard error), which arises from the
fact that x̄ is only an estimate of the
population mean. Were we to sample repeatedly
from the same population, the means of the samples
would be unlikely to be the same in each case.
This means that our estimates of the population
mean would also vary. This variability is called
the standard error of the mean (SEM).
- SEM(x̄) = SD / √n
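The idea of repeated sampling can be sketched with a small simulation. This is an illustration under stated assumptions, not part of the original slides: the population is taken to be normal with mean 8.59 and standard deviation 1.09 (the figures quoted later for the 1986 data), the sample size is 20, and the helper names `sem` and `simulate_sample_means` are hypothetical.

```python
import math
import random
import statistics


def sem(sd, n):
    """Standard error of the mean: SD / sqrt(n)."""
    return sd / math.sqrt(n)


def simulate_sample_means(mean, sd, n, repeats, seed=0):
    """Draw `repeats` samples of size n from Normal(mean, sd) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mean, sd) for _ in range(n))
        for _ in range(repeats)
    ]


# Assumed population: mean 8.59, SD 1.09; samples of n = 20
means = simulate_sample_means(8.59, 1.09, n=20, repeats=5000)
spread_of_means = statistics.stdev(means)   # how much the sample means vary
theoretical_sem = sem(1.09, 20)             # SD / sqrt(n)

print(f"SD of simulated means: {spread_of_means:.3f}")
print(f"SD / sqrt(n):          {theoretical_sem:.3f}")
```

The spread of the simulated means should land close to SD / √n, which is the point of the formula.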
7Empirical Probability Density for the Δ9-THC
content of sample marijuana from 1986
8Probability Density Function superimposed on the
histogram of Δ9-THC in marijuana seizures from
1986.
9Percentage Points of the Normal Distribution
- Use the table on slide 7 to determine the
probability that the Δ9-THC content of marijuana
from 1986 is less than 8%.
- This is done by summing the probabilities of
finding 6.0 to 6.5, 6.5 to 7.0, 7.0 to 7.5, and 7.5
to 8.0% Δ9-THC.
- Very much the same thing can be done with the
normal distribution, only the summation has to be
calculated using a mathematical process called
integration.
- As integration is a difficult process,
statisticians have calculated tables for a
standardised normal distribution which can be
rescaled to fit any particular normal
distribution (see appx. B).
- The standard normal distribution has a mean of 0
and a standard deviation of 1.
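The link between summing bin probabilities and rescaling a standard normal can be sketched as follows. This uses a fitted normal model, not the empirical table on slide 7, so the numbers are model probabilities only; the mean 8.59 and standard deviation 1.09 are the values quoted later in the deck, and Python's `statistics.NormalDist` stands in for the printed table.

```python
from statistics import NormalDist

# Normal model for the 1986 data (mean and SD as quoted later in the deck)
thc_1986 = NormalDist(mu=8.59, sigma=1.09)

# Bin edges matching the intervals summed in the text
edges = [6.0, 6.5, 7.0, 7.5, 8.0]

# Probability of each bin = area under the PDF between its two edges
bin_probs = [thc_1986.cdf(hi) - thc_1986.cdf(lo) for lo, hi in zip(edges, edges[1:])]
total = sum(bin_probs)

# The same area in one step: standardise the limits, then use the standard normal
standard = NormalDist()              # mean 0, SD 1
z_lo = (6.0 - 8.59) / 1.09
z_hi = (8.0 - 8.59) / 1.09
rescaled = standard.cdf(z_hi) - standard.cdf(z_lo)

print(f"sum of bin probabilities: {total:.4f}")
print(f"via standard normal:      {rescaled:.4f}")
```

Both routes give the same area, which is why a single standardised table is enough for any normal distribution.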
10Standard normal distribution with mean 0 and
standard deviation 1. The shaded area covers
the area −∞ (minus infinity) to 2 standard
deviations.
11Area under the curve (1)
- The shaded area under the standard normal curve
extends from −∞ to 2 standard deviations.
- ∞ is notation for infinity. The normal
distribution is asymptotic in that it goes from
−∞ to +∞, so the density at very large +ve or
−ve numbers of standard deviations is very small
but never zero.
- If we wish to find the area under this portion of
the curve we simply look down the rows of appx B
until we find the row labelled z = 2.0. For the
second decimal place the appropriate column is
selected.
- Standardised variables obtained by subtracting
the mean and dividing the result by the standard
deviation are sometimes called z-scores, or z for
short.
- For z = 2.00, the shaded area is 0.9772. The
total area under the curve is 1, so we can see
that 97.72% of the total area lies between −∞ and
2 standard deviations.
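As a cross-check on the table lookup, the same area can be computed directly. This is a minimal sketch in which `statistics.NormalDist` plays the role of appx B; the helper name `z_score` is hypothetical.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Standardise x: subtract the mean, then divide by the standard deviation."""
    return (x - mean) / sd


standard = NormalDist()             # standard normal: mean 0, SD 1
area_below_2 = standard.cdf(2.0)    # area from minus infinity up to z = 2

print(f"{area_below_2:.4f}")        # prints 0.9772
```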
12The distribution of slide 10 rescaled to the
normal distribution underlying the Δ9-THC content
sample from 1986.
13Area under the curve (2)
- On the previous slide, the mean is 8.59% and the
standard deviation is 1.09%.
- The upper limit at the mean + 2 standard
deviations is at 8.59 + (2 × 1.09) = 10.77%.
- As we know from slide 10 that 97.72% of the total
area lies in the shaded zone, we can say that
97.72% of marijuana consignments seized in 1986
will have a Δ9-THC content of less than 10.77%.
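The rescaling step can be checked by evaluating the arithmetic directly. A small sketch, assuming the fitted normal model with the mean and standard deviation quoted on the slide:

```python
from statistics import NormalDist

mean, sd = 8.59, 1.09            # values quoted on the slide
upper = mean + 2 * sd            # mean plus two standard deviations

thc_1986 = NormalDist(mean, sd)  # rescaled normal model
below_upper = thc_1986.cdf(upper)

print(f"upper limit: {upper:.2f}")
print(f"proportion below the upper limit: {below_upper:.4f}")
```

The proportion below the limit matches the standard normal area below z = 2, because rescaling moves the limit but not the area.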
14The normal distribution for Δ9-THC content from
marijuana contents seized in 1986 showing the
mean and the 95% symmetric area about the mean
15Area under the curve (3)
- Find the shortest interval in which 95% of the
samples fall.
- By symmetry, this is centred about the mean.
- The shaded area contains 95% of the area, leaving
5% over both tails, which means 2.5% in each
tail. Thus the appropriate percentage point in
appx B is 100 − 2.5 = 97.5%.
- We see that 97.5% lies at 1.96 standard
deviations from the mean.
- Therefore the interval which contains 95% of the
area goes between the mean minus 1.96 standard
deviations and the mean plus 1.96 standard
deviations, which is 8.59 ± (1.96 × 1.09) = 8.59
± 2.14 = 6.45 to 10.73%.
- This means that there is a 95% probability that
the Δ9-THC content of marijuana seized in 1986 is
between 6.45% and 10.73%.
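The steps above can be sketched in a few lines. Here `NormalDist().inv_cdf` is used in place of reading the 97.5% point from appx B; the mean and standard deviation are those quoted for the 1986 data.

```python
from statistics import NormalDist

mean, sd = 8.59, 1.09

# 2.5% in each tail, so look up the 97.5% point of the standard normal
z = NormalDist().inv_cdf(0.975)

half_width = z * sd
lower, upper = mean - half_width, mean + half_width

print(f"z = {z:.2f}, interval = {lower:.2f} to {upper:.2f}")
```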
16The t-distribution and the standard error of the
mean (SEM)
- We previously defined the SEM as the standard
deviation divided by the square root of the
sample size.
- It is a measure of the uncertainty in the
population mean when it has been estimated by
a sample taken from the population.
- For example, the Δ9-THC concentration in
marijuana from 1986 has a standard deviation of
1.09%, so the standard error of the estimate of
the mean for a sample of 20 is 1.09 / √20 =
1.09 / 4.47 = 0.244.
- A mean, however, is not in itself distributed
with a normal distribution when the standard
deviation has to be estimated from the sample. If
we took repeated samples, the means we found would
not always be the same, and, standardised by their
estimated standard errors, would be distributed
with a t-distribution.
- The t-distribution has only one parameter
defining its shape. This is called the degrees of
freedom (df), and is based on the sample size.
17t-distribution (df = 4) with a standard normal
superimposed. The tails of the t-distribution are
shaded and contain 2.5% of the area in each.
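The comparison in this figure can be reproduced numerically. The sketch below writes out the standard density of Student's t-distribution and integrates it with a simple midpoint rule; the cut-off 2.776 is the usual two-tailed 5% table value for df = 4 and is an assumption here, since the slide does not quote it. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def central_area(limit, df, steps=20000):
    """Area under the t density between -limit and +limit (midpoint rule)."""
    width = 2 * limit / steps
    return sum(t_pdf(-limit + (i + 0.5) * width, df) for i in range(steps)) * width


# Assumed table value: t = 2.776 leaves 2.5% in each tail for df = 4
area = central_area(2.776, df=4)
print(f"central area for df = 4: {area:.3f}")

# Heavier tails: far from the centre the t density exceeds the normal density
print(t_pdf(3.0, 4) > NormalDist().pdf(3.0))
```

With only 4 degrees of freedom the 95% limits sit at about 2.78 standard units rather than 1.96, which is the practical consequence of the heavier tails.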
18t-testing between two independent samples
- A widespread use for the t-distribution is
testing between the means of two samples to
examine the hypothesis that the means of the
populations from which the two samples were drawn
are equal (the null hypothesis).
19Normal models for two sub-samples of n = 10 for the
Δ9-THC from seizures during 1986
20Summary statistics for the two sub-samples of
slide 19
21The object of a t-test
- From slide 19, even random sampling from the same
set of data has led to a difference of 0.76 in the
means of the two sub-samples.
- The object of a t-test is to look at the
difference in means and see whether the
difference is due to chance selection, or some
real populational difference.
22The null hypothesis and the alternative hypothesis
- Conventionally we start by erecting two
hypotheses.
- The null hypothesis, H0, is one of no difference,
or that the means of the two sub-samples can be
regarded as belonging to the same population.
- The second hypothesis is complementary to the
null hypothesis and is called the alternative
hypothesis, H1. It states that the means from the
two sub-samples are not equal, and that there are
grounds for treating the two sub-samples as being
drawn from different populations.
23Calculating t (1)
- The difference in the mean values of sub-sample 1
and sub-sample 2 is 0.76, and this difference
will have a distribution centred around 0.76
with a dispersion se(x̄1 − x̄2) given by
se(x̄1 − x̄2) = ŝ √(1/n1 + 1/n2)
- Where x̄1 and x̄2 are the means of the two
sub-samples, n1 is n for sub-sample 1, n2 is n
for sub-sample 2.
24Calculating t (2)
- The term ŝ in the equation is the pooled estimate
of the standard deviation, the square root of the
pooled variance. The reason that we need a
pooled variance is that we are trying to estimate
a distribution for the difference between two
means.
- We cannot have a single unimodal distribution
which has two variances, so we need a single
estimate of variance.
- This is done by a form of weighted average of the
two component variances given by the formula on
the next slide.
25Calculating t (3)
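The weighted average described on slide 24 is conventionally written as the pooled variance below, shown together with the standard error it feeds into. This is the standard textbook form, assumed rather than copied from the slide; ŝ², s1², s2², n1 and n2 follow the symbols used in the surrounding slides.

```latex
\hat{s}^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
\mathrm{se}(\bar{x}_1 - \bar{x}_2) = \hat{s}\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
```

Each variance is weighted by its own degrees of freedom, so the larger sub-sample contributes more to the pooled estimate.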
26Calculating t (4)
- s1² and s2² are the variances of the two
sub-samples. Substituting the information from
the table on slide 20 into the equation on the
previous slide -
27Calculating t (6)
- Taking ŝ and substituting into the equation of
slide 23
28Calculating t (7)
- The estimate for the standard error for the
difference between the two sample means is 0.50.
- From appx C, 95% of the probability for the
difference between the two sample means will lie
within t = 2.101 standard errors of the mean (for
the t-test use df = n1 + n2 − 2), so a confidence
interval for the difference of 0.76 will be
0.76 ± (0.50 × 2.101) = 0.76 ± 1.05 = −0.29 to
1.81.
- The 95% confidence interval contains 0 as a
possible value for the difference in means
between the two samples.
- Hence one would accept the hypothesis H0, and
conclude that there is no significant difference
in the means of the sub-samples at 95% confidence
(or 5% significance), and that these samples could
have been taken from consignments with the same
Δ9-THC content.
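The interval calculation on this slide can be sketched as below. The figures 0.76, 0.50 and 2.101 come from the slide; the helpers `pooled_se` and `confidence_interval` are hypothetical names, and `pooled_se` uses the standard pooled-variance form, which is assumed here because the slide's own formula and the table on slide 20 are not reproduced in this text.

```python
import math


def pooled_se(s1, s2, n1, n2):
    """Standard error of the difference between two means, using a pooled variance."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)


def confidence_interval(diff, se, t_crit):
    """Interval diff +/- t_crit * se, returned as (low, high)."""
    half_width = t_crit * se
    return diff - half_width, diff + half_width


# Figures quoted on the slide: difference 0.76, SE 0.50, t = 2.101 for df = 18
low, high = confidence_interval(0.76, 0.50, 2.101)

print(f"{low:.2f} to {high:.2f}")
print("contains 0:", low <= 0 <= high)
```

Given the two sub-sample standard deviations, `pooled_se(s1, s2, 10, 10)` would supply the 0.50 used above.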
29Comparing marijuana seizures from 1986 and 1987
30Summary statistics for the Δ9-THC concentrations
in marijuana from 1986 and 1987 seizures
31Calculating t
32Interpreting t
- The difference in the two means, 8.58 − 7.79, is
given as 0.76, with standard error 0.31.
- df = 20 + 15 − 2 = 33.
- There is no specific row in appx C for df = 33,
so use three tenths of the way between df = 30
and df = 40.
- The value of t is 2.036 standard errors.
- A 95% confidence interval for the difference in
the two means is therefore 0.76 ± (2.036 × 0.31)
= 0.13 to 1.39.
- This interval does not include 0, and so we may
act as if the alternative hypothesis H1 is true.
It has been shown with 95% confidence that the
Δ9-THC concentrations in the two groups are
different.
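The interpolation between table rows can be sketched as follows. The row values 2.042 (df = 30) and 2.021 (df = 40) are the usual two-tailed 5% entries of a t-table and are an assumption here, since appx C is not reproduced; the difference and standard error are the figures quoted on the slide.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation between the points (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)


# Assumed table rows: df 30 -> 2.042, df 40 -> 2.021 (two-tailed, 5%)
t_crit = interpolate(33, 30, 2.042, 40, 2.021)

diff, se = 0.76, 0.31          # figures quoted on the slide
low, high = diff - t_crit * se, diff + t_crit * se

print(f"t = {t_crit:.3f}, interval = {low:.2f} to {high:.2f}")
print("contains 0:", low <= 0 <= high)
```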
33Testing between paired observations
- Sometimes a question may arise concerning the
differences between two means which may not be
considered independent, e.g. where two treatments
are applied to the same group of individuals such
that each person receives treatment A then B,
with some observations of interest being taken
for each person under each treatment.
- If the effects of treatment A are equivalent to
the effects of treatment B then the mean of the
differences should be 0.
- Because the observations are subject to
uncertainty, it would be unusual for the mean of
the differences to be exactly zero.
- So how large must the mean of differences be
before we think that one treatment is better than
the other?
34Number of cells recovered from swabs from men
under two different extraction regimes, water and
phosphate buffered solution.
35Means and differences of numbers of cells from
three men on each of three occasions
36Calculating t for matched pairs
- The sum of the squared deviations S = 18371500,
so the variance is 18371500 / (n − 1) = 2296438.
- The standard deviation is the square root of the
variance = 1515.4.
- The standard error of the mean (SEM) is the
standard deviation divided by the square root of n
= 1515.4 / √9 = 505.13.
- A 95% confidence interval for the mean of −714
cells can be calculated from the SEM and the
tabulated value for t in appx C for n − 1 = 8
degrees of freedom, which is 2.306.
- The 95% confidence interval is −714 ± (2.306 ×
505.13) = −1878.84 to 450.84.
- This confidence interval includes zero, so at 95%
confidence 0 is a possible value for the mean
difference.
- This gives us grounds for accepting H0, and
acting as if there were no differences in the
incidence of cell recovery between water and PBS.
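The chain of calculations on this slide can be followed step by step in code. All four input figures are taken from the slide; only the variable names are introduced here.

```python
import math

sum_sq_dev = 18371500   # sum of squared deviations of the paired differences
n = 9                   # number of paired differences
mean_diff = -714        # mean difference in cells recovered
t_crit = 2.306          # tabulated t for n - 1 = 8 degrees of freedom

variance = sum_sq_dev / (n - 1)
sd = math.sqrt(variance)
sem = sd / math.sqrt(n)          # standard deviation / sqrt(n)
low, high = mean_diff - t_crit * sem, mean_diff + t_crit * sem

print(f"variance = {variance:.1f}, SD = {sd:.1f}, SEM = {sem:.2f}")
print(f"95% CI: {low:.2f} to {high:.2f}")
print("contains 0:", low <= 0 <= high)
```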
37Confidence, significance, p-values (1)
- H0 is a hypothesis of no difference.
- A Type I error is the rejection of H0 when H0 is
true.
- α is a pre-defined probability, called a
significance level, at which making a Type I
error is acceptable.
- A p-value is the probability of finding the
observed values, or any values more extreme,
given the truth of the null hypothesis.
- Confidence is the complement of significance,
that is 1 − α.
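To make the p-value definition concrete, the sketch below computes a two-sided p-value for the paired example of slide 36 (mean difference −714, SEM 505.13, 8 degrees of freedom). The slides themselves work with confidence intervals, so this is an added illustration; the t density is the standard formula and the area is found with a simple midpoint rule. The function names are hypothetical.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_obs, df, steps=20000):
    """P(|T| >= |t_obs|) under H0: one minus the central area (midpoint rule)."""
    limit = abs(t_obs)
    width = 2 * limit / steps
    central = sum(t_pdf(-limit + (i + 0.5) * width, df) for i in range(steps)) * width
    return 1 - central


alpha = 0.05                     # pre-defined significance level
t_obs = -714 / 505.13            # mean difference divided by its SEM
p = two_sided_p(t_obs, df=8)

print(f"t = {t_obs:.2f}, p = {p:.3f}, reject H0: {p < alpha}")
```

The p-value exceeds α, which agrees with the confidence interval on slide 36 containing zero.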
38Confidence, significance, p-values (2)
- H1 is a hypothesis of difference.
- A Type II error is the error of not rejecting H0
when H0 is false.
- β is the probability of making a Type II error.
- 1 − β is called the power of a test, and can be
interpreted as the probability of detecting a
difference if one exists.
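Power can be illustrated with a simplified calculation. The sketch below assumes a two-sided z-test with a known standard error, a normal approximation rather than the t-tests used in the slides, and the inputs (a true difference of 0.76 and a standard error of 0.31) are borrowed purely as a hypothetical illustration. The function name is also hypothetical.

```python
from statistics import NormalDist


def power_two_sided_z(true_diff, se, alpha=0.05):
    """Power of a two-sided z-test to detect true_diff, assuming a known standard error."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = true_diff / se
    # Probability that the test statistic falls in either rejection region
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)


power = power_two_sided_z(0.76, 0.31)   # hypothetical true difference and SE
beta = 1 - power                        # probability of a Type II error

print(f"power = {power:.2f}, beta = {beta:.2f}")
```

When the true difference is zero the function returns α, since every rejection is then a Type I error.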