Title: Module Four: Normal distribution and its applications to inter-laboratory testing
1
Module Four: Normal distribution and its applications to inter-laboratory testing
- When we conduct inter-laboratory testing, we often observe continuous variables. For example, the amount of chloride in a water sample, the beta-carotene in a blood sample, and blood pressure are all continuous variables.
- When we construct a relative frequency histogram, the distribution is very often bell-shaped: a few values are small, a few are large, and most are around the average.
- This type of distribution is what we call a NORMAL distribution.
- For example, blood pressure, the beta-carotene in a blood sample, and the amount of chloride in a water sample mostly follow normal curves.
2
A histogram with a superimposed normal curve for the systolic blood pressure of 1900 individuals
[Figure: histogram of systolic blood pressure with a normal curve, m = 115, s = 14]
- The superimposed smooth curve is bell-shaped. Suppose blood pressure follows a normal curve with mean 115 and s.d. 14.
- We use the notation X ~ N(m, s).
- For this case, X ~ N(115, 14).
- An immediate question is: how can we detect whether the distribution indeed follows a normal curve?
Our interest may be to check whether blood pressure follows a normal distribution, to find what proportion of individuals have at-risk blood pressure (150 mmHg or higher), or to identify extreme cases.
3
- When and how do you use the normal distribution in real-world situations?
- The normal curve describes the probability of occurrence in many real situations.
- Most statistical techniques, including those used for analyzing inter-laboratory testing data, assume that the response variable approximately follows a normal curve.
- These methods may not be valid if the response does not follow a normal distribution. It is therefore important to learn how to check whether a response variable follows a normal distribution. For this reason, we need to learn some basic properties of the normal distribution and how to compute probabilities and percentiles for it.
- In this module, we will discuss:
  - The use of the z-table and Minitab to compute probabilities and percentiles.
  - Techniques for checking whether a response variable follows a normal distribution.
4
- The normal probability distribution provides a good model for describing data that have mound-shaped frequency distributions.
- The Normal Probability Distribution

  $$f(x) = \frac{1}{s\sqrt{2\pi}}\, e^{-\frac{(x-m)^2}{2s^2}}, \qquad -\infty < x < \infty$$

- where e ≈ 2.718 and π ≈ 3.142; m and s (s > 0) are the parameters that represent the population mean and standard deviation.
- We will use the notation X ~ N(m, s). This means X is distributed as normal with mean m and standard deviation s.
- Some examples of normal random variables are X = adult height, X = scores on a national test, X = gas price, X = blood pressure.
- NOTE: X = salary of individuals who are 40 years or older, before retirement, does not follow a normal curve. Its distribution is skewed to the right.
5
- Properties of the Normal Distribution
- This figure shows three such distributions with differing values of m and s.
[Figure: three normal curves centered at m1, m2, m3, with standard deviations s1, s2, s3]
- The mean determines the center. In this case, m1 < m2 < m3.
- The standard deviation measures the variability. In this case, s2 < s1 < s3.
- Large values of s reduce the height of the curve and increase the spread.
- Small values of s increase the height of the curve and reduce the spread.
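The effect of s on the height of the curve can be checked numerically. The sketch below is illustrative only: it uses Python's standard-library `statistics.NormalDist`, and the three values of s are arbitrary choices, not values from the figure.

```python
from statistics import NormalDist

def peak_height(s: float) -> float:
    """Height of the N(m, s) density at its center x = m.

    The height does not depend on m, so m = 0 is used here.
    """
    return NormalDist(mu=0.0, sigma=s).pdf(0.0)

# Arbitrary illustrative standard deviations (not taken from the module).
for s in (0.5, 1.0, 2.0):
    print(f"s = {s}: peak height = {peak_height(s):.4f}")
```

Doubling s halves the peak height, because the height at the center is 1/(s·sqrt(2π)) while the total area under the curve stays equal to 1.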
6
Some properties for X ~ N(m, s)
7
- Example
- Every year, universities recruit students using their SAT scores. Based on previous information, we know that SAT scores follow a normal curve with mean 1000 and standard deviation 180. In the past, CMU has admitted students with an SAT score of 1090 or higher.
- Q1: What percentage of high school students can receive CMU admission?
- Q2: Suppose CMU decides to raise the SAT admission limit so as to admit only the top 20% of high school graduates. What should the new SAT admission limit be?
- Q3: A student scored 1200 and claims he is in the top 10%. Is this claim correct?
8
Tabulated Areas of the Normal Probability Distribution
- How do you solve the SAT admission problem?
- First, we need to rewrite the problem using notation we are familiar with.
- Let X = SAT score. Then, from the given information, we know X ~ N(1000, 180).
- Q1 asks for P(X > 1090).
- Q2 asks for a value of X, call it x0, the admission limit, such that P(X > x0) = .2.
- Q3 asks us to compare P(X > 1200) with .1.
- How do we solve these problems?
- The probability that a continuous random variable x assumes a value in the interval from a to b is the area under the probability density function between the points a and b.
9
- One can use software such as Minitab, or use a standardized Z-table.
- The Standard Normal Random Variable
- The standardized normal random variable z is defined as z = (x - m)/s, or equivalently, x = m + zs.
- The standard normal probability distribution has a mean of zero and a standard deviation of 1, that is, Z ~ N(0, 1).
- The area under the standard normal curve between the mean z = 0 and a specified positive value of z, say z0, is the probability P(0 < Z < z0).
- Some books use this table. Some use other types of tables.
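10
Back to the SAT score problem
X ~ N(1000, 180); find P(X > 1090).
[Figure: normal curve for X, the SAT score, with 1000 and 1090 marked. Under Z = (x - 1000)/180, x = 1000 corresponds to z = (1000 - 1000)/180 = 0 and x = 1090 corresponds to z = (1090 - 1000)/180 = 0.5.]
The idea is to transform X ~ N(m, s) to Z ~ N(0, 1) using z = (x - m)/s:
P(X > 1090) = P(Z > (1090 - 1000)/180) = P(Z > 0.5).
Now the Z-table can be applied.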
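As a cross-check on the table lookup, the same standardization can be sketched in Python with the standard-library `statistics.NormalDist`. The helper name `upper_tail` is an illustrative choice, not part of the module. The sketch covers Q1 and, by the same transformation, Q3.

```python
from statistics import NormalDist

sat = NormalDist(mu=1000, sigma=180)   # X ~ N(1000, 180), from the SAT example
std_normal = NormalDist()              # Z ~ N(0, 1)

def upper_tail(x: float) -> float:
    """P(X > x), computed by standardizing x and using the standard normal CDF."""
    z = (x - sat.mean) / sat.stdev
    return 1.0 - std_normal.cdf(z)

p_q1 = upper_tail(1090)   # Q1: P(X > 1090) = P(Z > 0.5)
p_q3 = upper_tail(1200)   # Q3: P(X > 1200), to be compared with .1

print(f"Q1: P(X > 1090) = {p_q1:.4f}")
print(f"Q3: P(X > 1200) = {p_q3:.4f}; within the top 10%? {p_q3 <= 0.10}")
```

With a table of areas between 0 and z0, Q1 works out as P(Z > 0.5) = .5 - P(0 < Z < 0.5) = .5 - .1915 = .3085, so about 31% of students meet the 1090 limit. For Q3 the upper-tail probability is about .13, which is larger than .1, so a score of 1200 does not fall in the top 10%.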
12
- Example: Find P(0 < Z < 1.63).
- Solution:
  - Draw a normal curve and shade the area of interest.
  - Rewrite the question in a form to which the Z-table can be applied, that is, in the form P(0 < Z < z0).
  - For this example, it is already in this form, so using the Z-table we obtain P(0 < Z < 1.63) = .4484.
- Some additional exercises:
  - Find P(Z < 1.96), find P(-1.24 < Z < .68), find P(Z > -1.64).
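The table lookups above can be verified with a short Python sketch. It assumes the table gives the area between 0 and z0, as described earlier; `table_area` is a hypothetical helper that mimics one table entry using the standard-library `statistics.NormalDist`.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

def table_area(z0: float) -> float:
    """Area under the standard normal curve between 0 and z0 (z0 >= 0)."""
    return Z.cdf(z0) - 0.5

# Worked example: P(0 < Z < 1.63)
print(round(table_area(1.63), 4))

# Exercises, each rewritten in terms of 0-to-z0 areas:
print(round(0.5 + table_area(1.96), 4))               # P(Z < 1.96)
print(round(table_area(1.24) + table_area(0.68), 4))  # P(-1.24 < Z < .68), using symmetry
print(round(0.5 + table_area(1.64), 4))               # P(Z > -1.64) = P(Z < 1.64)
```

Each exercise is reduced to sums of 0-to-z0 areas and the half-area .5, which is the same rewriting one does by hand before opening the table.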
13
- Calculating Probabilities for a General Normal Random Variable, X
  1. Draw a normal curve for X and shade the area of interest.
  2. Transform X to Z.
    - Standardize the interval of interest, writing it as the equivalent interval in terms of z.
    - The probability of interest is the area that you find using the standard normal probability distribution.
14
Now, back to the SAT example. Do the following exercises. The SAT score, X, follows a normal distribution with mean 1000 and s.d. 180, that is, X ~ N(1000, 180).
- Find P(X < 800).
- Find P(750 < X < 900).
- Find P(1180 < X < 1360).
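One way to check answers to these exercises is to code the two-step procedure (standardize, then use the standard normal) directly. This is a minimal sketch using Python's standard library; `standardize` and `interval_prob` are illustrative names, and `None` is used here to mark an open end of the interval.

```python
from statistics import NormalDist
from typing import Optional

Z = NormalDist()  # standard normal, N(0, 1)

def standardize(x: float, m: float, s: float) -> float:
    """z = (x - m) / s."""
    return (x - m) / s

def interval_prob(m: float, s: float,
                  lo: Optional[float] = None,
                  hi: Optional[float] = None) -> float:
    """P(lo < X < hi) for X ~ N(m, s); None means the interval is open on that side."""
    upper = 1.0 if hi is None else Z.cdf(standardize(hi, m, s))
    lower = 0.0 if lo is None else Z.cdf(standardize(lo, m, s))
    return upper - lower

print(f"P(X < 800)         = {interval_prob(1000, 180, hi=800):.4f}")
print(f"P(750 < X < 900)   = {interval_prob(1000, 180, lo=750, hi=900):.4f}")
print(f"P(1180 < X < 1360) = {interval_prob(1000, 180, lo=1180, hi=1360):.4f}")
```

For the last exercise the limits standardize to exactly z = 1 and z = 2, so the result should agree with the table value P(0 < Z < 2) - P(0 < Z < 1).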
15
- How about the question of determining the SAT admission score for CMU so that the top 20% will receive admission?
- Answer: X ~ N(1000, 180). The problem is to find the admission score x0 such that P(X > x0) = .2.
- Here we are looking for a score, not a probability, so we reverse the problem-solving procedure.
- A similar technique is applied:
  - Draw a normal curve and shade the area of interest.
  - Transform from X to Z.
  - Rewrite the problem in terms of Z.
  - Solve for the standardized value z0 by using the Z-table in reverse.
  - Transform z0 back to x0 by x0 = m + s(z0).
16
To solve for the admission score x0 such that P(X > x0) = .2:
- Draw the normal curve, shade the area of interest, and transform to Z.
- .2 = P(X > x0) = P(Z > z0) implies P(0 < Z < z0) = .3. This is a form to which we can apply the Z-table.
- Looking inside the table, find the probability closest to .3, which is .2995. By the Z-table, .2995 = P(0 < Z < .84). Therefore z0 = .84, which is the standardized admission limit.
- Solving for x0, we have x0 = m + s(z0) = 1000 + (180)(.84) = 1151.2.
The CMU SAT admission limit will be about 1151.2. (In actual application, for setting up the policy, we can use 1150 as the new admission standard.)
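The reverse lookup can also be done without rounding z0 to two decimals. The sketch below compares the table-based answer with the value from `statistics.NormalDist.inv_cdf` in Python's standard library. The variable names are illustrative.

```python
from statistics import NormalDist

m, s = 1000, 180        # X ~ N(1000, 180), from the SAT example
top_fraction = 0.20     # admit the top 20%

# Table-based answer from the text: z0 = .84
z0_table = 0.84
x0_table = m + s * z0_table

# Software answer: z0 is the 80th percentile of Z, since P(Z > z0) = .2
z0_exact = NormalDist().inv_cdf(1 - top_fraction)
x0_exact = m + s * z0_exact

print(f"table:    z0 = {z0_table:.2f},   x0 = {x0_table:.1f}")
print(f"software: z0 = {z0_exact:.4f}, x0 = {x0_exact:.1f}")
```

The two answers differ by less than one SAT point, because the table only resolves z0 to two decimal places. Either way, a rounded policy limit of 1150 is consistent with the result.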
17
Hands-on activities
- Q-a: For the SAT example, X ~ N(1000, 180), suppose a university admits only the top 5%. Find its admission limit.
- Q-b: Find the 5th percentile of the SAT score.
- Q-c: Find the Q3 SAT score (75th percentile).
18
- Use Minitab to compute cumulative probabilities and percentiles for a normal distribution
- Go to Calc, choose Probability Distributions, then select Normal.
- In the dialog box, choose the quantity you are computing:
  - Probability density: f(x).
  - Cumulative probability: P(X < a) for any given a.
  - Inverse cumulative probability: the 100p-th percentile x0, such that P(X < x0) = p.
- Enter the mean and s.d. By default, the distribution is N(0, 1).
- To compute a cumulative probability, you need to provide the values of a, which may be stored in a column, e.g., C3, or entered as a constant.
- To compute an inverse cumulative probability, you need to provide the cumulative probabilities, which must be in (0, 1).
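For readers without Minitab, the three dialog options have rough counterparts in Python's standard library. The sketch below is not Minitab's interface: `normal_calc` and its option names are hypothetical, and the list `c3` merely stands in for a worksheet column.

```python
from statistics import NormalDist

def normal_calc(option: str, values, mean: float = 0.0, sd: float = 1.0):
    """Apply one of three calculations to every value in a list.

    option: "density"    -> f(x)
            "cumulative" -> P(X < a)
            "inverse"    -> x0 such that P(X < x0) = p, with p in (0, 1)
    The default mean = 0, sd = 1 is the standard normal N(0, 1).
    """
    dist = NormalDist(mu=mean, sigma=sd)
    funcs = {"density": dist.pdf, "cumulative": dist.cdf, "inverse": dist.inv_cdf}
    return [funcs[option](v) for v in values]

c3 = [800, 1000, 1090]  # hypothetical column of SAT scores
print(normal_calc("cumulative", c3, mean=1000, sd=180))
print(normal_calc("inverse", [0.05, 0.5, 0.75], mean=1000, sd=180))
```

As with the Minitab dialog, the inverse option only accepts probabilities strictly between 0 and 1; `inv_cdf` raises an error otherwise.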
19
- Methods for detecting departures of the distribution of a response variable from the normal distribution
- Consider the example of the blood pressure data. From the histogram and the normal curve superimposed on it using Minitab, we can see that blood pressure, generally speaking, follows a normal curve. However, there seem to be a few unusually high blood pressures. The question is: how well does blood pressure follow a normal curve?
- The superimposed normal curve helps us quickly identify serious departures from normality. However, if the departure is not very serious, it is difficult to detect simply by observing the shape of a histogram.
- We will discuss three ways of checking the normality of a response:
  - Superimposing a normal curve on the histogram,
  - Probability plot,
  - Numerical methods for testing the degree of departure from normality.
20
- Superimposing a normal curve on a histogram for the blood pressure data of 1900 young adults between 15 and 20 years old
[Figure: histogram of systolic blood pressure with a normal curve]
The normal curve indicates that there are a few large blood pressure measurements. In fact, the descriptive statistics show that the highest is 210, which is far more than 2 s.d. above the average. This suggests that 210 is very rare. One should check immediately whether it is a typo.
- How to construct this plot using Minitab:
  - Go to Stat, choose Basic Statistics, then choose Display Descriptive Statistics.
  - Enter the variable and click on the Graphs option.
  - The Graphs dialog offers a variety of choices. One of them is Histogram with Normal Curve.
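How unusual a reading of 210 is can be quantified with a z-score. The sketch below assumes the mean of 115 and s.d. of 14 quoted for the blood pressure example earlier in the module. If the sample statistics differ, the numbers change accordingly. `z_score` is an illustrative helper name.

```python
from statistics import NormalDist

def z_score(x: float, m: float, s: float) -> float:
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - m) / s

# Assumed values: m = 115 and s = 14, as quoted for the blood pressure example.
z = z_score(210, m=115, s=14)

# Upper-tail probability under a normal model, computed as P(Z < -z)
# to avoid subtracting two numbers that are both very close to 1.
tail = NormalDist().cdf(-z)

print(f"z = {z:.2f} standard deviations above the mean")
print(f"P(X > 210) under N(115, 14) = {tail:.2e}")
```

A value this many standard deviations from the mean would essentially never occur in 1900 draws from a normal distribution, which is why such a reading should be checked for a recording error first.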
21
- 2. Normal Probability Plot: It is a two-dimensional plot.
  - The Y-axis is the estimated cumulative probabilities, computed by [formula not recovered].
  - The X-axis is the original data in ascending order.
- Diagnosis:
  - When the data follow a normal curve, the plotted points should follow a straight line.
  - When the data are skewed to the right, the plot would look like [figure not recovered].
  - When the data are skewed to the left, the plot would look like [figure not recovered].
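The coordinates of a normal probability plot can be sketched in a few lines of Python. Two assumptions are made here and are not taken from the module: the estimated cumulative probability of the i-th smallest of n values is taken as (i - 0.5)/n (other plotting-position conventions exist, and Minitab's may differ), and straightness is summarized by a correlation coefficient. The data are synthetic.

```python
from math import exp, sqrt
from statistics import NormalDist, mean

def probability_plot_points(data):
    """Return (x, p, z) triples: sorted value, estimated cumulative
    probability, and the standard normal score for that probability."""
    xs = sorted(data)
    n = len(xs)
    std = NormalDist()
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n   # ASSUMED plotting position; conventions vary
        points.append((x, p, std.inv_cdf(p)))
    return points

def straightness(data):
    """Pearson correlation between the sorted data and their normal scores.
    Values near 1 mean the points lie close to a straight line."""
    pts = probability_plot_points(data)
    xs = [x for x, _, _ in pts]
    zs = [z for _, _, z in pts]
    mx, mz = mean(xs), mean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / sqrt(sxx * szz)

# Synthetic illustration data (hypothetical, not from the module):
scores = [NormalDist().inv_cdf((i - 0.5) / 30) for i in range(1, 31)]
bell_shaped = [115 + 14 * z for z in scores]   # exactly normal-shaped
right_skewed = [exp(z) for z in scores]        # long right tail

print(f"bell-shaped:  r = {straightness(bell_shaped):.3f}")
print(f"right-skewed: r = {straightness(right_skewed):.3f}")
```

The correlation is only a rough numerical stand-in for judging the plot by eye. A formal test, such as the one discussed next in the module, is still needed to attach a p-value.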
23
- The normal probability plot indicates that systolic blood pressure does not follow a normal curve. The pattern also shows that the distribution is somewhat skewed to the right.
- 3. Test statistic for testing whether blood pressure follows a normal curve
- Graphical methods are good at showing the pattern, and they give us a fairly clear picture that the data do not follow a normal distribution. Numerically, there are methods that test such a hypothesis. The test statistic is given in the same graph as the normal probability plot.
- The Anderson-Darling normality test is presented here. The AD value is 11.5, and the corresponding p-value is .000.
- Note: the p-value indicates how consistent the data are with a normal distribution. The smaller the p-value, the stronger the evidence that the response variable does not follow a normal curve. A common cut-off point is 5% (.05). In this case, the p-value is .000, so it is clear that the distribution of systolic blood pressure is not normal.
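To show what the AD value measures, here is a minimal sketch of the Anderson-Darling statistic for normality, with the mean and s.d. estimated from the sample. It computes only the statistic and makes no attempt at the p-value that Minitab reports. It also cannot reproduce the value 11.5, since the blood pressure data are not available here. The two data sets are synthetic and hypothetical.

```python
from math import log
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(data):
    """A-squared statistic for testing normality.

    A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
    where x_(1) <= ... <= x_(n) are the sorted data and F is the normal CDF
    with the sample mean and sample standard deviation plugged in.
    """
    xs = sorted(data)
    n = len(xs)
    m, s = mean(xs), stdev(xs)
    std = NormalDist()
    F = [std.cdf((x - m) / s) for x in xs]
    total = sum((2 * i - 1) * (log(F[i - 1]) + log(1.0 - F[n - i]))
                for i in range(1, n + 1))
    return -n - total / n

# Synthetic illustration data (hypothetical, not from the module):
n = 50
probs = [(i - 0.5) / n for i in range(1, n + 1)]
normal_like = [NormalDist(115, 14).inv_cdf(p) for p in probs]  # bell-shaped
right_skewed = [-log(1.0 - p) for p in probs]                  # exponential-shaped

a_norm = anderson_darling_normal(normal_like)
a_skew = anderson_darling_normal(right_skewed)
print(f"bell-shaped data:  A^2 = {a_norm:.3f}")
print(f"right-skewed data: A^2 = {a_skew:.3f}")
```

Larger values of the statistic correspond to larger departures from normality and therefore to smaller p-values, which is the direction of the AD value and p-value reported for the blood pressure data.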
24
- How to construct a normal probability plot and carry out the Anderson-Darling normality test:
  - Go to Stat, choose Basic Statistics, then select Normality Test.
  - In the dialog, enter the variable name.
  - Reference Probabilities allows us to provide a column of cumulative probabilities so that the normal probability plot will show the percentile for each given cumulative probability.
25
- Note: As we have observed, all three methods give similar results. Therefore, based on the 1909 cases, the systolic blood pressure of young adults aged 15 to 20 does not follow a normal distribution.
- Note: Once we find that the distribution is not normal, it is critical to carry out some further analysis:
  - Carefully check the data to see whether there are any typos.
  - Examine the data using descriptive measures or other plots to identify extreme cases (details will be discussed in another module).
- Hands-on Activity
  - Use the above three methods to check the distribution of the Diastolic Blood Pressure data.
26
- Actions to deal with extreme cases
- For observational studies (such as surveys):
  - The sample sizes are usually large, and it is often impossible to find the causes of extreme data after the data are collected. Therefore, it is critical to collect background and environmental variables that may have an impact on the results.
- For experimental studies, such as inter-laboratory testing:
  - It is important to look for possible causes of the extremes. The study is usually conducted under a controlled experimental environment, so it is more likely that we can find, or at least explain, the causes of the extremes.
- Deletion of extremes vs. transformation to normal:
  - One must be careful about deleting extremes, especially when we are not able to find any cause and the values are reasonable within the context of the study.
  - This may be an indication that the distribution of the response is skewed. In such situations, an appropriate approach is to transform the data to be closer to normal.
27
- Method for transforming a variable to normal
- When the data show a skewed distribution, statistical methods such as Analysis of Variance may not be valid. One approach is to apply a mathematical transformation so that the transformed variable is closer to normal.
- Some tips for variable transformation:
  - If a variable Y is skewed to the right, then ln(Y), log10(Y), or sqrt(Y) will be closer to normal. (If there are zeros, first add .5 to each data value.)
  - If a variable Y is skewed to the left, then ln(1/Y), log10(1/Y), or Y^a with a > 1 will be closer to normal.
28
- An example of transformation
- The lifetimes of 50 light bulbs are tested by leaving them on continuously until they burn out. The data are recorded in months. Here are the histograms and normality tests of the raw data, the ln-transformed data, and the square-root-transformed data.
[Figure: histograms of the raw, ln-transformed, and square-root-transformed lifetime data]
The raw data are skewed to the right. The ln transformation does not work well. The square-root transformation works well.
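29
The normal probability plots and Anderson-Darling tests for the lifetime data
[Figure: normal probability plots with Anderson-Darling test results for the raw, ln-transformed, and square-root-transformed data]
As the normal probability plots and the normality test results indicate, sqrt(Y) is approximately normal. The other two are not.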
30
Hands-on Activity
Analyze the distribution of the variable GR36-Lab-Mean-1 in the TAPPI inter-laboratory testing study, and determine an appropriate transformation to make the data closer to a normal distribution.