Normal Distribution

Slides: 30

Provided by: drmarkjk

Transcript and Presenter's Notes

Title: Normal Distribution

1
Normal Distribution

HED 489 Biostatistics

Odds are that you've come across the normal
distribution below. You may know it by its more
common name, the "bell-shaped curve" or "Gaussian
curve" (knowing this is very impressive; use it in
a conversation with your family today and they'll
be impressed). In case you've never seen it before,
that's the normal distribution on the right. We
spend time understanding the normal distribution
for two reasons:
it forms the basis of probability, and
probability forms the basis of parametric
statistical tests;
whether the distribution is normal is one of
the (perhaps the primary) determinants of which
family of statistical tests you will apply.

3
Formation of The Normal Curve

Assume for the moment that the data in the slide
to the right represent ages of a bunch of people.
As you can see, young people are on the left and
older people on the right. Most of the people are
between 9 and 11, right? That's the big bunch in
the middle.

Now, let's say that we draw a line connecting the
top midpoint of each bar. Here's what that would
look like:

See the nice straight lines you get? That's
because, for purposes of this explanation, I've
created the distribution such that those straight
lines would result! Now, how about if we smooth
the lines so they become a nice curve? That's
what this next slide shows.

See how we have the outline of a bell-shaped
curve? This is an approximation of what would
happen with these data. The last thing that
happens when we have a normal distribution is
that the outline becomes filled in with data.
That's what this next slide shows.

So, if we strip away the nonessential information,
we're left with the first slide you saw in this
lesson. Remember what it looked like?

8
Normal Curve Properties

Symmetry: the mean bisects the distribution
exactly, such that the two halves of the
distribution form a mirror image of each other.
Unimodal: one mode.
Standardized characteristics: the standard
deviation of a standardized normal distribution
will always be 1.0, with a mean of 0.
Infinite: theoretically, the "tails" of the
distribution never contact the x-axis. In other
words, the tails are infinite.

9
Characteristics of a Normal Distribution

Bell-shaped: examine the shape of the curve.
Points of inflection: the points at which the curve
changes from concave to convex.
Skewness: the ratio of skewness to its standard
error.
If < -2, then left (negatively) skewed; if > 2,
then right (positively) skewed.
Kurtosis: the ratio of kurtosis to its standard
error.
If < -2, then tails shorter (lighter) than
expected; if > 2, then tails longer (heavier) than
expected.
Central Tendencies: the mean, median, and mode
have the same point estimate.
Percentage of Scores: for example, about 68% of
scores fall symmetrically within ±1 standard
deviation, 95% fall within ±2, and 99.7% fall
within ±3 standard deviations.

This is REALLY important to know. LOOK it over and
KNOW these percentages.

10
Standardized normal distribution, z-scores, and
probability

By standardizing normally distributed scores, one
can better understand and compare scores.
Standardizing is a process through which scores
are transformed into a common scale (in this
case, z-scores).
Probabilities within the normal distribution
can be represented by z-scores, and vice versa.
The standardized normal distribution has a mean
equal to 0 and a standard deviation equal to 1.

11
Moving Towards Probability

As you can see on the slide to the right, we can
plot the location of various values by adding or
subtracting the standard deviation (1.0) to or
from the mean (0). See where positive and
negative standard deviations fall on this
distribution? Typically, we don't go beyond ±3.0
standard deviations, commonly abbreviated "s.d."
We'll talk about why in just a bit.
Remember, if we were measuring age, our raw data
s.d. would be in increments of years. If we were
measuring water consumption, our raw data
increment might be ounces. However, both sets of
data can be transformed into standardized scores
(z-scores).
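To make the "common scale" idea concrete, here is a minimal sketch in Python. The ages and water amounts are made-up illustration values (they are not from the lesson), and `to_z_scores` is just a helper name chosen for this example. It uses the population standard deviation; a sample standard deviation would give slightly different z-scores.

```python
from statistics import mean, pstdev

def to_z_scores(values):
    """Return the z-score of each value: (value - mean) / standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # population s.d. (an assumption made for this sketch)
    return [(v - m) / sd for v in values]

# Hypothetical raw data measured in two different units.
ages_years = [20, 30, 40, 50, 60]
water_ounces = [32, 48, 64, 80, 96]

age_z = to_z_scores(ages_years)
water_z = to_z_scores(water_ounces)

# Once standardized, both sets are on the same scale: mean 0, s.d. 1.
print([round(z, 2) for z in age_z])
print([round(z, 2) for z in water_z])
```

Years and ounces disappear once the scores are standardized, which is what lets us compare the two variables on one curve.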

If you collect the ages of, say, 10,000 people,
and build a frequency distribution, it will
contain all 10,000 people, right? That goes
without saying, doesn't it? Well, the same thing
can be said of the normal distribution. That is,
all of the data, or 100% of it, for a particular
variable, like age, will be contained within the
distribution. For any normal distribution, no
matter what the data are, 99.7% of the data will
be contained in the space from -3 to +3 s.d. Do
you see that in the slide on the right? See all
the blue area between -3 and +3? That's where
most of the data are. See how little blue there
is to the left of -3 and to the right of +3?
There are very few cases there. Combined, in
fact, only 0.3% of the data are greater than +3 or
less than -3.

So, if you had those 10,000 ages on slips of
paper, and you selected one at random, it would
fall within ±3.0 s.d. 99.7% of the time. That is
to say there is a probability of .997 of
selecting an age at random that falls within
±3.0 s.d. Because the largest area under the
curve falls within ±1.0 s.d., it stands to
reason that most of the cases will be contained
in this area, too. As you can see from the slide,
about 68.2% of the cases (34.1% × 2) fall within
±1.0 s.d.
An additional 13.6% of the cases are contained
between 1.0 and 2.0 s.d., and 13.6% more fall
between -1.0 and -2.0 s.d. See that? In total,
95.4% of the cases fall within ±2.0 s.d.
Because there's not much area under the curve
between 2.0 and 3.0 s.d., only about 2.1% of the
cases will fall between 2.0 and 3.0 s.d., and
another 2.1% between -2.0 and -3.0 s.d. Remember
we said earlier that 99.7% of the cases fall
within ±3.0 s.d.? Well, this is how we get to
99.7% (allowing for rounding).
If you didn't follow that, go back and study it
again until you understand it.
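If you'd like to check those areas yourself, here is a short sketch that adds up the bands under the standardized normal curve. It assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper name `area_between` is made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standardized normal: mean 0, s.d. 1

def area_between(z_low, z_high):
    """Area under the standard normal curve between two z-scores."""
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

# One-sided bands, measured from the mean outward.
band_0_to_1 = area_between(0, 1)
band_1_to_2 = area_between(1, 2)
band_2_to_3 = area_between(2, 3)

# Symmetric totals: double the one-sided bands.
within_1 = 2 * band_0_to_1
within_2 = within_1 + 2 * band_1_to_2
within_3 = within_2 + 2 * band_2_to_3

for label, area in [("±1 s.d.", within_1), ("±2 s.d.", within_2), ("±3 s.d.", within_3)]:
    print(f"{label}: {area:.3f}")
```

The three totals come out at roughly .683, .954, and .997, the percentages you need to know.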

14
Taking Another Step

This slide summarizes what you just learned. That
is, about 68% of the cases fall within ±1.0
s.d., about 95% within ±2.0 s.d., and about 99.7%
fall within ±3.0 s.d. Now, pay close attention;
it's going to get tricky. If about 95% of the
cases fall within ±2.0 s.d., then about 5% will
be less than -2.0 or greater than 2.0 s.d. Do you
see that? If all, or 100%, of the cases fall
somewhere along the curve, and you account for
95% of them within ±2.0 s.d., then 5% are left.
About 2.5% of the cases fall to the left of -2.0
s.d. and another 2.5% fall to the right of 2.0 s.d.
Still with me? OK, contemplate this: if you pick
a value at random, there is a probability of .05
that it will be less than -2.0 or greater than
2.0 s.d. Why? Because there's a probability of
.95 that it will be within ±2.0 s.d.

15
Calculating the z-score

Calculating the z-score is relatively simple.
Just follow the formula:
z = (score - mean) / standard deviation

You will need to know the mean and the standard
deviation. Just punch in the score you need and
you'll get the z-score. For example, if you
scored 100 on the test, and the mean is 80 and
the standard deviation is 10, you'll have the
formula as such:
z = (100 - 80) / 10 = 20 / 10 = 2
That is, your score was two standard deviations
above the mean; your score was higher than
97.7% of scores.
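The same test-score example can be sketched in a few lines of Python. The function name `z_score` is chosen for this example, and `statistics.NormalDist` (Python 3.8+) stands in for the printed table when converting z to a proportion.

```python
from statistics import NormalDist

def z_score(score, mean, sd):
    """z = (score - mean) / standard deviation."""
    return (score - mean) / sd

z = z_score(100, 80, 10)

# Proportion of a normal distribution that lies below this z-score.
share_below = NormalDist().cdf(z)

print(f"z = {z:.2f}, proportion of scores below = {share_below:.3f}")
# prints: z = 2.00, proportion of scores below = 0.977
```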

17
The Normal Distribution: What does the z Score
mean?

It's time for you to open your textbook to the
inside back cover (to Table A). Click slide and
it will appear on-screen (it is also shown on the
next slide in this program). This table shows the
area under the curve from zero to any point along
the curve, out to three decimal points, to 4.0
s.d.
This table is also referred to as a table of
"z-scores." See the "z" in the upper left cell?
For a standardized normal distribution, a z-score
is a distance from the mean measured in standard
deviations. More on that in a moment. First, I
want to get you comfortable with this table.
Remember that about 34% of the cases fall between
zero and 1 s.d.? In case you forgot that, just
look at the z-score table, and it will remind
you. Do you see that if we selected a case at
random from a normal distribution of data, the
probability is about .34 that the value of that
case will fall between zero and 1.0 s.d.?

Look down the left column of Table A (normal
curve or z-score table), either in the book or on
the slide at the top, until you reach 1.0. Move
your finger one column to the right. That's the
one labeled ".00". The value in the cell where
your finger is pointing is .3413, right? (next to
the z-score of 1.0) That's where I found "about"
.34, or 34%.
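If you don't have the book handy, a table cell can be recomputed. The sketch below mimics the row-and-column lookup: the row supplies the z-score to one decimal place and the column supplies the second decimal place. The name `table_area` is made up for this example, and it assumes (as described above) that the table holds the area from zero to z.

```python
from statistics import NormalDist

def table_area(row, column):
    """Area from 0 to z, where z = row + column (e.g. 1.0 + .00 = 1.00)."""
    z = row + column
    return NormalDist().cdf(z) - 0.5  # drop the left half of the curve

cell = table_area(1.0, 0.00)
print(f"{cell:.4f}")  # prints: 0.3413

# Rebuild the whole 1.0 row, columns .00 through .09.
row_1_0 = [round(table_area(1.0, c / 100), 4) for c in range(10)]
print(row_1_0)
```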

19
z = (score - mean of scores) / standard deviation

The above formula is used to calculate the
z-score. What happens if you're interested in
some s.d. other than 1.0, 2.0, or 3.0? Here's
where the table really comes in handy! Say you're
interested in the area under the curve (also
known as "probability") for an s.d. (or z-score,
as I want you to begin thinking, as well) of
1.96. How do you get there? Well, run your finger
down the first column until you get to 1.9, and
then run across until you get to .06. What value
did you find? .4750? If so, you did it just
right!
Note that the z-score table just shows half of
the curve, that is, from zero to the right.
That's because the curve is symmetrical, so if
you want to know the area, or probability, for
both halves of the curve (from -z to +z), just
double the tabled value. For example, if you
wanted to know the probability for ±1.96,
multiply .4750 times 2. That would give you
.9500, and you'd say that the probability of
selecting a value at random that would fall
within ±1.96 s.d. would be .95.
Now, I want to give you some practice working
with the normal curve so I know that you've
become comfortable with it.
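Here is the ±1.96 doubling step from above as a minimal sketch, again using `statistics.NormalDist` (Python 3.8+) in place of the printed table. The last line goes the other way, from a probability back to a z-score, using `inv_cdf`.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Tabled value: area from 0 to 1.96 (right half of the curve only).
half_area = std_normal.cdf(1.96) - 0.5

# The curve is symmetrical, so double it to cover -1.96 to +1.96.
central_area = 2 * half_area

# Same answer computed directly from both tails.
direct = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Going backwards: which z-score leaves .025 in the upper tail?
z_for_95 = std_normal.inv_cdf(0.975)

print(f"half: {half_area:.4f}, central: {central_area:.4f}, z: {z_for_95:.2f}")
# prints: half: 0.4750, central: 0.9500, z: 1.96
```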

20
Calculate the percent under the curve between the
mean and a z-score (or s.d.) of -1.39.

To do this, use the z-score table. Run your
finger down the first column until you reach 1.3.
Run your finger along the 1.3 row until you reach
the .09 column. The figure in that cell is the
area under the curve from 0.00 to 1.39 (or, by
symmetry, from 0.00 to -1.39). That number is
.4177 (or 41.77%). This is also the probability
of selecting a value at random and having it fall
between 0.00 and 1.39 (or between 0.00 and -1.39).

The probability of selecting a score between the
mean and -1.39 s.d., or a z-score of -1.39, is .4177.

21
Calculate the probability of selecting a value at
random greater than a z-score (or s.d.) of 2.01.

You should be able to find the area under the
curve, also known as probability, for a value of
2.01. Run down the first column until you get to
2.0. Then across the row to the .01 column.
However, consider the curve at the lower right.
See the figure .4772? That's the area under the
curve for 0.00 to 2.00. What you want is the area
beyond, or to the right of, 2.01. To get that, you
have to subtract the probability for 0.00 to 2.01
(.4778) from all of the area on the right side of
the curve, that is, from 0.00 to infinity. What
is this figure? Can't remember? Remember that
.50 is to the right of the mean and .50 is to the
left. Focus on the right: .50 - .4778 = .0222.
The probability of selecting a value greater than
a z-score of 2.01 is .0222. Look at the z-score
table.
Remember, report probability, not percentage.
Click to get table.
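The subtraction above (.50 minus the tabled area) can be sketched as a small function. `upper_tail` is a name chosen for this example, and `statistics.NormalDist` (Python 3.8+) stands in for the table.

```python
from statistics import NormalDist

def upper_tail(z):
    """Probability of selecting a value greater than z on the standard normal curve."""
    area_zero_to_z = NormalDist().cdf(z) - 0.5  # what the table shows
    return 0.5 - area_zero_to_z                 # what is left on the right side

p = upper_tail(2.01)
print(f"{p:.4f}")  # prints: 0.0222
```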

22
Calculate the probability of randomly selecting a
value that is greater than a z-score of 2.00 or
less than a z-score of -2.00.

This assignment requires you to consider both
sides of the normal curve. What are the chances
of selecting a z-score beyond 2.0 or -2.0?
Like the last assignment, you need to calculate
"what's remaining" under the curve beyond the
areas from 0.00 to -2.00 and 0.00 to 2.00. To do
this, determine the probability from 0.00 to 2.00
(that is .4772), subtract this from .50 to get
.0228, and double that value to get the
probability of obtaining a z-score greater than
2.00 or less than -2.00: in this case, .0456.
Click to get table.
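A both-tails calculation like this one can be sketched as follows. `two_tail` is a name made up for this example. Note that doubling the rounded table value (.0228 × 2) gives .0456, while the unrounded computation below lands a hair lower, at about .0455.

```python
from statistics import NormalDist

def two_tail(z):
    """Probability of a value beyond +|z| or below -|z| on the standard normal curve."""
    one_tail = 1 - NormalDist().cdf(abs(z))  # area to the right of |z|
    return 2 * one_tail                      # symmetry: the left tail is the same size

p_both = two_tail(2.00)
print(f"{p_both:.4f}")  # prints: 0.0455
```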

23
Assignment 1

Calculate the z-score equivalents of these
systolic blood pressure values: 100, 120, 130,
140, and 190, where the mean equals 126.2 and
the standard deviation equals 18.8. Click here
to open an Excel file to do the work.

24
Assignment 2

Calculate the probability of selecting a blood
pressure value at random greater than 160, where
the mean is 126.20 and the standard deviation is
18.80.
To determine this value, you first need to
calculate the equivalent z-score, as you did in
the prior assignment. Then, determine the area
under the curve represented by that z-score.
Finally, calculate the area beyond, or greater
than that value. The resulting value represents
probability. Click here for an Excel file to work
on.
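If you'd rather check your spreadsheet work in code, the three steps above (z-score, area, area beyond) might look like the sketch below. The variable names are chosen for this example; the mean and s.d. are the ones given in the assignment.

```python
from statistics import NormalDist

MEAN_BP = 126.2
SD_BP = 18.8

# Step 1: convert the raw blood pressure value to a z-score.
z = (160 - MEAN_BP) / SD_BP

# Step 2: area under the curve from the mean (z = 0) to that z-score.
area_mean_to_z = NormalDist().cdf(z) - 0.5

# Step 3: area beyond the z-score, i.e. what remains of the right half.
p_greater = 0.5 - area_mean_to_z

print(f"z = {z:.2f}, P(value > 160) = {p_greater:.4f}")
```

Because this uses the unrounded z-score, the result may differ in the third or fourth decimal place from a table lookup at z = 1.80.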

25
Assignment 3

Calculate the probability of selecting a blood
pressure value at random between 110 and 135,
where the mean is 126.20 and the s.d. is 18.80.
Click here for an Excel file to work on.
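One way to approach a "between two values" problem: the two values sit on opposite sides of the mean, so the two areas measured from the mean are added together. The sketch below is one possible check on your work, with names chosen for this example.

```python
from statistics import NormalDist

MEAN_BP = 126.2
SD_BP = 18.8
std_normal = NormalDist()

z_low = (110 - MEAN_BP) / SD_BP    # negative: 110 is below the mean
z_high = (135 - MEAN_BP) / SD_BP   # positive: 135 is above the mean

# Areas from the mean out to each z-score (symmetry handles the negative one).
area_below_mean = std_normal.cdf(abs(z_low)) - 0.5
area_above_mean = std_normal.cdf(z_high) - 0.5

p_between = area_below_mean + area_above_mean
print(f"z_low = {z_low:.2f}, z_high = {z_high:.2f}, P = {p_between:.4f}")
```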

We're about to make the transition from
descriptive to inferential statistics. The heart
of inferential statistics is "statistical
significance." This addresses whether your
calculated value was "real" or happened by
random chance. Here's an example:

Let's say that you're working with a group of
people to get their blood pressure down. One of
the things you do is put them on an exercise
program to reduce their weight and to strengthen
their cardiovascular system. All things being
equal, if your program is successful, their blood
pressure should come down.
If you take individuals' blood pressure at the
beginning of the study, calculate the mean, and
take it again and average it again at the end of
the study, the mean blood pressure at the end
should be lower than at the beginning. However,
how much does blood pressure have to be lowered
to assert with some confidence that our program
was successful? If the beginning group average
was 128 and the ending was 124, is this
difference large enough to claim programmatic
success? What if the ending average was 120? 110?
This is the issue central to statistical
significance: is the difference between the two
values so close to the mean of the normal
distribution (zero) that the probability of
selecting the value at random is too high to
accept as "real"? Or, is it so far from the mean
that it's out in one of the tails of the
distribution, where the chances of pulling it out
of the hat of values randomly are very small, and,
therefore, more likely "real"?
Here are some ways that a difference in mean
values before and after your high blood pressure
program could occur by chance alone: for any
particular group of people, their blood pressure
might go down for reasons other than our exercise
program. Maybe they had a high-salt diet when
they started and cut down on their sodium intake.
Maybe they were experiencing a lot of stress and
they got it under control. Or, maybe their blood
pressure just went down unexpectedly.
Conversely, maybe your program actually worked.
If so, you should be able to select another,
similar group, conduct the same program, and find
a similar difference between starting and
ending blood pressure values. Not exactly the
same, but the difference should fall in the
same general area on the normal curve. If you
conducted this program with 100 similar groups
and found about the same difference, see how
you'd be pretty confident that your program
actually worked? Well, you probably can only run
it once, so you need that difference value to
fall a good long way from the mean in order to
have confidence that your program worked. That's
what statistical significance does for you.

Remember the five blood pressure values that you
worked with for the last assignment. Pretend
that you conducted your study five times, and
those numbers from the assignment represented the
five studies.
Many researchers use a statistical significance
level of .05 as their critical level. This means
that if the value is statistically significant,
it will occur by chance alone less than 5 times
in 100.
Another way of looking at the .05 critical value
is that it has to fall into one of the two tails
of the normal distribution, and not from that big
bunch of scores in the middle. Why? Because there
are lots of scores in the middle, and you have a
very good chance of selecting one of them by
chance alone. So, if the mean difference score is
in the "bulge" of the distribution, the
probability that it happened by chance alone is
too great for you to assume your program worked.

So, on the normal curve, in order for a value to
be statistically significant at the 95% level or
greater, you have to have a z-score of greater
than 1.96 or less than -1.96. That is, the value
has to come from the area to the left of -1.96 or
from the area to the right of 1.96. This is the
only way that you can reduce the odds that the
value you calculated occurred by random chance
alone to less than 5 times in 100.
If the value you calculate is statistically
significant at, let's say, the .05 level, you
write it like this: p < .05. "<" stands for "less
than." If it's not statistically significant, you
write n.s. or ns (meaning that p > .05; ">" stands
for "greater than").
The reason that we almost always consider both
sides of the normal curve is that our
calculated value might be greater or less than
the mean. For the blood pressure example, our
program may have failed so badly that we actually
caused the average ending blood pressure to
increase! We'll talk a bit more about one-tail vs
two-tail tests a little later.
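To tie the .05 level back to the curve, here is a rough sketch of the two-tailed decision described above. The function names and the example z-scores are made up for illustration; the only inputs from the lesson are the .05 level and the idea of checking both tails.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Probability of a z-score at least this far from the mean, in either tail."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def is_significant(z, alpha=0.05):
    """True when the two-tailed probability is below the chosen level."""
    return two_tailed_p(z) < alpha

# Hypothetical z-scores: two from the "bulge", two from the tails.
for z in (0.50, -1.50, 2.50, -2.50):
    p = two_tailed_p(z)
    verdict = "p < .05" if is_significant(z) else "n.s."
    print(f"z = {z:+.2f}  p = {p:.4f}  {verdict}")
```

Values from the middle of the curve come back n.s., while the values out in either tail come back p < .05, whichever side of the mean they fall on.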