Title: Describing Distributions with Numbers
1Describing Distributions with Numbers
2Does education pay?
- Do people with more education earn more?
- This table displays the median incomes for four different educational groups
- The median is the income level for which half the group makes less and half the group makes more; it is the level of income that divides the group into two groups of equal size
3Does education pay?
- These data come from 12,362 adults interviewed as part of the Current Population Survey (CPS), so there must be some variability
- In fact, the highest income reported was $681,928!
- The medians compare the centers of the distributions in each educational category; we now want to get some information about the spread
- This table displays the range covered by the middle half of the incomes in each group
- This provides some information about spread
4Does education pay?
- There's a clear message from these two tables: people with more education make more money
- HOWEVER, this observational study does not demonstrate a cause-and-effect relationship
- Wealthy and/or highly motivated people are more likely both to go to college and to make more money
- Smart people are more likely to get more schooling, but would probably make more money even if they didn't get more schooling
- There are lots of potential lurking variables in this relationship
5Median and quartiles
- Our comparison of incomes demonstrated simple and effective ways of describing the center and spread of a distribution
- The median is the midpoint of the distribution, the value that separates the bottom half of the distribution from the top half
- Quartiles divide the distribution into four ranges, each with one quarter of the observations: half of the observations lie between the first and third quartiles, one quarter lie below the first quartile, half below the median, and three quarters below the third quartile
- Now we need a rule to translate these ideas into numbers
6Example 1 Finding the median
- In the summer of 2001 Barry Bonds hit a record number of home runs for a season: 73
- Here are the numbers of home runs he hit in each season from 1986 to 2004
7Example 1 Finding the median
- To find the median of Barry Bonds's home run counts, first arrange them in increasing order
- Then count off until you have half of the observations
- In our case 37 is the exact middle of the ordered list, with 9 season home run counts before it and 9 after it
- When there is an odd number of observations this is easy
8Example 1 Finding the median
- What happens if there is an even number of observations?
- Let's look at Mark McGwire's 16 years' worth of home run counts
- Now we take the average of the two middle values
as the median
9Median and quartiles
- Fast way to calculate the median in an ordered list
- Count up (n + 1) / 2 places from the beginning
- Must have your list sorted to begin with!
- For Bonds
- n = 19, so (n + 1) / 2 = 10 and the median entry in the list is the 10th entry
- For McGwire
- n = 16, so (n + 1) / 2 = 8.5 and the median is halfway between the 8th and 9th entries on McGwire's list
- This rule is great when n is a very large number
- Note that this rule does not give the median itself, but only the position of the median in the ordered list
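The position rule above can be sketched in a few lines of Python. This is only an illustration: the function name `median_by_position` is made up here, and the season totals are an assumption taken from public season records, because the data tables did not survive in this transcript. They do agree with n = 19 and n = 16 as stated on the slide.

```python
def median_by_position(values):
    """Locate the median with the (n + 1) / 2 position rule."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)            # the rule only works on a sorted list
    n = len(ordered)
    position = (n + 1) / 2              # 1-based position, may end in .5
    if position.is_integer():           # odd n: one exact middle entry
        return ordered[int(position) - 1]
    below = ordered[int(position) - 1]  # e.g. the 8th entry when position is 8.5
    above = ordered[int(position)]      # e.g. the 9th entry
    return (below + above) / 2          # halfway between the two middle entries

# ASSUMPTION: season home run totals from public records, not from the slides.
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
         42, 40, 37, 34, 49, 73, 46, 45, 45]          # 1986 to 2004, n = 19
mcgwire = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39,
           52, 58, 70, 65, 32, 29]                    # 16 seasons, n = 16

print(median_by_position(bonds))    # the 10th entry of the sorted list
print(median_by_position(mcgwire))  # average of the 8th and 9th entries
```

Note that the function sorts a copy first, which is the step the slide warns about.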
10Median and quartiles
11Median and quartiles
- Back to the incomes example that we started with
- The median income for those with a high school diploma is $18,640
- Question: do most people in the high school diploma group have an income right around $18,640, or are there a few with very different incomes?
- The simplest description of a distribution contains measures of both the center and the spread of the distribution
- The median describes the center
- The quartiles are a natural description of the spread
- Here's a rule for calculating the quartiles
12Median and quartiles
13Example 2 Finding the quartiles
- For Bonds's list of home runs
- For McGwire's list of home runs
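The slide stating the quartile rule did not survive in this transcript. The sketch below assumes the usual textbook rule: the first quartile is the median of the observations to the left of the median's position in the ordered list, and the third quartile is the median of those to the right. The helper names and the season totals (from public records) are assumptions for illustration.

```python
def median_of_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """First and third quartiles under the assumed textbook rule."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2                    # entries strictly left of the median position
    lower = ordered[:half]           # left of the median
    upper = ordered[n - half:]       # right of the median (skips the middle entry if n is odd)
    return median_of_sorted(lower), median_of_sorted(upper)

# ASSUMPTION: season home run totals from public records, not from the slides.
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
         42, 40, 37, 34, 49, 73, 46, 45, 45]
mcgwire = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39,
           52, 58, 70, 65, 32, 29]

print(quartiles(bonds))
print(quartiles(mcgwire))
```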
14Median and quartiles
- In practice computer software is usually used to calculate the median and quartiles
- Software may give slightly different results from what we get using our rule
- There are different formulas for determining how to divide up the space between two adjacent entries in an ordered list
- We have chosen the simple average rule, but some software uses different rules
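As one concrete illustration of software using different rules, Python's standard `statistics.quantiles` function offers two interpolation methods. The sketch below runs both on McGwire's counts (season totals assumed from public records, since the data table is not in this transcript). The two methods agree on the median but not on the quartiles.

```python
import statistics

# ASSUMPTION: season home run totals from public records, not from the slides.
mcgwire = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39,
           52, 58, 70, 65, 32, 29]

# Each call returns the three cut points [Q1, median, Q3].
exclusive = statistics.quantiles(mcgwire, n=4, method="exclusive")
inclusive = statistics.quantiles(mcgwire, n=4, method="inclusive")

print("exclusive:", exclusive)
print("inclusive:", inclusive)
```

Both methods interpolate between adjacent entries with unequal weights, so neither first quartile is the simple average of the two neighbouring entries.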
15The five-number summary and boxplots
- To complete our description of a distribution, we add two more numbers: the smallest (minimum) and largest (maximum) values included in the distribution
- These tell us about the tails of the distribution
- Combining all five numbers gives us the five-number summary of a distribution
16The five-number summary and boxplots
17The five-number summary and boxplots
- These five numbers give a reasonably complete description of the center and spread of a distribution
- For Bonds's home runs they are (minimum, first quartile, median, third quartile, maximum)
- 16 25 37 45 73
- For McGwire's home runs they are
- 3 25.5 36 50.5 70
18The five-number summary and boxplots
- The five-number summary leads to a new type of graph: the boxplot
19(Figure: boxplots of Bonds's and McGwire's home run counts; no transcript)
20The five-number summary and boxplots
- When looking at a boxplot, locate the medians first and then look at the spread
- We can see that Bonds's and McGwire's median performances are similar (medians about the same)
- But Bonds's distribution is less spread out than McGwire's (Bonds is more consistent)
- You can draw boxplots either vertically or horizontally
- Be sure to put a scale (label the axis) on the boxplot
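The comparison read off the boxplots can also be checked from the two five-number summaries quoted earlier in the deck. This is a small sketch: the helper name `describe` and the dictionary keys are made up for illustration, and the numbers are the ones given on the slide.

```python
# Five-number summaries as quoted on the slide:
# (minimum, first quartile, median, third quartile, maximum)
bonds_summary = (16, 25, 37, 45, 73)
mcgwire_summary = (3, 25.5, 36, 50.5, 70)

def describe(summary):
    """Center and two simple spread measures from a five-number summary."""
    minimum, q1, med, q3, maximum = summary
    return {
        "median": med,
        "middle_half_width": q3 - q1,   # width of the box in a boxplot
        "range": maximum - minimum,     # distance between the whisker ends
    }

print("Bonds:  ", describe(bonds_summary))
print("McGwire:", describe(mcgwire_summary))
```

The medians differ by one home run, while McGwire's box and overall range are both wider.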
21Example 3 Education and income
- Back to our education and income example from the beginning
- The boxplot on the next page summarizes the distribution of income within each education category
- These data come from 112,362 persons interviewed by the CPS
- This is a slight variation on a boxplot
- Instead of plotting the absolute maximum and minimum values, the boxplot uses the 5% and 95% points in the distribution
- This keeps extreme (and extremely rare) outliers from influencing the boxplot too much
22(Figure: boxplots of income within each education category; no transcript)
23Example 3 Education and income
- The education income boxplot provides a clear picture of how income varies by education
- The median and the middle half move up steadily as education increases
- The bottom 5% point stays about the same because there are some people who have no income in all education groups
- The upper 95% point shoots up rapidly with education
- Boxplots also give an indication of symmetry or skewness
- In left-skewed distributions, the low extreme and first quartile are farther from the median than the high extreme and third quartile
- In right-skewed distributions the opposite is true
24Mean and standard deviation
- The five-number summary is a very robust and useful way to summarize distributions, but it is not the most common
- The mean and standard deviation are perhaps a more common way to summarize a distribution
- In practice both the five-number summary and the mean and standard deviation are used
25Mean and standard deviation
- The mean is familiar to all of you already: it is the ordinary average of the values in the distribution
- The idea of standard deviation is to give the average distance between the mean and the values in the distribution: the average deviation of the values in the distribution from the average of all the values
- The standard deviation is calculated in a slightly obscure way that we will show you but won't worry about too much; we'll let our calculators or spreadsheets or statistics packages calculate the standard deviation for us
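The formula slides did not survive in this transcript. For reference, the standard textbook definitions are sketched below, with observations written as x_1, ..., x_n, the mean as x-bar and the standard deviation as s (this notation is an assumption, not taken from the slides). The n - 1 divisor matches the one used later in the deck.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
```

The squaring step is the "slightly obscure" part: deviations are squared, averaged with divisor n - 1, and the square root brings the result back to the original units.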
26Mean and standard deviation
27Mean and standard deviation
28Example 4 Finding the mean and standard deviation
- To calculate the mean for Bonds's home run numbers
- n = 19
29Example 4 Finding the mean and standard
deviation
- To calculate the standard deviation for Bonds's home run numbers
- n - 1 = 18
30Example 4 Finding the mean and standard
deviation
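The worked arithmetic for this example is not in the transcript. A minimal sketch of the hand calculation follows, assuming the standard definitions (sum divided by n for the mean, squared deviations divided by n - 1 for the variance) and season totals taken from public records rather than from the slides.

```python
import math

# ASSUMPTION: season home run totals from public records, not from the slides.
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
         42, 40, 37, 34, 49, 73, 46, 45, 45]

n = len(bonds)                                        # number of seasons
mean = sum(bonds) / n                                 # ordinary average
squared_deviations = [(x - mean) ** 2 for x in bonds]
variance = sum(squared_deviations) / (n - 1)          # divide by n - 1, not n
sd = math.sqrt(variance)                              # back to home-run units

print(f"n = {n}, n - 1 = {n - 1}")
print(f"mean = {mean:.1f}")
print(f"standard deviation = {sd:.2f}")
```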
31Mean and standard deviation
- In practice you use a calculator or spreadsheet to calculate the mean and the standard deviation
- For various reasons we calculate the average of the squared deviations by dividing by (n - 1) rather than n
- Most calculators give you the option of using n or (n - 1); be sure to use the (n - 1) option
- For our purposes it is more important to know what the mean and standard deviation are and what their properties are than to be able to calculate them by hand
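Statistics packages offer the same choice of divisor as calculators do. In Python's standard library, for instance, `statistics.stdev` divides by n - 1 and `statistics.pstdev` divides by n. The sketch below runs both on Bonds's counts (season totals assumed from public records, not from the slides).

```python
import statistics

# ASSUMPTION: season home run totals from public records, not from the slides.
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33,
         42, 40, 37, 34, 49, 73, 46, 45, 45]

sd_n_minus_1 = statistics.stdev(bonds)   # divides by n - 1 (the option to use here)
sd_n = statistics.pstdev(bonds)          # divides by n

print(f"dividing by n - 1: {sd_n_minus_1:.3f}")
print(f"dividing by n:     {sd_n:.3f}")
```

The n - 1 version is always a little larger, and the gap shrinks as n grows.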
32Mean and standard deviation
33Example 5 Investing 101
- One of the key principles of investment is that taking more risk yields greater average returns over the long run
- The risk associated with an investment is related to how predictable the return on the investment is
- If the return is known exactly, there is no risk
- If the return is unpredictable, there is some risk
- If the return is highly unpredictable, there is a lot of risk
- You could assess investments by looking at the distributions of their yearly returns and asking about both the center and spread of these distributions
34Example 5 Investing 101
- Return distributions with high centers give bigger average returns
- Return distributions with bigger standard deviations are riskier: on average, their returns are harder to predict
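To make the center-versus-spread comparison concrete, here is a small sketch with two invented investments. The yearly returns are hypothetical numbers made up purely for illustration; they are not from the slides or from any real fund.

```python
import statistics

# HYPOTHETICAL yearly returns in percent, invented for illustration only.
steady_fund = [4.0, 5.0, 4.5, 5.5, 5.0, 4.0]
volatile_fund = [25.0, -12.0, 18.0, -6.0, 30.0, -7.0]

for name, returns in (("steady", steady_fund), ("volatile", volatile_fund)):
    center = statistics.mean(returns)    # average yearly return
    spread = statistics.stdev(returns)   # standard deviation (n - 1 divisor)
    print(f"{name:8s} mean = {center:5.2f}%   sd = {spread:5.2f}%")
```

In this made-up case the volatile investment has both the higher mean and the much larger standard deviation, which is the trade-off the slide describes.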
35Choosing numerical descriptions
- How do we choose a numerical description for a distribution?
- The five-number summary is the best short description for most distributions
- The mean and standard deviation are harder to understand and calculate, BUT they are more common
- How do the mean and median compare?
- They are both reasonable ideas for describing the center of a distribution
- The main difference is that the mean is strongly influenced by extreme observations; the median is not
36Example 6 Mean versus median
37Example 6 Mean versus median
- The mean of the LA Lakers' 2003 salaries is $4.4 million, and the median is $1.5 million
- Why the big difference between the mean and median?
- The stemplot shows that the distribution is highly right-skewed
- Shaquille O'Neal and Kobe Bryant make MUCH more than the other members of the team
- We can make the mean as big as we want by paying Shaq more and more, while the median stays the same
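The effect of one extreme salary can be sketched as follows. The salary list is hypothetical, invented for illustration (it is not the Lakers' actual payroll, which is not in this transcript), and the helper `with_top_salary` is a made-up name.

```python
import statistics

# HYPOTHETICAL salaries in millions of dollars, invented for illustration only.
salaries = [0.5, 0.8, 1.0, 1.2, 1.5, 1.5, 2.0, 3.0, 5.0, 14.0, 27.0]

def with_top_salary(values, new_top):
    """Copy of the list with the largest salary replaced by new_top."""
    ordered = sorted(values)
    return ordered[:-1] + [new_top]

for top in (27.0, 50.0, 100.0):
    team = with_top_salary(salaries, top)
    print(f"top salary {top:6.1f}: "
          f"mean = {statistics.mean(team):5.2f}, "
          f"median = {statistics.median(team):4.2f}")
```

Raising only the top salary pulls the mean upward each time, while the middle entry of the ordered list, and so the median, does not move.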
38Choosing numerical descriptions
- The mean and median of a symmetric distribution are close to each other
- In skewed distributions, the mean runs away from the median toward the long tail
- However, you have to think about more than just symmetry and skewness when choosing a descriptor for a distribution
- The total number of observations times the mean gives the overall total of the distribution, which is useful sometimes
- For example, the average price of a house in a given town times the number of houses in the town is the total value of the housing stock in the town
39Choosing numerical descriptions
- The standard deviation is even more influenced by extreme values than the mean
- Quartiles are much less influenced by extreme values
- With skewed distributions, the two sides of the distribution have different spreads
- This makes it impossible for a single number like the standard deviation to do a good job of describing the spread of a skewed distribution
- The five-number summary is a much better description for skewed distributions
40Choosing numerical descriptions
41Choosing numerical descriptions
- Why bother with the mean and standard deviation?
- Because they are the natural and correct way to describe the very important normal distribution that we will meet next time
- Remember that a graph is the absolute best description of a distribution
- Numerical descriptions are summaries of a distribution that lack the detail that you can see in a graph
- Always start with a graph
42Summary
- To describe a distribution, start with a graph
- If you have a quantitative variable, start with a histogram or stemplot
- Then add numbers to describe the center and spread of the distribution
- There are two common descriptors of the center and spread
- The five-number summary:
- Median
- The two quartiles that define the middle half of the distribution
- The smallest and largest observations to describe spread
43Summary
- The mean and standard deviation
- The mean is the average of the observations
- The standard deviation is a measure of the spread as a kind of average distance from the mean
- The mean and standard deviation can be changed a lot by extreme values
- The mean and median are close to each other for symmetric distributions
- In general use the five-number summary to describe most distributions and the mean and standard deviation only for symmetric distributions