Describing Distributions with Numbers - PowerPoint PPT Presentation

Institute of Behavioral Science, University of Colorado at Boulder ... The mean of the LA Lakers' 2003 salaries is $4.4 million, and the median is $1.5 ... – PowerPoint PPT presentation

Slides: 44
Provided by: samuel59

1
Describing Distributions with Numbers
  • Chapter 12

2
Does education pay?
  • Do people with more education earn more?
  • This table displays the median incomes for four
    different educational groups
  • The median is the income level for which half the
    group makes less and half the group makes more;
    it is the level of income that divides the group
    into two groups of equal size

3
Does education pay?
  • These data come from 12,362 adults interviewed as
    part of the CPS, so there must be some
    variability
  • In fact, the highest income reported was $681,928!
  • The medians compare the centers of the
    distributions in each educational category; we
    now want to get some information about the spread
  • This table displays the range covered by the
    middle half of the incomes in each group
  • Provides some information about spread

4
Does education pay?
  • There's a clear message from these two tables:
    people with more education make more money
  • HOWEVER, this observational study does not
    demonstrate a cause-and-effect relationship
  • Wealthy and/or highly motivated people are more
    likely both to go to college and to make more
    money
  • Smart people are more likely to get more school,
    but would probably make more money even if they
    didn't get more school
  • There are lots of potential lurking variables in
    this relationship

5
Median and quartiles
  • Our comparison of incomes demonstrated simple and
    effective ways of describing the center and
    spread of a distribution
  • The median is the midpoint of the distribution,
    the value that separates the bottom half of the
    distribution from the top half
  • Quartiles divide the distribution into four
    ranges, each with one quarter of the
    observations: half of the observations lie
    between the first and third quartiles, one
    quarter lie below the first quartile, half below
    the median, and three quarters below the third
    quartile
  • Now we need a rule to translate these ideas into
    numbers

6
Example 1: Finding the median
  • In the summer of 2001 Barry Bonds hit a record
    number of home runs for a season: 73
  • Here is the number of home runs he hit in each
    season from 1986 to 2004

7
Example 1: Finding the median
  • To find the median of Barry Bonds's home run
    counts, first arrange them in increasing order
  • Then count off until you have half of the
    observations
  • In our case, 37 is the exact middle, with 9 counts
    before it and 9 after it in the ordered list
  • When there is an odd number of observations this
    is easy
  • The median is M = 37

8
Example 1: Finding the median
  • What happens if there is an even number of
    observations?
  • Let's look at Mark McGwire's 16 years' worth of
    home run counts
  • Now we take the average of the two middle values
    as the median

9
Median and quartiles
  • Fast way to calculate the median in an ordered
    list
  • Count up (n + 1) / 2 places from the beginning
  • Must have your list sorted to begin with!
  • For Bonds
  • n = 19, so (n + 1) / 2 = 10 and the median entry
    in the list is the 10th entry
  • For McGwire
  • n = 16, so (n + 1) / 2 = 8.5 and the median is
    halfway between the 8th and 9th entries on
    McGwire's list
  • This rule is great when n is a very large number
  • Note that this rule does not give the median
    itself, but only the position of the median in
    the ordered list
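
The counting rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name is made up here, and the per-season home run counts are an assumption, since the slide's table did not survive in the transcript. The lists below were chosen because they reproduce the medians and five-number summaries quoted in the slides.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 counting rule (illustrative helper)."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)              # the rule only works on a sorted list
    n = len(ordered)
    position = (n + 1) / 2                # 1-based position; ends in .5 when n is even
    if position.is_integer():
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]    # entry just below the half position
    upper = ordered[int(position)]        # entry just above it
    return (lower + upper) / 2

# Assumed season home run counts (not present in the transcript).
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45]
mcgwire = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29]

print(median_by_position(bonds))      # 37: the 10th of 19 entries
print(median_by_position(mcgwire))    # 36.0: halfway between the 8th and 9th of 16
```

Note that the rule yields a position, and the code then looks up (or averages) the entries at that position, which mirrors the distinction the last bullet draws.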

10
Median and quartiles
11
Median and quartiles
  • Back to the incomes example that we started with
  • The median income for those with a high school
    diploma is $18,640
  • Question: do most people in the high school
    diploma group have an income right around
    $18,640, or are there a few with very different
    incomes?
  • The simplest description of a distribution
    contains measures of both the center and the
    spread of the distribution
  • The median describes the center
  • The quartiles are a natural description of the
    spread
  • Here's a rule for calculating the quartiles

12
Median and quartiles
13
Example 2: Finding the quartiles
  • For Bonds's list of home runs
  • For McGwire's list of home runs
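
The slide that stated the quartile rule did not survive in the transcript, so the sketch below assumes the usual "median of each half" rule: the first quartile is the median of the observations below the median's position, and the third quartile is the median of those above it. The function name and the home run lists are likewise assumptions; the lists were chosen because they reproduce the quartiles quoted later in the slides.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two observations. When n is odd, the middle
    observation belongs to neither half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]       # entries before the median's position
    upper = ordered[-half:]      # entries after the median's position
    return median(lower), median(upper)

# Assumed season home run counts (not present in the transcript).
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45]
mcgwire = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29]

print(quartiles(bonds))      # (25, 45)
print(quartiles(mcgwire))    # (25.5, 50.5)
```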

14
Median and quartiles
  • In practice computer software is usually used to
    calculate the median and quartiles
  • Software may give slightly different results from
    what we get using our rule
  • There are different formulas for determining how
    to divide up the space between two adjacent
    entries in an ordered list
  • We have chosen the simple average rule, but some
    software uses different rules
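
As one concrete case of software disagreeing with the hand rule, Python's standard `statistics.quantiles` function (Python 3.8 and later) offers two interpolation methods, and neither matches the simple average rule for an even-sized list. The McGwire counts below are an assumption, since the slide's table is not in the transcript.

```python
from statistics import quantiles

# Assumed season home run counts for McGwire (not present in the transcript).
mcgwire = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29]

# Each call returns the three cut points [Q1, median, Q3].
exclusive = quantiles(mcgwire, n=4, method="exclusive")
inclusive = quantiles(mcgwire, n=4, method="inclusive")

print(exclusive)   # [23.75, 36.0, 51.25]
print(inclusive)   # [27.25, 36.0, 49.75]
```

All three approaches agree on the median of 36, but the quartiles differ from the 25.5 and 50.5 that the slides report, which is the kind of small discrepancy this slide warns about.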

15
The five-number summary and boxplots
  • To complete our description of a distribution, we
    add two more numbers: the smallest (minimum) and
    largest (maximum) values included in the
    distribution
  • These tell us about the tails of the
    distribution
  • Combining all five numbers gives us the
    five-number summary of a distribution

16
The five-number summary and boxplots
17
The five-number summary and boxplots
  • These five numbers give a reasonably complete
    description of the center and spread of a
    distribution
  • For Bonds's home runs they are
  • 16 25 37 45 73
  • For McGwire's home runs they are
  • 3 25.5 36 50.5 70
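
A compact way to produce these five numbers is sketched below. It assumes the "median of each half" quartile rule, and the home run lists are assumptions as well (the transcript does not contain them); they are used because they reproduce the two summaries above.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); needs two or more values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return (
        ordered[0],
        median(ordered[:half]),     # Q1: median of the lower half
        median(ordered),
        median(ordered[-half:]),    # Q3: median of the upper half
        ordered[-1],
    )

# Assumed season home run counts (not present in the transcript).
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45]
mcgwire = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29]

print(five_number_summary(bonds))      # (16, 25, 37, 45, 73)
print(five_number_summary(mcgwire))    # (3, 25.5, 36.0, 50.5, 70)
```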

18
The five-number summary and boxplots
  • The five-number summary leads to a new type of
    graph: the boxplot

19
(No Transcript)
20
The five-number summary and boxplots
  • When looking at a boxplot, locate the medians
    first and then look at the spread
  • We can see that Bonds's and McGwire's median
    performances are similar (medians about the same)
  • But Bonds's distribution is less spread out than
    McGwire's (Bonds is more consistent)
  • You can draw boxplots either vertically or
    horizontally
  • Be sure to put a scale (label the axis) on the
    boxplot

21
Example 3: Education and income
  • Back to our education and income example from the
    beginning
  • The boxplot on the next page summarizes the
    distribution of income within each education
    category
  • These data come from 112,362 persons interviewed
    by the CPS
  • This is a slight variation on a boxplot
  • Instead of plotting the absolute maximum and
    minimum values, the boxplot uses the 5% and 95%
    points in the distribution
  • This keeps extreme (and extremely rare)
    outliers from influencing the boxplot too much
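
The modified boxplot described here can be sketched by swapping the minimum and maximum for the 5% and 95% points. The income values below are invented for illustration (evenly spaced incomes plus one extreme value), the function name is hypothetical, and `statistics.quantiles` uses its own interpolation rule, so its quartiles need not match the hand rule from earlier slides.

```python
from statistics import quantiles

def trimmed_box_summary(values):
    """Return (5% point, Q1, median, Q3, 95% point) for a modified boxplot."""
    twentieths = quantiles(values, n=20)    # 19 cut points at 5%, 10%, ..., 95%
    q1, med, q3 = quantiles(values, n=4)
    return twentieths[0], q1, med, q3, twentieths[-1]

# Hypothetical incomes in dollars: 99 evenly spaced values plus one extreme value.
incomes = [1_000 * k for k in range(99)] + [681_928]

low, q1, med, q3, high = trimmed_box_summary(incomes)
print(low, q1, med, q3, high)
print(max(incomes))    # the extreme value sits far above the 95% point
```

With these made-up numbers the upper whisker ends below 100,000 even though the largest value is 681,928, so one rare outlier no longer stretches the plot.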

22
(No Transcript)
23
Example 3: Education and income
  • The education-income boxplot provides a clear
    picture of how income varies by education
  • The median and the middle half move up steadily
    as education increases
  • The bottom 5% point stays about the same because
    there are some people who have no income in all
    education groups
  • The upper 95% point shoots up rapidly with education
  • Boxplots also give an indication of symmetry or
    skewness
  • In left-skewed distributions, the low extreme and
    first quartile are farther from the median than
    the high extreme and third quartile
  • In right-skewed distributions the opposite is true

24
Mean and standard deviation
  • The five-number summary is a very robust and
    useful way to summarize distributions, but it is
    not the most common
  • The mean and standard deviation are perhaps a
    more common way to summarize a distribution
  • In practice both the five-number summary and the
    mean and standard deviation are used

25
Mean and standard deviation
  • The mean is familiar to all of you already: it
    is the ordinary average of the values in the
    distribution
  • The idea of standard deviation is to give the
    average distance between the mean and the values
    in the distribution: the average deviation of the
    values in the distribution from the average of
    all the values
  • The standard deviation is calculated in a
    slightly obscure way that we will show you but
    won't worry about too much; we'll let our
    calculators or spreadsheets or statistics
    packages calculate the standard deviation for us
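
The formula slides themselves did not survive in the transcript. For reference, the standard textbook definitions of the mean and the sample standard deviation, which agree with the (n - 1) divisor these slides use, are shown below; the notation (x-bar, s, x_i) is the conventional one and is assumed rather than taken from the slides.

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
```

The squaring and the square root are the "slightly obscure" part: deviations are squared so that negative and positive ones do not cancel, and the root returns the result to the original units.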

26
Mean and standard deviation
27
Mean and standard deviation
28
Example 4: Finding the mean and standard deviation
  • To calculate the mean for Bonds's home run numbers
  • n = 19

29
Example 4: Finding the mean and standard deviation
  • To calculate the standard deviation for Bonds's
    home run numbers
  • (n - 1) = 18

30
Example 4: Finding the mean and standard deviation
31
Mean and standard deviation
  • In practice you use a calculator or spreadsheet
    to calculate the mean and the standard deviation
  • For various reasons we calculate the average of
    the squared deviations by dividing by (n - 1)
    rather than n
  • Most calculators give you the option of using n
    or (n - 1); be sure to use the (n - 1) option
  • For our purposes it is more important to know
    what the mean and standard deviation are and what
    their properties are rather than to be able to
    calculate them by hand
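
For readers who want to see the n versus (n - 1) choice in software, here is a small Python sketch. In the standard `statistics` module, `stdev` divides by (n - 1) and `pstdev` divides by n. The home run counts are an assumption (the slide's table is not in the transcript), chosen because they match n = 19 and the summaries quoted elsewhere in the slides.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Assumed season home run counts for Bonds (not present in the transcript).
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45]

n = len(bonds)                                            # 19 observations
x_bar = sum(bonds) / n                                    # the ordinary average
squared_deviations = [(x - x_bar) ** 2 for x in bonds]
s = sqrt(sum(squared_deviations) / (n - 1))               # divide by n - 1 = 18

print(x_bar, round(s, 2))                     # by hand
print(mean(bonds), round(stdev(bonds), 2))    # library, (n - 1) divisor
print(round(pstdev(bonds), 2))                # library, n divisor: slightly smaller
```

The hand calculation and `stdev` agree; `pstdev` corresponds to the "n" option on a calculator and is the one these slides say to avoid.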

32
Mean and standard deviation
33
Example 5: Investing 101
  • One of the key principles of investment is that
    taking more risk yields greater average returns
    over the long run
  • The risk associated with an investment is
    related to how predictable the return on the
    investment is
  • If the return is known exactly there is no risk
  • If the return is unpredictable, there is some
    risk
  • If the return is highly unpredictable, there is a
    lot of risk
  • You could assess investments by looking at the
    distributions of their yearly returns and asking
    about both the center and spread of these
    distributions

34
Example 5: Investing 101
  • Return distributions with high centers give
    bigger average returns
  • Return distributions with bigger standard
    deviations are riskier: on average, harder to
    predict
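
To make the comparison concrete, the sketch below summarizes two return series by center (mean) and spread (standard deviation). Both series are invented for illustration and do not come from the slides or from any real investment.

```python
from statistics import mean, stdev

# Hypothetical yearly returns in percent; invented for illustration only.
steady_fund = [3, 4, 5, 4, 4]
volatile_fund = [-10, 25, 5, 30, 0]

for name, returns in [("steady", steady_fund), ("volatile", volatile_fund)]:
    print(f"{name}: mean = {mean(returns):.1f}%, s = {stdev(returns):.1f}%")
# steady: mean = 4.0%, s = 0.7%
# volatile: mean = 10.0%, s = 17.0%
```

In this made-up case the volatile series has the higher average return and the much larger standard deviation, which is the trade-off the slide describes.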

35
Choosing numerical descriptions
  • How do we choose a numerical description for a
    distribution?
  • The five-number summary is the best short
    description for most distributions
  • The mean and standard deviation are harder to
    understand and calculate, BUT they are more
    common
  • How do the mean and median compare?
  • They are both reasonable ideas for describing the
    center of a distribution
  • The main difference is that the mean is strongly
    influenced by extreme observations; the median is
    not

36
Example 6: Mean versus median
37
Example 6: Mean versus median
  • The mean of the LA Lakers' 2003 salaries is $4.4
    million, and the median is $1.5 million
  • Why the big difference between the mean and
    median?
  • The stemplot shows that the distribution is
    highly right-skewed
  • Shaquille O'Neal and Kobe Bryant make MUCH more
    than the other members of the team
  • We can make the mean as big as we want by paying
    Shaq more and more
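
The last point is easy to check numerically. The payroll below is hypothetical (it is not the actual Lakers salary list, which the transcript does not include), and the helper name is made up; the sketch only shows that raising the top salary moves the mean and leaves the median alone.

```python
from statistics import mean, median

# Hypothetical team payroll in millions of dollars (not the real Lakers figures).
salaries = [0.6, 0.8, 1.0, 1.5, 2.0, 13.5, 24.0]

def with_top_salary(values, new_top):
    """Return a copy of the payroll with the largest salary replaced."""
    ordered = sorted(values)
    return ordered[:-1] + [new_top]

for top in (24.0, 50.0, 100.0):
    payroll = with_top_salary(salaries, top)
    print(f"top = {top:6.1f}  mean = {mean(payroll):5.2f}  median = {median(payroll):.1f}")
```

Every row prints the same median of 1.5, while the mean grows with the top salary.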

38
Choosing numerical descriptions
  • The mean and median of a symmetric distribution
    are close to each other
  • In skewed distributions, the mean runs away from
    the median toward the long tail
  • However, you have to think about more than just
    symmetry and skewness when choosing a descriptor
    for a distribution
  • The total number of observations times the mean
    gives the overall total of the distribution,
    which is useful sometimes
  • For example, the average price of a house in a
    given town times the number of houses in the town
    is the total value of the housing stock in the
    town

39
Choosing numerical descriptions
  • The standard deviation is even more influenced by
    extreme values than the mean
  • Quartiles are much less influenced by extreme
    values
  • With skewed distributions, the two sides of the
    distribution have different spreads
  • This makes it impossible for a single number like
    the standard deviation to do a good job of
    describing the spread of a skewed distribution
  • The five-number summary is a much better
    description for skewed distributions
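
A quick numerical check of this contrast: drop the single most extreme value from a list and see how much the standard deviation and the distance between the quartiles each change. The home run counts are an assumption (the transcript does not include them), and the quartiles here come from Python's `statistics.quantiles`, whose interpolation rule can differ from the hand rule in the slides.

```python
from statistics import quantiles, stdev

# Assumed season home run counts for Bonds (not present in the transcript).
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45]
bonds_without_record = [x for x in bonds if x != 73]    # drop the 73-homer season

def quartile_spread(values):
    """Distance between the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

print(round(stdev(bonds), 2), round(stdev(bonds_without_record), 2))
print(quartile_spread(bonds), quartile_spread(bonds_without_record))
```

With these numbers the standard deviation falls from about 13.0 to about 9.9 when one season is removed, while the quartile spread stays at 20.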

40
Choosing numerical descriptions
41
Choosing numerical descriptions
  • Why bother with the mean and standard deviation?
  • Because they are the natural and correct way to
    describe the very important normal distribution
    that we will meet next time
  • Remember that a graph is the absolute best
    description of a distribution
  • Numerical descriptions are summaries of a
    distribution that lack the detail that you can
    see in a graph
  • Always start with a graph

42
Summary
  • To describe a distribution, start with a graph
  • If you have a quantitative variable start with a
    histogram or stemplot
  • Then add numbers to describe the center and
    spread of the distribution
  • There are two common descriptors of the center
    and spread
  • The five-number summary
  • Median
  • The two quartiles that define the middle half of
    the distribution
  • The smallest and largest observations to describe
    spread

43
Summary
  • The mean and standard deviation
  • The mean is the average of the observations
  • The standard deviation is a measure of the spread
    as a kind of average distance from the mean
  • The mean and standard deviation can be changed a
    lot by extreme values
  • The mean and median are close to each other for
    symmetric distributions
  • In general, use the five-number summary to
    describe most distributions, and the mean and
    standard deviation only for symmetric
    distributions