Displaying Quantitative Data center, variability, shape, outliers

About This Presentation
Slides: 42
Provided by: scottreeve

Transcript and Presenter's Notes

1
Displaying Quantitative Data (center,
variability, shape, outliers)
  • Graphical (visual) displays for quantitative data
  • dot plots
  • histograms
  • stem plots
  • scatterplots

2
  • Some of the concepts we will discuss here are
    deliberately vague. We'll get more precise later
    when we discuss numerical descriptive techniques.
  • Numerical descriptive techniques (later)
  • center: mean, median
  • variability: range, standard deviation
  • resistant measures: five-number summary
  • boxplots (a graphical display of the five-number
    summary)

3
What strikes you as the most distinctive
difference among the distributions of scores in
classes A, B, C?
4
  • The center of a distribution is usually the most
    important aspect to notice and describe.
  • The center of a distribution might represent a
    typical value.
  • For now, we can describe the center of a
    distribution by the value with roughly half of
    the observations taking smaller values and half
    taking larger values.

5
What strikes you as the most distinctive
difference among the distributions of scores in
classes D, E, F?
6
  • A distribution's variability is a second
    important feature.
  • For now, we can describe the spread of a
    distribution by giving the smallest and largest
    values.
  • When describing the variability or spread of a
    distribution, we may wish to leave off outliers.
    An individual value that falls outside the
    overall pattern is an outlier.

7
What strikes you as the most distinctive
difference among the distributions of scores in
classes G, H, I?
8
  • The shape of a distribution can reveal much
    information. Distributions come in a limitless
    variety of shapes, but certain shapes arise often
    enough to have their own names.
  • Symmetric
  • one half is roughly a mirror image of the other
  • Right skewed
  • the distribution tails off toward larger values
  • Left skewed
  • the distribution tails off toward smaller values

9
  • Dr. Albert Barnes was a wealthy art collector who
    accumulated a large number of impressionist
    masterpieces; the total exceeds 800 paintings.
    When Dr. Barnes died in 1951, he stated in his
    will that his collection was not to be allowed to
    tour. However, because of the deterioration of
    the exhibit's home near Philadelphia, a judge
    ruled that the collection could go on tour to
    raise enough money to renovate the building.
  • Because of the size and value of the collection,
    it was predicted (correctly) that in each city a
    large number of people would come to view the
    paintings. Because space was limited, most
    galleries had to sell tickets that were valid
    only at one specific time. To judge how many
    people to let in at a time, it was necessary to
    know the length of time people would spend at the
    exhibit: longer times would dictate smaller
    audiences, while shorter times would allow for
    the sale of more tickets. Suppose that in one
    city the amount of time taken to view the
    complete exhibit by each of 400 people was
    measured and recorded.

10
113, 49, 62, 44, 42, 32, 43, 46, 54, 98, 64, 34,
61, 60, 52, 57, 42, 46, 69, 70, 36, 43, 30, 54,
47, 38, 55, 40, 70, 55, 70, 44, 29, 40, 48, 66,
54, 59, 94, 64, 45, 38, 70, 62, 48, 57, 84, 50,
27, 38, 55, 51, 62, 34, 54, 47, 58, 45, 61, 61,
47, 64, 34, 107, 70, 30, 61, 65, 44, 54, 62, 74,
41, 30, 88, 58, 59, 43, 63, 33, 51, 58, 48, 33,
36, 52, 29, 34, 66, 50, 45, 44, 47, 41, 39, 38,
106, 49, 35, 46, 42, 31, 41, 98, 40, 48, 42, 25,
33, 29, 66, 39, 30, 47, 43, 35, 30, 59, 45, 41,
31, 47, 26, 53, 40, 23, 79, 28, 78, 74, 42, 52,
53, 46, 40, 50, 90, 50, 37, 45, 89, 39, 60, 44,
36, 57, 47, 78, 48, 37, 55, 44, 54, 59, 70, 60,
34, 32, 35, 48, 52, 53, 151, 43, 112, 44, 39, 53,
41, 70, 72, 32, 71, 63, 65, 49, 31, 32, 83, 37,
40, 64, 47, 38, 32, 49, 33, 78, 50, 35, 28, 39,
54, 41, 82, 32, 42, 43, 43, 57, 45, 88, 66, 53,
57, 46, 61, 53, 90, 28, 41, 74, 31, 107, 45, 50,
72, 75, 30, 54, 65, 73, 45, 58, 48, 62, 60, 92,
50, 43, 70, 33, 29, 40, 91, 49, 56, 39, 35, 24,
52, 41, 31, 63, 44, 57, 50, 42, 41, 27, 44, 46,
64, 39, 71, 42, 30, 109, 66, 41, 32, 51, 41, 56,
38, 80, 54, 60, 41, 33, 134, 71, 33, 63, 45, 63,
57, 64, 91, 91, 28, 98, 27, 102, 8, 44, 53, 71,
42, 31, 46, 55, 67, 41, 40, 67, 48, 70, 40, 71,
28, 29, 40, 35, 58, 64, 33, 50, 82, 53, 33, 54,
85, 77, 67, 38, 28, 63, 45, 48, 34, 63, 42, 88,
42, 36, 36, 33, 52, 104, 68, 48, 85, 29, 51, 49,
60, 47, 63, 62, 82, 60, 50, 28, 78, 42, 121, 49,
125, 57, 93, 32, 52, 32, 44, 41, 38, 45, 36, 43,
29, 85, 51, 42, 73, 44, 79, 28, 70, 42, 45, 64,
38, 54, 41, 56, 46, 45, 28, 70, 47, 41, 35, 62,
33, 40, 35, 43, 81, 45, 43, 68, 58, 90, 63, 39,
44, 27, 46, 36
11
If you examine the 400 observations, you acquire
very little information. You may discover that
the smallest number is 23 and the largest number
is 151, but you will have learned very little
about how the numbers are distributed between
these two extremes. A histogram will help
describe how the data are distributed.
12
(No Transcript)
13
Guidelines for Selecting the Class Intervals
  Number of Observations    Number of Classes
  Less than 50              5-7
  50-200                    7-9
  200-500                   9-10
  500-1,000                 10-11
  1,000-5,000               11-13
  5,000-50,000              13-17
  More than 50,000          17-20
  • (rounded to some convenient value)
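
One way the guideline table might be applied is sketched below in Python. The function names, the treatment of boundary values such as n = 200 (which appear in two rows of the table), and the class start and width used in the demo are assumptions made for illustration. The demo tallies only the first 12 viewing times from the data listing, not all 400.

```python
# Upper limit on the number of observations, paired with the suggested
# range for the number of classes. Boundary values are assigned to the
# lower row of the table (an assumption; the table itself is ambiguous).
GUIDELINES = [
    (49, (5, 7)),
    (200, (7, 9)),
    (500, (9, 10)),
    (1000, (10, 11)),
    (5000, (11, 13)),
    (50000, (13, 17)),
]


def suggested_classes(n):
    """Return the (fewest, most) classes suggested for n observations."""
    for upper_limit, classes in GUIDELINES:
        if n <= upper_limit:
            return classes
    return (17, 20)


def tally(data, start, width, n_classes):
    """Count the observations falling in each class [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for x in data:
        k = (x - start) // width
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[k] += 1
    return counts


# First 12 viewing times from the listing (illustrative subset only).
sample = [113, 49, 62, 44, 42, 32, 43, 46, 54, 98, 64, 34]
counts = tally(sample, start=20, width=20, n_classes=5)

print(suggested_classes(400))  # guideline for the full data set
for k, count in enumerate(counts):
    low = 20 + 20 * k
    print(f"{low:3d} to under {low + 20:3d}: {'*' * count}")
```

Printing a row of asterisks per class gives a rough text histogram, which is enough to see where the observations pile up.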

14
Stemplots
  • A stem plot is another way to display the
    distribution of data.
  • Separate each observation into a stem, consisting
    of all but the final (rightmost) digit, and a
    leaf, the final digit. Stems may have as many
    digits as needed, but each leaf contains only a
    single digit.
  • Write the stems in a vertical column with the
    smallest at the top, and draw a vertical line at
    the right of this column.
  • Write each leaf in the row to the right of its
    stem, in increasing order out from the stem.
  • A stem plot looks like a histogram turned on end.

15
Mark McGwire vs. Babe Ruth
  • Mark McGwire's home run counts, 1987-2001
  • 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70,
    65, 32, 29
  • Babe Ruth's home run counts, 1920-1934
  • 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46,
    41, 34, 22
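
The stem plot steps described earlier can be sketched in Python for these two players. The function name `stemplot` is hypothetical, and the sketch assumes non-negative whole numbers, with the tens as the stem and the ones digit as the leaf.

```python
def stemplot(values):
    """Return the rows of a stem plot as strings, smallest stem first."""
    leaves_by_stem = {}
    for value in sorted(values):          # sorting puts leaves in increasing order
        stem, leaf = divmod(value, 10)    # all but the final digit, and the final digit
        leaves_by_stem.setdefault(stem, []).append(leaf)

    rows = []
    # Include stems with no leaves so that gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


mcgwire = [49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29]
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

mcgwire_rows = stemplot(mcgwire)
ruth_rows = stemplot(ruth)

print("McGwire")
print("\n".join(mcgwire_rows))
print("Ruth")
print("\n".join(ruth_rows))
```

Keeping the empty stem for the 10s in McGwire's plot shows that his two 9-home-run seasons sit apart from the rest of his counts.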

16
(No Transcript)
17
(No Transcript)
18
Scatterplots: Graphical Displays of Associations
between Two Variables
  • In a university where calculus is a prerequisite
    for the statistics course, a sample of 100
    students was drawn. The marks for calculus and
    statistics were recorded for each student.
  • Explore the relationship between the marks in
    calculus and statistics.

19
The scatterplot shows a fairly strong positive
linear relationship between calculus scores and
statistics scores.
20
  • Not all data sets are adequately described by a
    model for which the expectation is a straight
    line.
  • x = amount of fertilizer
  • y = yield (in bushels) of tomatoes
  • A modest amount of fertilizer may well enhance
    the crop yield, while too much fertilizer can be
    destructive.

21
  X = amount of fertilizer    Y = yield (in bushels)
  12                          24
  5                           18
  15                          31
  17                          33
  20                          26
  14                          30
  6                           20
  23                          25
  11                          25
  13                          27
  8                           21
  18                          29
  22                          29
  25                          26
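
A quick numerical check of the curved pattern, sketched in Python: average the yields within low, moderate, and high fertilizer amounts. The cut points (below 10, 10 to 18, above 18) and the helper name `mean_yield` are assumptions chosen only for illustration.

```python
# (fertilizer amount, yield in bushels) pairs from the table
pairs = [
    (12, 24), (5, 18), (15, 31), (17, 33), (20, 26), (14, 30), (6, 20),
    (23, 25), (11, 25), (13, 27), (8, 21), (18, 29), (22, 29), (25, 26),
]


def mean_yield(low, high):
    """Average yield over observations with low <= fertilizer <= high."""
    yields = [y for x, y in pairs if low <= x <= high]
    return sum(yields) / len(yields)


low_mean = mean_yield(0, 9)        # small amounts of fertilizer
moderate_mean = mean_yield(10, 18)  # moderate amounts
high_mean = mean_yield(19, 30)     # large amounts

best_x, best_y = max(pairs, key=lambda pair: pair[1])

print(f"low: {low_mean:.2f}  moderate: {moderate_mean:.2f}  high: {high_mean:.2f}")
print(f"largest yield {best_y} at fertilizer amount {best_x}")
```

The average yield rises from the low group to the moderate group and then falls again in the high group, which a straight line cannot capture.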

22
(No Transcript)
23
Numerical descriptions of quantitative data
  • Allow us to be more precise in describing various
    characteristics of a data set
  • Critical to the development of statistical
    inference

24
Measures of Central Location
  • Mean (average)
  • Label the observations in a dataset x_1, x_2, ...,
    x_n.
  • If n is the number of observations in a sample,
    then the sample mean is given by
    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
  • If n is the number of observations in a
    population, then we denote the mean of the
    population by the Greek letter mu, µ.

25
  • Median
  • The median is calculated by placing all the
    observations in ascending order. The observation
    that falls in the middle is the median.
  • When there is an even number of observations,
    the median is the average of the two middle
    observations.
  • Example
  • A sample of 10 adults was asked to report the
    number of hours they spent on the Internet the
    previous month. Find the mean and median by
    hand.
  • 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
  • Mean = 110/10 = 11.0
  • Median = (8 + 9)/2 = 8.5
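
The hand calculation can be checked with a short Python sketch. The variable names are illustrative, and the median is computed directly from the sorted list rather than with a library call so that the two middle observations are visible.

```python
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Mean: the sum of the observations divided by how many there are.
mean_hours = sum(hours) / len(hours)

# Median: sort, then take the middle value, or the average of the two
# middle values when the number of observations is even.
ordered = sorted(hours)
n = len(ordered)
if n % 2 == 1:
    median_hours = ordered[n // 2]
else:
    median_hours = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(ordered)
print("mean =", mean_hours)
print("median =", median_hours)
```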

26
Behavior of Mean and Median
  • Monthly internet usage data
  • 0, 0, 5, 7, 8, 9, 12, 14, 22, 33: mean
    11.0, median 8.5.
  • Suppose the respondent who reported 33 hours on
    the internet actually reported 133 hours.
  • The high outlier pulls the mean internet usage
    from 11.0 to 21.0, but the median stays the same.
  • The mean is pulled in the direction of extreme
    observations while the median is unaffected.
  • The median is a resistant measure of center.
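
A minimal Python sketch of this comparison, using the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

reported = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]

# Replace the largest value, 33 hours, with 133 hours.
with_outlier = reported[:-1] + [133]

mean_before, median_before = mean(reported), median(reported)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

Only the largest observation changed, so the two middle observations (8 and 9) and therefore the median are untouched, while the sum, and with it the mean, moves by 100/10 = 10 hours.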

27
Mean, median, symmetry, and skewness
  • The mean and median of a symmetric distribution
    are close together.
  • In a right-skewed distribution, high observations
    pull the mean right of the median. The mean is
    pulled toward the long tail.
  • In a left-skewed distribution, low observations
    pull the mean left of the median and toward the
    long tail.

28
90 95 94 93 96 90 92
87 82 86 81 86 86 71
71 75 75 77 70 75 75
77 75 62 69 61 62 61
56 56 58 53 51 40 20
21 3
n = 37, Mean = 69.51, Median = 75.00
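
The summary values for these 37 observations can be reproduced with a short Python sketch using the standard `statistics` module. The name `scores` is an assumption; the slide does not say what the values measure.

```python
from statistics import mean, median

scores = [
    90, 95, 94, 93, 96, 90, 92,
    87, 82, 86, 81, 86, 86, 71,
    71, 75, 75, 77, 70, 75, 75,
    77, 75, 62, 69, 61, 62, 61,
    56, 56, 58, 53, 51, 40, 20,
    21, 3,
]

n = len(scores)
score_mean = mean(scores)
score_median = median(scores)

print(f"n = {n}, mean = {score_mean:.2f}, median = {score_median:.2f}")
```

A few very low values (3, 20, 21) pull the mean below the median, as expected for a left-skewed distribution.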
29
Mode
  • Mode is another measure of center, but can be
    misleading.
  • Definition: The mode of a data set is the value
    of the observation that occurs most frequently.
  • Comment 1: There may be more than one mode or no
    mode.
  • Comment 2: For categorical data where the
    ordering of the categories is not relevant, mean
    and median are not appropriate measures of
    center, but mode can be used.

30
Example
  • A sample of 100 individuals contains 15
    left-handers, 80 right-handers, and 5
    ambidextrous individuals.
  • Suppose the data entry is coded as follows: 1 =
    right-handed, 2 = left-handed, 3 = ambidextrous.
  • The sample mean is 1.25: meaningless!
  • The mode is 1 = right-handed.
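
The handedness example can be sketched in Python as follows. The list `codes` is rebuilt from the counts given above, and the dictionary `labels` is an illustrative helper.

```python
from collections import Counter

labels = {1: "right-handed", 2: "left-handed", 3: "ambidextrous"}

# 80 right-handers, 15 left-handers, 5 ambidextrous individuals
codes = [1] * 80 + [2] * 15 + [3] * 5

# The mean of the codes can be computed, but it depends entirely on the
# arbitrary numbers chosen as codes, so it describes nothing real.
mean_code = sum(codes) / len(codes)

# The mode is the most frequent code and is meaningful for categories.
mode_code, mode_count = Counter(codes).most_common(1)[0]

print("mean of codes:", mean_code)
print("mode:", mode_code, "=", labels[mode_code], f"({mode_count} of {len(codes)})")
```

Recoding the categories (say, 3 = right-handed) would change the mean but not the modal category.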

31
Bimodal distribution
  • When the graphical display of a distribution
    shows two peaks, it is often described as
    bimodal. A bimodal distribution might be
    indicative of two natural groupings in the data.
  • In a bimodal distribution, numerical measures of
    center often do not describe the typical value.

32
Weighted Average
  • A weighted average is appropriate when the
    observations have unequal importance.
  • Example: Higher-credit courses carry more weight
    in your GPA.
  • 5-credit course: A (4 points)
  • 5-credit course: B (3 points)
  • 3-credit course: C (2 points)
  • 3-credit course: C (2 points)

33
Weighted average, continued
  • Formula for the weighted mean:
    \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
    where w_i is the weight of the ith
    observation x_i
  • GPA = [5(4) + 5(3) + 3(2) + 3(2)] / 16 = 47/16 ≈ 2.94
  • Without weights, the GPA is (4 + 3 + 2 + 2) / 4 =
    2.75
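
The GPA calculation can be sketched in Python, with credits as the weights. The function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


grade_points = [4, 3, 2, 2]   # A, B, C, C
credits = [5, 5, 3, 3]        # weight of each course

weighted_gpa = weighted_mean(grade_points, credits)
unweighted_gpa = sum(grade_points) / len(grade_points)

print(f"weighted GPA   = {weighted_gpa:.2f}")
print(f"unweighted GPA = {unweighted_gpa:.2f}")
```

The weighted GPA is higher here because the two best grades came in the courses with the most credits.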

34
Measures of Variability: How much do the
observations spread out or vary among themselves?
  • Example
  • Bank 1: single waiting line that feeds three
    tellers
  • Bank 2: three individual lines, one for each
    teller
  • Data: waiting times (in minutes) for a random
    sample of 10 customers at each bank

35
Bank 1: 6.5 6.6 6.7 6.8 7.1 7.3 7.4 7.7 7.7 7.7
mean = 7.15, median = 7.20
Bank 2: 4.2 5.4 5.8 6.2 6.7 7.7 7.7 8.5 9.3 10.0
mean = 7.15, median = 7.20
Without considering variation, we might conclude
that the waiting times at the two banks are pretty
much the same.
Range is the simplest measure of variability.
Range = largest obs - smallest obs
Bank 1 range = 7.7 - 6.5 = 1.2
Bank 2 range = 10.0 - 4.2 = 5.8
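
A short Python sketch of the bank comparison, using the standard `statistics` module; the helper name `data_range` is illustrative. Results are rounded when printed because decimal values such as 7.7 are not stored exactly in floating point.

```python
from statistics import mean, median

bank1 = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
bank2 = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]


def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


for name, times in (("Bank 1", bank1), ("Bank 2", bank2)):
    print(
        f"{name}: mean = {mean(times):.2f}, "
        f"median = {median(times):.2f}, "
        f"range = {data_range(times):.1f}"
    )
```

The centers agree, so only the range reveals that waiting times at Bank 2 vary far more.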
36
Range is the simplest measure of variability,
but is not always satisfactory.
The two data sets have approximately the same
range and the same mean, but there is an obvious
difference between the data sets.
The potencies in the second data set tend to be
more stable and cluster about the center of the
data. There is less variability, but range does
not show this.
37
  • Standard deviation is the most common measure
    used to describe spread. The standard deviation
    measures how far scores tend to be from the mean,
    on average.
  • The individual amounts by which the observations
    deviate from the mean are called deviations.
  • If the deviations tend to be large in magnitude,
    then the data is spread out and exhibits high
    variability.
  • If the deviations tend to be small in magnitude,
    then the data exhibits low variability, and the
    observations tend to be close to average.

38
  • To keep the positive and negative deviations
    from canceling out, we take the average of
    the squared deviations.
  • Standard deviation:
    s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
    where \bar{x} is the sample mean
  • The square root brings the units of measurement
    back to that of the data.
  • Dividing by n-1 instead of n corrects the
    tendency to underestimate the population standard
    deviation, σ.
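
The calculation can be sketched step by step in Python, here applied to the bank waiting times from the earlier example. The function name `sample_sd` is illustrative; the standard library's `statistics.stdev` also divides by n-1 and is printed alongside for comparison.

```python
import math
from statistics import stdev


def sample_sd(values):
    """Sample standard deviation, dividing the squared deviations by n - 1."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]          # these sum to (about) zero
    squared = [d ** 2 for d in deviations]           # squaring removes the signs
    return math.sqrt(sum(squared) / (n - 1))         # back to the data's units


bank1 = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
bank2 = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

sd1 = sample_sd(bank1)
sd2 = sample_sd(bank2)

print(f"Bank 1: s = {sd1:.3f} (statistics.stdev gives {stdev(bank1):.3f})")
print(f"Bank 2: s = {sd2:.3f} (statistics.stdev gives {stdev(bank2):.3f})")
```

Both banks have the same mean waiting time, but the larger standard deviation for Bank 2 reflects its wider spread.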

39
Five-number summary: a quick and resistant measure
of both center and spread
  • Minimum
  • Q1: first quartile
  • the median of the obs left of the overall
    median
  • Q2: second quartile = median
  • Q3: third quartile
  • the median of the obs right of the overall
    median
  • Maximum
  • Definition: interquartile range IQR = Q3 - Q1
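
A minimal Python sketch of the five-number summary under the quartile definition given above (the median of the observations on each side of the overall median), applied to the internet usage data from the earlier example. The function name is illustrative, and the sketch assumes at least two observations. Software packages often use other quartile rules, so their results can differ slightly.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # observations left of the overall median
    upper = data[half + n % 2:]   # observations right of it (skips the middle value when n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
minimum, q1, q2, q3, maximum = five_number_summary(hours)
iqr = q3 - q1

print("min, Q1, median, Q3, max =", (minimum, q1, q2, q3, maximum))
print("IQR =", iqr)
```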

40
Boxplot: A graph of the five-number summary
  • A central box spans the quartiles.
  • A line in the box marks the median.
  • Lines extend from the box out to the smallest and
    largest observations that are not suspected
    outliers.
  • In a modified boxplot, observations that are
    suspected outliers are plotted individually.
  • Call an observation a suspected outlier if it
    falls more than 1.5 × IQR above Q3 or below Q1.
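
The 1.5 × IQR rule can be sketched in Python, again on the internet usage data from the earlier example. The function names are illustrative, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, as in the five-number summary.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the data."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return median(data[:half]), median(data[half + n % 2:])


def boxplot_parts(values):
    """Return (suspected outliers, (lower whisker end, upper whisker end))."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = sorted(x for x in values if x < low_fence or x > high_fence)
    inside = sorted(x for x in values if low_fence <= x <= high_fence)
    # Whiskers stop at the most extreme observations that are not outliers.
    return outliers, (inside[0], inside[-1])


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
outliers, whiskers = boxplot_parts(hours)

print("suspected outliers:", outliers)
print("whiskers extend from", whiskers[0], "to", whiskers[1])
```

With Q1 = 5 and Q3 = 14, the fences fall at -8.5 and 27.5, so the 33-hour report is flagged and the upper whisker stops at 22.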

41
Side-by-side boxplots