Title: Displaying Quantitative Data: center, variability, shape, outliers
1Displaying Quantitative Data (center,
variability, shape, outliers)
- Graphical (visual) displays for quantitative data
- dot plots
- histograms
- stem plots
- scatterplots
2- Some of the concepts we will discuss here are
deliberately vague. We'll get more precise later
when we discuss numerical descriptive techniques.
- Numerical descriptive techniques (later)
- center: mean, median
- variability: range, standard deviation
- resistant measures: five-number summary
- boxplots (a graphical display of the five-number
summary)
3What strikes you as the most distinctive
difference among the distributions of scores in
classes A, B, C?
4- The center of a distribution is usually the most
important aspect to notice and describe.
- The center of a distribution might represent a
typical value.
- For now, we can describe the center of a
distribution by the value with roughly half of
the observations taking smaller values and half
taking larger values.
5What strikes you as the most distinctive
difference among the distributions of scores in
classes D, E, F?
6- A distribution's variability is a second
important feature.
- For now, we can describe the spread of a
distribution by giving the smallest and largest
values.
- When describing the variability or spread of a
distribution, we may wish to leave off outliers.
An individual value that falls outside the
overall pattern is an outlier.
7What strikes you as the most distinctive
difference among the distributions of scores in
classes G, H, I?
8- The shape of a distribution can reveal much
information. Distributions come in a limitless
variety of shapes, but certain shapes arise often
enough to have their own names.
- Symmetric
- one half is roughly a mirror image of the other
- Right skewed
- the distribution tails off toward larger values
- Left skewed
- the distribution tails off toward smaller values
9- Dr. Albert Barnes was a wealthy art collector who
accumulated a large number of impressionist
masterpieces; the total exceeds 800 paintings.
When Dr. Barnes died in 1951, he stated in his
will that his collection was not to be allowed to
tour. However, because of the deterioration of
the exhibit's home near Philadelphia, a judge
ruled that the collection could go on tour to
raise enough money to renovate the building.
- Because of the size and value of the collection,
it was predicted (correctly) that in each city a
large number of people would come to view the
paintings. Because space was limited, most
galleries had to sell tickets that were valid only
at one specified time. To judge how many people to
let in at a time, it was necessary to know the
length of time people would spend at the exhibit:
longer times would dictate smaller audiences,
shorter times would allow for sale of more
tickets. Suppose that in one city the amount of
time taken to view the complete exhibit by each of
400 people was measured and recorded.
10- 113, 49, 62, 44, 42, 32, 43, 46, 54, 98, 64, 34,
61, 60, 52, 57, 42, 46, 69, 70, 36, 43, 30, 54,
47, 38, 55, 40, 70, 55, 70, 44, 29, 40, 48, 66,
54, 59, 94, 64, 45, 38, 70, 62, 48, 57, 84, 50,
27, 38, 55, 51, 62, 34, 54, 47, 58, 45, 61, 61,
47, 64, 34, 107, 70, 30, 61, 65, 44, 54, 62, 74,
41, 30, 88, 58, 59, 43, 63, 33, 51, 58, 48, 33,
36, 52, 29, 34, 66, 50, 45, 44, 47, 41, 39, 38,
106, 49, 35, 46, 42, 31, 41, 98, 40, 48, 42, 25,
33, 29, 66, 39, 30, 47, 43, 35, 30, 59, 45, 41,
31, 47, 26, 53, 40, 23, 79, 28, 78, 74, 42, 52,
53, 46, 40, 50, 90, 50, 37, 45, 89, 39, 60, 44,
36, 57, 47, 78, 48, 37, 55, 44, 54, 59, 70, 60,
34, 32, 35, 48, 52, 53, 151, 43, 112, 44, 39, 53,
41, 70, 72, 32, 71, 63, 65, 49, 31, 32, 83, 37,
40, 64, 47, 38, 32, 49, 33, 78, 50, 35, 28, 39,
54, 41, 82, 32, 42, 43, 43, 57, 45, 88, 66, 53,
57, 46, 61, 53, 90, 28, 41, 74, 31, 107, 45, 50,
72, 75, 30, 54, 65, 73, 45, 58, 48, 62, 60, 92,
50, 43, 70, 33, 29, 40, 91, 49, 56, 39, 35, 24,
52, 41, 31, 63, 44, 57, 50, 42, 41, 27, 44, 46,
64, 39, 71, 42, 30, 109, 66, 41, 32, 51, 41, 56,
38, 80, 54, 60, 41, 33, 134, 71, 33, 63, 45, 63,
57, 64, 91, 91, 28, 98, 27, 102, 8, 44, 53, 71,
42, 31, 46, 55, 67, 41, 40, 67, 48, 70, 40, 71,
28, 29, 40, 35, 58, 64, 33, 50, 82, 53, 33, 54,
85, 77, 67, 38, 28, 63, 45, 48, 34, 63, 42, 88,
42, 36, 36, 33, 52, 104, 68, 48, 85, 29, 51, 49,
60, 47, 63, 62, 82, 60, 50, 28, 78, 42, 121, 49,
125, 57, 93, 32, 52, 32, 44, 41, 38, 45, 36, 43,
29, 85, 51, 42, 73, 44, 79, 28, 70, 42, 45, 64,
38, 54, 41, 56, 46, 45, 28, 70, 47, 41, 35, 62,
33, 40, 35, 43, 81, 45, 43, 68, 58, 90, 63, 39,
44, 27, 46, 36
11If you examine the 400 observations, you acquire
very little information. You may discover that
the smallest number is 23 and the largest number
is 151, but you will have learned very little
about how the numbers are distributed between
these two extremes. A histogram will help
describe how the data are distributed.
13Guidelines for Selecting the Class Intervals
- Number of Observations | Number of Classes
- Less than 50 | 5-7
- 50-200 | 7-9
- 200-500 | 9-10
- 500-1,000 | 10-11
- 1,000-5,000 | 11-13
- 5,000-50,000 | 13-17
- More than 50,000 | 17-20
- (rounded to some convenient value)
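As a rough sketch of how the guidelines above might be applied, the Python snippet below looks up a suggested number of classes and derives a class width for the 400 viewing times (smallest 23, largest 151). The names `suggested_classes` and `class_width` are hypothetical, treating each upper limit in the table as inclusive is an assumption (the table's boundaries overlap), and the width rule (range divided by number of classes, rounded up to a multiple of 5) is also an assumption; the slide only says the width is rounded to some convenient value.

```python
import math

# (upper limit of n, (fewest classes, most classes)) from the guideline table.
# Assumption: each upper limit is inclusive, since 200, 500, etc. appear in two rows.
GUIDELINES = [
    (49, (5, 7)),
    (200, (7, 9)),
    (500, (9, 10)),
    (1000, (10, 11)),
    (5000, (11, 13)),
    (50000, (13, 17)),
]

def suggested_classes(n):
    """Return the (low, high) number of classes suggested for n observations."""
    for upper, classes in GUIDELINES:
        if n <= upper:
            return classes
    return (17, 20)  # more than 50,000 observations

def class_width(smallest, largest, num_classes, round_to=5):
    """Assumed rule: range / number of classes, rounded up to a multiple of round_to."""
    raw = (largest - smallest) / num_classes
    return math.ceil(raw / round_to) * round_to

low, high = suggested_classes(400)
width = class_width(23, 151, high)
print(low, high, width)
```

For the exhibit data this suggests 9 to 10 classes; with 10 classes the raw width of 12.8 rounds up to 15 under the assumed rounding rule.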
14Stemplots
- A stem plot is another way to display the
distribution of data.
- Separate each observation into a stem, consisting
of all but the final (rightmost) digit, and a
leaf, the final digit. Stems may have as many
digits as needed, but each leaf contains only a
single digit.
- Write the stems in a vertical column with the
smallest at the top, and draw a vertical line at
the right of this column.
- Write each leaf in the row to the right of its
stem, in increasing order out from the stem.
- A stem plot looks like a histogram turned on end.
15Mark McGwire vs. Babe Ruth
- Mark McGwire's home run counts, 1987-2001
- 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70,
65, 32, 29
- Babe Ruth's home run counts, 1920-1934
- 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46,
41, 34, 22
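The stem plot steps described earlier can be sketched in Python for these two data sets. This is a minimal illustration, not the slides' own figure: the function name `stemplot` is hypothetical, single-digit counts such as 9 are assumed to have stem 0, and stems with no leaves are still printed so the shape stays visible.

```python
from collections import defaultdict

def stemplot(values):
    """Return stem plot lines for non-negative integers (stem = all but last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):              # sorting puts leaves in increasing order
        stems[v // 10].append(v % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

mcgwire = [49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29]
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

print("McGwire")
print("\n".join(stemplot(mcgwire)))
print("Ruth")
print("\n".join(stemplot(ruth)))
```

Printing the two plots one after the other makes it easy to compare the spread of McGwire's seasons with the tighter cluster of Ruth's counts in the 40s.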
18Scatterplots: Graphical Displays of Associations
between Two Variables
- In a university where calculus is a prerequisite
for the statistics course, a sample of 100
students was drawn. The marks for calculus and
statistics were recorded for each student.
- Explore the relationship between the marks in
calculus and statistics.
19The scatterplot shows a fairly strong positive
linear relationship between Calculus scores and
Statistics scores.
20- Not all data sets are adequately described by a
model for which the expectation is a straight
line.
- x = amount of fertilizer
- y = yield (in bushels) of tomatoes
- A modest amount of fertilizer may well enhance
the crop yield, while too much fertilizer can be
destructive.
21- X (amt of fertilizer) | Y (yield in bushels)
- 12 | 24
- 5 | 18
- 15 | 31
- 17 | 33
- 20 | 26
- 14 | 30
- 6 | 20
- 23 | 25
- 11 | 25
- 13 | 27
- 8 | 21
- 18 | 29
- 22 | 29
- 25 | 26
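One quick way to see that a straight line is a poor description of these data is to sort the pairs by fertilizer amount and look at how the yield changes. The sketch below does that with a crude text bar for each yield; the variable names are illustrative only.

```python
# (fertilizer amount x, tomato yield y in bushels), copied from the table above
data = [(12, 24), (5, 18), (15, 31), (17, 33), (20, 26), (14, 30), (6, 20),
        (23, 25), (11, 25), (13, 27), (8, 21), (18, 29), (22, 29), (25, 26)]

by_x = sorted(data)                            # order the pairs by fertilizer amount
peak = max(by_x, key=lambda pair: pair[1])     # pair with the highest yield

for x, y in by_x:
    print(f"{x:>3} {y:>3} " + "*" * y)         # one star per bushel
print("highest yield at (x, y) =", peak)
```

The yield climbs as x increases up to 17 and is lower for every larger amount in the table, which matches the remark that too much fertilizer can be destructive.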
23Numerical descriptions of quantitative data
- Allow us to be more precise in describing various
characteristics of a data set
- Critical to the development of statistical
inference
24Measures of Central Location
- Mean (average)
- Label the observations in a dataset $x_1, x_2, \ldots, x_n$.
- If n is the number of observations in a sample,
then the sample mean is given by
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- If n is the number of observations in a
population, then we denote the mean of the
population by the Greek letter mu, µ.
25- Median
- The median is calculated by placing all the
observations in ascending order. The observation
that falls in the middle is the median.
- When there is an even number of observations,
the median is the average of the two middle
observations.
- Example
- A sample of 10 adults was asked to report the
number of hours they spent on the Internet the
previous month. Find the mean and median by
hand.
- 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
- Mean = 110/10 = 11.0
- Median = (8 + 9)/2 = 8.5
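The hand calculation can be checked with a short Python sketch that follows the same steps: sum and divide for the mean, sort and take the middle for the median. The helper names `mean` and `median` are written out here for illustration rather than taken from any library.

```python
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

def mean(values):
    return sum(values) / len(values)

def median(values):
    ordered = sorted(values)          # place observations in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single middle observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle ones

print(mean(hours))    # 11.0
print(median(hours))  # 8.5
```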
26Behavior of Mean and Median
- Monthly internet usage data
- 0, 0, 5, 7, 8, 9, 12, 14, 22, 33: mean = 11.0,
median = 8.5.
- Suppose the respondent who reported 33 hours on
the internet actually reported 133 hours.
- The high outlier pulls the mean internet usage
from 11.0 to 21.0, but the median stays the same.
- The mean is pulled in the direction of extreme
observations while the median is unaffected.
- The median is a resistant measure of center.
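The effect of the outlier can be reproduced with Python's standard `statistics` module. This is only a sketch of the comparison on the slide; the list names `reported` and `with_outlier` are illustrative.

```python
from statistics import mean, median

reported = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
# Replace the 33-hour response with 133 hours, as in the slide's scenario.
with_outlier = [133 if h == 33 else h for h in reported]

print("original:    ", mean(reported), median(reported))
print("with outlier:", mean(with_outlier), median(with_outlier))
```

The mean moves from 11 to 21 while the median stays at 8.5, because only the position of the middle observations matters for the median.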
27Mean, median, symmetry, and skewness
- The mean and median of a symmetric distribution
are close together.
- In a right-skewed distribution, high observations
pull the mean right of the median. The mean is
pulled toward the long tail.
- In a left-skewed distribution, low observations
pull the mean left of the median and toward the
long tail.
28- 90 95 94 93 96 90 92
87 82 86 81 86 86 71
71 75 75 77 70 75 75
77 75 62 69 61 62 61
56 56 58 53 51 40 20
21 3
- n = 37, Mean = 69.51, Median = 75.00
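As a check on the summary values for this data set, the sketch below recomputes n, the mean, and the median with the standard `statistics` module. The list name `values` is illustrative; the slide does not say what the numbers measure.

```python
from statistics import mean, median

values = [90, 95, 94, 93, 96, 90, 92,
          87, 82, 86, 81, 86, 86, 71,
          71, 75, 75, 77, 70, 75, 75,
          77, 75, 62, 69, 61, 62, 61,
          56, 56, 58, 53, 51, 40, 20,
          21, 3]

m = mean(values)
med = median(values)
print(len(values), round(m, 2), med)
print("mean below median:", m < med)
```

The few low values (3, 20, 21) pull the mean below the median, the pattern described for a left-skewed distribution.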
29Mode
- Mode is another measure of center, but can be
misleading.
- Definition: The mode of a data set is the value
of the observation that occurs most frequently.
- Comment 1: There may be more than one mode or no
mode.
- Comment 2: For categorical data where the
ordering of the categories is not relevant, mean
and median are not appropriate measures of
center, but mode can be used.
30Example
- A sample of 100 individuals contains 15
left-handers, 80 right-handers, and 5
ambidextrous individuals.
- Suppose the data entry is coded as follows: 1 =
right-handed, 2 = left-handed, 3 = ambidextrous.
- The sample mean is 1.25: meaningless!
- The mode is 1 = right-handed.
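The handedness example can be reproduced in a few lines of Python. The sketch rebuilds the coded sample from the counts on the slide and uses `collections.Counter` to find the mode; the variable names are illustrative.

```python
from collections import Counter

# 80 right-handers (code 1), 15 left-handers (code 2), 5 ambidextrous (code 3)
codes = [1] * 80 + [2] * 15 + [3] * 5
labels = {1: "right-handed", 2: "left-handed", 3: "ambidextrous"}

mean_code = sum(codes) / len(codes)                       # arithmetic on arbitrary codes
mode_code, mode_count = Counter(codes).most_common(1)[0]  # most frequent code

print(mean_code)                                  # 1.25
print(mode_code, labels[mode_code], mode_count)
```

Recoding the categories (say, swapping 1 and 3) would change the mean but not which category is the mode, which is why the mean carries no meaning here.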
31Bimodal distribution
- When the graphical display of a distribution
shows two peaks, it is often described as
bimodal. A bimodal distribution might be
indicative of two natural groupings in the data.
- In a bimodal distribution, numerical measures of
center often do not describe the typical value.
32Weighted Average
- A weighted average is appropriate when the
observations have unequal importance.
- Example: Higher credit courses carry more weight
in your GPA.
- 5 credit course: A (4 points)
- 5 credit course: B (3 points)
- 3 credit course: C (2 points)
- 3 credit course: C (2 points)
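33Weighted average, continued
- Formula for weighted mean:
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
- where $w_i$ is the weight of the ith
observation $x_i$
- GPA = [5(4) + 5(3) + 3(2) + 3(2)] / 16 = 47/16 ≈ 2.94
- Without weights, the GPA is (4 + 3 + 2 + 2) / 4 =
2.75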
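The GPA calculation is a direct instance of the weighted mean, with grade points as the observations and credit hours as the weights. A minimal sketch, with the hypothetical helper name `weighted_mean`:

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

points = [4, 3, 2, 2]          # A, B, C, C
credit_hours = [5, 5, 3, 3]    # weight of each course

gpa = weighted_mean(points, credit_hours)
unweighted = sum(points) / len(points)
print(round(gpa, 2), unweighted)   # 2.94 2.75
```

The weighted GPA is higher than the unweighted one because the two best grades come from the courses carrying the most credits.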
34Measures of Variability: How much do the
observations spread out or vary among themselves?
- Example
- Bank 1: single waiting line that feeds three
tellers
- Bank 2: three individual lines, one for each
teller
- Data: waiting times (in minutes) for a random
sample of 10 customers at each bank
35- Bank 1: 6.5 6.6 6.7 6.8 7.1 7.3 7.4 7.7 7.7 7.7
(mean = 7.15, median = 7.20)
- Bank 2: 4.2 5.4 5.8 6.2 6.7 7.7 7.7 8.5 9.3 10.0
(mean = 7.15, median = 7.20)
- Without considering variation, we might conclude
that the waiting times at the two banks are pretty
much the same.
- Range is the simplest measure of variability.
- Range = largest obs - smallest obs
- Bank 1 range = 7.7 - 6.5 = 1.2
- Bank 2 range = 10.0 - 4.2 = 5.8
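A short Python sketch makes the comparison concrete: both banks share the same mean, yet their ranges differ sharply. The function name `value_range` is illustrative, and results are rounded only to hide floating-point noise.

```python
bank1 = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
bank2 = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

for name, times in (("Bank 1", bank1), ("Bank 2", bank2)):
    avg = sum(times) / len(times)
    print(name, "mean:", round(avg, 2), "range:", round(value_range(times), 1))
```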
36Range is the simplest measure of variability,
but is not always satisfactory.
The two data sets have approximately the same
range and the same mean, but there is an obvious
difference in the data sets.
The potencies in the second data set tend to be
more stable and cluster about the center of the
data. There is less variability, but range does
not show this.
37- Standard deviation is the most common measure
used to describe spread. The standard deviation
measures how far scores tend to be from the mean,
on average.
- The individual amounts by which the observations
deviate from the mean are called deviations.
- If the deviations tend to be large in magnitude,
then the data are spread out and exhibit high
variability.
- If the deviations tend to be small in magnitude,
then the data exhibit low variability, and the
observations tend to be close to average.
38- To counteract the positive and negative
deviations canceling out, we take the average of
the squared deviations.
- Standard deviation: $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$,
where $\bar{x}$ is the sample mean
- The square root brings the units of measurement
back to that of the data.
- Dividing by n-1 instead of n corrects the
tendency to underestimate the population standard
deviation, σ.
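The steps behind the sample standard deviation (deviations, squares, divide by n-1, square root) can be spelled out in Python and applied to the bank waiting times from the earlier slide. The helper name `sample_sd` is hypothetical; the standard library's `statistics.stdev` uses the same n-1 divisor and is imported only so the two can be compared.

```python
import math
from statistics import stdev

bank1 = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
bank2 = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def sample_sd(values):
    n = len(values)
    xbar = sum(values) / n                                   # sample mean
    squared_deviations = [(x - xbar) ** 2 for x in values]   # squaring removes the signs
    return math.sqrt(sum(squared_deviations) / (n - 1))      # divide by n-1, then square root

s1, s2 = sample_sd(bank1), sample_sd(bank2)
print(round(s1, 2), round(s2, 2))
print(round(stdev(bank1), 2), round(stdev(bank2), 2))        # library values for comparison
```

Although the two banks have the same mean waiting time, the standard deviation for Bank 2 is several times larger, reflecting its much more variable waits.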
39Five-number summary: a quick and resistant measure
of both center and spread
- Minimum
- Q1 = first quartile
- the median of the obs left of the overall
median
- Q2 = second quartile = median
- Q3 = third quartile
- the median of the obs right of the overall
median
- Maximum
- Definition: interquartile range IQR = Q3 - Q1
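The quartile rule above (median of the observations on each side of the overall median) can be sketched in Python and tried on the internet usage data from the earlier example. The function names are hypothetical, and the sketch assumes at least two observations; note that software packages often use slightly different quartile rules, so their answers may differ from this hand method.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # observations left of the overall median
    upper = ordered[half + n % 2:]     # observations right of it (median itself skipped when n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
minimum, q1, q2, q3, maximum = five_number_summary(hours)
print(minimum, q1, q2, q3, maximum, "IQR =", q3 - q1)
```

For the internet usage data this gives Min = 0, Q1 = 5, median = 8.5, Q3 = 14, Max = 33, so IQR = 9.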
40Boxplot: A graph of the five-number summary
- A central box spans the quartiles.
- A line in the box marks the median.
- Lines extend from the box out to the smallest and
largest observations that are not suspected
outliers.
- In a modified boxplot, observations that are
suspected outliers are plotted individually.
- Call an observation a suspected outlier if it
falls more than 1.5 × IQR above Q3 or below Q1.
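The 1.5 × IQR rule is easy to apply in code. The sketch below uses the internet usage data from the earlier example, with Q1 = 5 and Q3 = 14 obtained by taking the median of each half of the sorted data; the function name `suspected_outliers` is hypothetical.

```python
def suspected_outliers(values, q1, q3):
    """Return observations more than 1.5 * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
# Q1 = 5 and Q3 = 14: medians of the lower half (0, 0, 5, 7, 8)
# and the upper half (9, 12, 14, 22, 33) of the sorted data.
print(suspected_outliers(hours, q1=5, q3=14))   # [33]
```

With IQR = 9 the cutoffs are -8.5 and 27.5, so only the 33-hour response is flagged; in a modified boxplot it would be plotted individually and the upper line would stop at 22.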
41Side-by-side boxplots