Title: Describing Data Numerically
1Chapter 3
- Describing Data Numerically
2Basic Summary Measurements
- Measures of center
- median
- mean
- mode
- Measures of position
- quartiles
- percentiles
- Measures of spread
- range
- interquartile range
- variance
- standard deviation
3Median
- Definition
- The median is the value of the middle term in a
data set that has been ranked in increasing order.
4Median Continued
- The calculation of the median consists of the
following two steps:
- Rank the data set in increasing order.
- Find the middle term in a data set with n values.
The value of this term is the median.
5Median cont.
Value of Median for Ungrouped Data
Median = value of the ((n + 1)/2)th term in a ranked data set
6Example
The following data give the weight lost (in
pounds) by a sample of five members of a health
club at the end of two months of membership:
10 5 19 8 3
Find the median.
7Solution
First, we rank the given data in increasing
order as follows:
3 5 8 10 19
There are five observations in the data set.
Consequently, n = 5 and
Position of the middle term = (n + 1)/2 = (5 + 1)/2 = 3
8Solution
- Therefore, the median is the value of the third
term in the ranked data.
- 3 5 8 10 19 (the median is the third term, 8)
- The median weight loss for this sample of five
members of this health club is 8 pounds.
9Median and Histogram
The median gives the center of a histogram,
with half the data values to the left of the
median and half to the right of the median. The
advantage of using the median as a measure of
central tendency is that it is not influenced by
outliers. Consequently, the median is preferred
over the mean as a measure of central tendency
for data sets that contain outliers.
10Median
- The median of a data set is a number such that
half the data values are above it and half the
data values are below it.
- The median is the 50th percentile of the data.
11Example of Finding Median
- Distance from the sun (in millions of miles) of
the nine planets
- 36, 67, 93, 142, 484, 887, 1765, 2791, 3654
- What is the median distance from the sun?
- 484 million miles (the 5th of the 9 ranked values)
12Another Example
- Average daily temperature in Chicago for January
through December (degrees F)
- 41, 45, 54, 62, 69, 76, 79, 78, 73, 62, 53, 45
- What is the median monthly temperature?
- Ranked: 41, 45, 45, 53, 54, 62, 62, 69, 73, 76, 78, 79
- (62 + 62)/2 = 62 degrees
13How to Find the Median
- The sample size is n
- Find (n + 1)/2
- The median is the ((n + 1)/2)th observation in the ranked data
- If n = 19, then (n + 1)/2 = 20/2 = 10 and the median is
the 10th observation
- If n = 20, then (n + 1)/2 = 21/2 = 10.5 and the median
is the average of the 10th and 11th observations
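The (n + 1)/2 rule can be sketched in Python as follows. This is a minimal illustration, not part of the original slides; the function names `median_position` and `median` are chosen here for the example.

```python
def median_position(n):
    """Return the 1-based position (n + 1) / 2 of the median in ranked data."""
    return (n + 1) / 2


def median(values):
    """Median of ungrouped data: rank the values, then take the middle term."""
    ranked = sorted(values)            # step 1: rank in increasing order
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    pos = median_position(n)           # step 2: locate the middle term
    if pos.is_integer():               # odd n: a single middle observation
        return ranked[int(pos) - 1]
    lower = ranked[int(pos) - 1]       # even n: the two middle observations
    upper = ranked[int(pos)]
    return (lower + upper) / 2


print(median_position(19), median_position(20))   # 10.0 10.5
print(median([10, 5, 19, 8, 3]))                  # 8 (weight-loss example)
print(median([36, 67, 93, 142, 484, 887, 1765, 2791, 3654]))  # 484 (planets)
print(median([41, 45, 54, 62, 69, 76, 79, 78, 73, 62, 53, 45]))  # 62.0 (Chicago)
```

For an even number of observations the function returns the average of the two middle values, so the result may be a value that does not appear in the data.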
14Two Examples
- Amount spent renting videos during 2004 for 15
households
- 396 52 8 120 140 54 360 230 50 150 700
410 80 200 72
- What is the median?
- 140 (the 8th observation in the ranked data)
15Stem-and-Leaf for Video Data
- Prototype: 1 | 20 = 120
- 0 | 08 50 52 54 72 80
- 1 | 20 40 50 (median: leaf 40, that is, 140)
- 2 | 00 30
- 3 | 60 96
- 4 | 10
- High: 700
16Second Example
- Number of stolen bases for the National League in
2002
- 103, 86, 92, 74, 96, 71, 118, 177, 76, 104, 87,
116, 94, 71, 63, 86
- What is the median?
- (87 + 92)/2 = 89.5, the average of the 8th and 9th
observations
17Stem-and-Leaf for Stolen Bases
- Prototype: 7 | 1 = 71
- 6 | 3
- 7 | 1 1 4 6
- 8 | 6 6 7
- 9 | 2 4 6
- 10 | 3 4
- 11 | 6 8
- High: 177
- Median = (87 + 92)/2 = 89.5
18Warning
- When using Excel to compute quartiles you will
sometimes get different values for Q1 and Q3
than the method described by your text gives.
- Excel uses a more complicated interpolation
algorithm than your textbook for calculating
quartiles, so be aware that the values you compute by
hand may differ from those that Excel reports.
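The size of such a discrepancy can be illustrated with a short Python sketch. It uses the 12 household incomes from the box-and-whisker example in these slides. Python's `statistics.quantiles` with `method="inclusive"` is used here as a stand-in for an interpolating spreadsheet formula; that it matches any particular Excel function is an assumption, not something the slides state.

```python
import statistics

# Household incomes (in thousands of dollars) from the box-and-whisker example.
incomes = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]
ranked = sorted(incomes)

# By hand: Q1 and Q3 are the medians of the lower and upper halves (n = 12 here).
q1_by_hand = statistics.median(ranked[:6])
q3_by_hand = statistics.median(ranked[6:])

# Interpolating method: returns the three cut points [Q1, Q2, Q3].
interpolated = statistics.quantiles(incomes, n=4, method="inclusive")

print(q1_by_hand, q3_by_hand)   # 37.0 61.0
print(interpolated)             # [38.0, 47.0, 59.5]
```

Both approaches agree on the median (47) but give slightly different first and third quartiles for the same data.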
19Sample Mean
- The symbol for the sample mean is $\bar{x}$ (read "x-bar")
- The sample mean is the average of the data
- The sample mean is the value where the histogram
balances
20Mean
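The mean for ungrouped data is obtained by
dividing the sum of all values by the number of
values in the data set. Thus,
Mean for population data: $\mu = \frac{\sum x}{N}$
Mean for sample data: $\bar{x} = \frac{\sum x}{n}$
where $\sum x$ is the sum of all values, N is the population size, and n is the sample size.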
21Example
The table on the next slide gives the 2002
total payrolls of five Major League Baseball
(MLB) teams. Find the mean of the 2002 payrolls
of these five MLB teams.
22Table
23Solution
Thus, the mean 2002 payroll of these five MLB
teams was $78 million.
24Example
The following are the ages of all eight
employees of a small company:
53 32 61 27 39 44 49 57
Find the mean age of these employees.
25Solution
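Because the data include all eight employees, this is a population mean:
$\mu = \frac{\sum x}{N} = \frac{53 + 32 + 61 + 27 + 39 + 44 + 49 + 57}{8} = \frac{362}{8} = 45.25$
Thus, the mean age of all eight employees of this
company is 45.25 years, or 45 years and 3 months.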
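The same calculation can be written as a short Python sketch. The function name `mean` is chosen here for illustration; the formula is the same for a population and a sample, only the symbol differs.

```python
def mean(values):
    """Mean of ungrouped data: the sum of all values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


ages = [53, 32, 61, 27, 39, 44, 49, 57]   # ages of all eight employees
mu = mean(ages)
print(sum(ages), len(ages), mu)           # 362 8 45.25

# Convert the fractional year into months: 0.25 year = 3 months.
years = int(mu)
months = round((mu - years) * 12)
print(years, "years and", months, "months")   # 45 years and 3 months
```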
26Mean Continued
- Definition
- Values that are very small or very large relative
to the majority of the values in a data set are
called outliers or extreme values.
27Example
The table lists the 2000 populations (in
thousands) of the five Pacific states.
28Discussion
- Notice that the population of California is very
large compared to the populations of the other
four states. Hence, it is an outlier. How does
the inclusion of this outlier affect the value
of the mean?
29Solution
- If we do not include the population of California
(the outlier) the mean population of the
remaining four states (Washington, Oregon,
Alaska, and Hawaii) is
30Solution
- Now, to see the impact of the outlier on the
value of the mean, we include the population of
California and find the mean population of all
five Pacific states. This mean is
31Mode
- Definition
- The mode is the value that occurs with the
highest frequency in a data set.
32Example
- The following data give the speeds (in miles per
hour) of eight cars that were stopped on I-95 for
speeding violations.
- 77 69 74 81 71 68 74 73
- Find the mode.
33Solution
- In this data set, 74 occurs twice and each of the
remaining values occurs only once. Because 74
occurs with the highest frequency, it is the
mode. Therefore,
- Mode = 74 miles per hour
34Mode cont.
- A data set may have no mode or many modes,
whereas it will have only one mean and only one
median.
- A data set with only one mode is called
unimodal.
- A data set with two modes is called bimodal.
- A data set with more than two modes is called
multimodal.
35Example
- Last year's incomes of five randomly selected
families were
- $36,150, $95,750, $54,985, $77,490, $23,740.
- Find the mode.
36Solution
- Because each value in this data set occurs only
once, this data set contains no mode.
37Example
The prices of the same brand of television set
at eight stores are found to be
$495, $486, $503, $495, $470, $505, $470, $499
Find the mode.
38Solution
- In this data set, each of the two values $495
and $470 occurs twice and each of the remaining
values occurs only once.
- Therefore, this data set has two modes: $495 and
$470.
39Example
The ages of 10 randomly selected students from
a class are
21, 19, 27, 22, 29, 19, 25, 21, 22, and 30
Find the mode.
40Solution
This data set has three modes: 19, 21, and 22.
Each of these three values occurs with a
(highest) frequency of 2.
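The mode examples above can be reproduced with a short Python sketch. The function name `modes` is chosen here for illustration. It follows the convention used in these slides that a data set in which every value occurs only once has no mode.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency.

    Returns an empty list when every value occurs only once (no mode).
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []                     # each value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([77, 69, 74, 81, 71, 68, 74, 73]))               # [74]
print(modes([36150, 95750, 54985, 77490, 23740]))            # []
print(modes([495, 486, 503, 495, 470, 505, 470, 499]))       # [470, 495]
print(modes([21, 19, 27, 22, 29, 19, 25, 21, 22, 30]))       # [19, 21, 22]
```

Because the function only counts occurrences, it also works for qualitative data such as category labels.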
41Mode cont.
One advantage of the mode is that it can be
calculated for both kinds of data, quantitative
and qualitative, whereas the mean and median can
be calculated for only quantitative data.
42Example
- The statuses of five students who are members of
the student senate at a college are senior,
sophomore, senior, junior, senior.
- Find the mode.
43Solution
- Because senior occurs more frequently than the
other categories, it is the mode for this data
set.
- We cannot calculate the mean and median for this
data set.
44Relationships among the Mean, Median, and Mode
For a symmetric histogram with one peak the
values of the mean, median, and mode are
identical, and they lie at the center of the
distribution.
45Relationships among the Mean, Median, and Mode
cont.
- For a histogram skewed to the right, the
value of the mean is the largest, that of the
mode is the smallest, and the value of the median
lies between these two.
- Notice that the mode always occurs at the peak
point.
- The value of the mean is the largest in this case
because it is sensitive to outliers that occur in
the right tail.
- These outliers pull the mean to the right.
46Mean, median, and mode for a histogram skewed to
the right.
47Relationships among the Mean, Median, and Mode
cont.
If a histogram is skewed to the left,
the value of the mean is the smallest and that
of the mode is the largest, with the value of the
median lying between these two. In this case,
the outliers in the left tail pull the mean to
the left.
48Mean, median, and mode for a histogram skewed to
the left.
49When Does the Median = Mean?
- If the histogram is symmetrical then the mean and
the median are close in value
- If the histogram is skewed or there are outliers
the mean and median will have different values
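A short Python sketch shows how far one extreme value can pull the mean away from the median. It reuses the household video-rental amounts from the earlier median example, whose largest value (700) is listed separately as "High" in the stem-and-leaf display. Treating 700 as the extreme value to drop is a choice made here for illustration.

```python
import statistics

# Amount spent renting videos during 2004 for 15 households (from the slides).
videos = [396, 52, 8, 120, 140, 54, 360, 230, 50, 150, 700, 410, 80, 200, 72]
without_high = [v for v in videos if v != 700]   # drop the single high value

mean_with = statistics.mean(videos)
median_with = statistics.median(videos)
mean_without = statistics.mean(without_high)
median_without = statistics.median(without_high)

print(round(mean_with, 1), median_with)         # 201.5 140
print(round(mean_without, 1), median_without)   # 165.9 130.0
```

Removing the one high value moves the mean by about 36 but the median by only 10, and in both cases the mean lies above the median, as expected for data with a long right tail.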
50Effects of Skewness
- A histogram is skewed right if the outliers are
on the right (or high side)
- A histogram is skewed left if the outliers are on
the left (or low side)
- Skewed right: mean > median
- Skewed left: mean < median
51Measures of Position
- The median is a measure of position: it marks
the midpoint or 50th percentile of the data
- Other important benchmarks are the 25th and 75th
percentiles, which isolate the middle 50% of the
data
- Other measures of position include other
percentiles such as the 10th, 80th, etc.
52Quartiles
- The 25th percentile is referred to as the first
quartile or symbolically Q1
- The 75th percentile is referred to as the third
quartile or symbolically Q3
- Sometimes the median is referred to as the second
quartile or Q2
53How to Find the Quartiles
The weight losses (in pounds) for 17 members of a
health club three months after joining are
5 8 10 7 2 6 3 9 4 11 7 5 9 4 6 11 5
Draw the stem-and-leaf graph for the data.
Find the median as well as Q1 and Q3.
54Stem-and Leaf Graph for Weight Loss Data
Prototype: 0 | 5 = 5
0 | 2 3 4 4
0 | 5 5 5 6 6 7 7 8 9 9
1 | 0 1 1
Median = 6
Q1 = (4 + 5)/2 = 4.5
Q3 = (9 + 9)/2 = 9
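The hand method used above (Q1 and Q3 as the medians of the lower and upper halves of the ranked data) can be sketched in Python. The function names are chosen here for illustration. The halves are split by position, so when n is odd the middle observation is left out of both halves, which matches the worked example.

```python
def median_of_ranked(ranked):
    """Median of a list that is already in increasing order."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ranked[:half]          # observations below the median position
    upper = ranked[n - half:]      # observations above the median position
    return median_of_ranked(lower), median_of_ranked(ranked), median_of_ranked(upper)


weight_loss = [5, 8, 10, 7, 2, 6, 3, 9, 4, 11, 7, 5, 9, 4, 6, 11, 5]
print(quartiles(weight_loss))   # (4.5, 6, 9.0)
```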
55Quartiles and Interquartile Range
- Definition
- Quartiles are three summary measures that divide
a ranked data set into four equal parts. The
second quartile is the same as the median of a
data set. The first quartile is the value of the
middle term among the observations that are less
than the median, and the third quartile is the
value of the middle term among the observations
that are greater than the median.
56Visually
Each of these portions contains 25% of the
observations of a data set arranged in increasing
order.
[Figure: ranked data divided by Q1, Q2, and Q3 into four portions of 25% each]
57Quartiles and Interquartile Range cont.
- Calculating Interquartile Range
- The difference between the third and first
quartiles gives the interquartile range; that is,
- IQR = Interquartile range = Q3 - Q1
58Example
- The following are the ages of nine employees of
an insurance company:
- 47 28 39 51 33 37 59 24 33
- a) Find the values of the three quartiles. Where
does the age of 28 fall in relation to the ages
of the employees?
- b) Find the interquartile range.
59Solution
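a) Ranked data: 24 28 33 33 37 39 47 51 59
Median: Q2 = 37
Values less than the median: 24 28 33 33, so Q1 = (28 + 33)/2 = 30.5
Values greater than the median: 39 47 51 59, so Q3 = (47 + 51)/2 = 49
The age of 28 falls in the lowest 25% of the ages.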
60Solution
b) IQR = Interquartile range = Q3 - Q1
= 49 - 30.5 = 18.5 years
61BOX-AND-WHISKER PLOT
- Definition
- A plot that shows the center, spread, and
skewness of a data set. It is constructed by
drawing a box and two whiskers that use the
median, the first quartile, the third quartile,
and the smallest and the largest values in the
data set between the lower and the upper inner
fences.
62Example
- The following data are the incomes (in thousands
of dollars) for a sample of 12 households.
- 35 29 44 72 34 64 41 50 54 104
39 58
- Construct a box-and-whisker plot for these data.
63Solution
- Step 1.
- 29 34 35 39 41 44 50 54 58 64
72 104
- Median = (44 + 50)/2 = 47
- Q1 = (35 + 39)/2 = 37
- Q3 = (58 + 64)/2 = 61
- IQR = Q3 - Q1 = 61 - 37 = 24
64Solution Continued
- Step 2.
- 1.5 x IQR = 1.5 x 24 = 36
- Lower inner fence = Q1 - 36 = 37 - 36 = 1
- Upper inner fence = Q3 + 36 = 61 + 36 = 97
65Solution Continued
- Step 3.
- Smallest value within the two inner fences = 29
- Largest value within the two inner fences = 72
66Solution Continued
[Figure: box drawn along the Income axis from the first quartile to the third quartile, with a line at the median]
67Solution Continued
[Figure: completed box-and-whisker plot along the Income axis, labeling the first quartile, the median, the third quartile, the smallest and largest values within the two inner fences, and an outlier]
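Steps 1 through 3 of this solution can be sketched in Python. This is a minimal illustration that computes the numbers needed to draw the plot rather than drawing it. The function and key names are chosen here for the example, and the quartiles use the same median-of-halves method as the hand calculation.

```python
def _median(ranked):
    """Median of a list that is already in increasing order."""
    n = len(ranked)
    mid = n // 2
    return ranked[mid] if n % 2 == 1 else (ranked[mid - 1] + ranked[mid]) / 2


def box_plot_summary(values):
    """Quartiles, inner fences, whisker ends, and outliers for a box plot."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    # Step 1: median, quartiles, and interquartile range.
    q1 = _median(ranked[:half])
    q3 = _median(ranked[n - half:])
    iqr = q3 - q1
    # Step 2: inner fences lie 1.5 x IQR beyond the quartiles.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Step 3: whiskers end at the extreme values inside the fences.
    inside = [v for v in ranked if lower_fence <= v <= upper_fence]
    outliers = [v for v in ranked if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": _median(ranked), "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


incomes = [35, 29, 44, 72, 34, 64, 41, 50, 54, 104, 39, 58]
summary = box_plot_summary(incomes)
print(summary["lower_fence"], summary["upper_fence"])     # 1.0 97.0
print(summary["whisker_low"], summary["whisker_high"])    # 29 72
print(summary["outliers"])                                # [104]
```

The income of 104 lies above the upper inner fence of 97, so it is plotted separately as an outlier and the upper whisker stops at 72.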
68Five Number Summary
- Another very convenient way to graph quantitative
data is a boxplot, which uses 5 numbers to
summarize the data:
- Minimum value
- 25th percentile (1st quartile) Q1
- 50th percentile (2nd quartile or median) Q2
- 75th percentile (3rd quartile) Q3
- Maximum value
69Why Boxplots?
- Present information more compactly than
histograms
- Easier to make comparisons among several data
sets
70Main Components of a Boxplot
The boxplot represents the data of a random
sample of women who took an exam in elementary
statistics
71
lower quartile is 76.61: left side of the box
upper quartile is 89.59: right side of the box
median is 84.70: middle line of the box
fences bound all the data except for outliers
72The Interquartile Range
The interquartile range is a measure of how
spread out the middle 50% of the data is.
Interquartile range (IQR) = Upper Quartile -
Lower Quartile
IQR = 89.59 - 76.61 = 12.98
So, there is a spread of about 13 points in the
middle 50% of the exam scores.
73Outliers
Compute
lower quartile - 1.5(IQR) = 76.61 - 1.5(12.98) = 57.14
any data value below 57.14 is a low outlier
upper quartile + 1.5(IQR) = 89.59 + 1.5(12.98) = 109.06
any data value above 109.06 is a high outlier
74Fences
Lower fence is the smallest data value that is
not an outlier.
Upper fence is the largest data
value that is not an outlier.
75Calorie Content of Major Brands of Hotdogs