Title: Numerically
Chapter 3 Numerically Summarizing Data: Descriptive Statistics
Section 3.1 Measures of Central Tendency
- Measures of Central Tendency
- Mean
- Median
- Mode
- Midrange
Measures of central tendency are numeric values
that locate, in some sense, the middle of a data
set.
We have all heard the term average. Most often it refers to the mean, but one must be careful, because it can refer to the mean, the median, or the mode. Each measure gives different information about the data.
Recall: A parameter is a descriptive measure of a population and is denoted by a Greek letter. A statistic is a descriptive measure of a sample and is denoted by a Roman character.
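1) Mean
If $x_1, x_2, \ldots, x_N$ are the N observations of a variable from a population, then the population mean, $\mu$, is
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$
If $x_1, x_2, \ldots, x_n$ are the n observations of a variable from a sample, then the sample mean, $\bar{x}$, is
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$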
2) Median The median of a variable is the value
that lies in the middle of the data when arranged
in ascending order. That is, half the data are
below the median and half the data are above the
median. We use M to represent the median.
Steps in Computing the Median of a Data Set
- Arrange the data in ascending order.
- Determine the number of observations, n.
- Determine the observation in the middle of the data set.
  - a) If the number of observations is odd, then the median is the data value that is exactly in the middle of the data set. That is, the median is the observation that lies in the $\frac{n+1}{2}$ position.
  - b) If the number of observations is even, then the median is the mean of the two middle observations in the data set. That is, the median is the mean of the data values that lie in the $\frac{n}{2}$ and $\frac{n}{2} + 1$ positions.
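Odd n: 3 4 6 14 50 has n = 5, so the median is the 3rd value, M = 6.
Even n: 2 4 7 9 15 26 has n = 6, so the median is the mean of the 3rd and 4th values, M = (7 + 9)/2 = 8.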
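The steps for computing the median can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `median` is chosen here and is not part of the slides. It is applied to the two small data sets shown above.

```python
def median(values):
    """Median by the three steps: sort, count, pick the middle."""
    data = sorted(values)            # Step 1: arrange in ascending order
    n = len(data)                    # Step 2: number of observations
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2, which is 0-based index n // 2
        return data[mid]
    # Even n: mean of positions n/2 and n/2 + 1 (0-based indices mid - 1 and mid)
    return (data[mid - 1] + data[mid]) / 2


odd_median = median([3, 4, 6, 14, 50])
even_median = median([2, 4, 7, 9, 15, 26])
print(odd_median, even_median)
```

For the odd-sized set the middle value is returned unchanged; for the even-sized set the result is the mean of the two middle values.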
3) Mode The mode of a variable is the most frequent observation of the variable in the data set. If a data set has two values that occur with the highest frequency, we say the data are bimodal. If a data set has three or more data values that occur with the highest frequency, the data set is multimodal.
4) Midrange The midrange is the average of the smallest and largest data values.
Example: You are given the following starting salaries for five graduates from the business college at Iowa State University: 35000, 37000, 35000, 33000, 210000. Find the mean, median, mode, and midrange.
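Mean
(35000 + 37000 + 35000 + 33000 + 210000)/5 = 350000/5 = 70000
Median
1) Sort: 33000, 35000, 35000, 37000, 210000
2) n = 5 is odd, so the median is the value in the (5 + 1)/2 = 3rd position: M = 35000
Mode
35000 (it occurs twice; every other value occurs once)
Midrange
(33000 + 210000)/2 = 121500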
What would be the most appropriate measure of
central tendency to report to give a good
indication for future graduates about what
starting salary to expect? Why?
The mode or the median, because both are resistant to outliers (extreme values).
Notice that the mean is sensitive to extreme values (and the midrange, which uses only the two extremes, even more so), while the median is not. We say the median is resistant to extreme values.
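As a rough check of the salary example, the four measures can be computed with Python's standard `statistics` module. The helper `midrange` and the variable names are introduced here for illustration. The sketch also recomputes the mean and median after dropping the 210000 salary, to show which measure moves.

```python
from statistics import mean, median, multimode

salaries = [35000, 37000, 35000, 33000, 210000]


def midrange(values):
    """Average of the smallest and largest data values."""
    return (min(values) + max(values)) / 2


summary = {
    "mean": mean(salaries),
    "median": median(salaries),
    "mode": multimode(salaries),      # list of all most-frequent values
    "midrange": midrange(salaries),
}
print(summary)

# Drop the extreme salary and compare: the mean changes a lot, the median does not.
without_extreme = [x for x in salaries if x != 210000]
mean_without = mean(without_extreme)
median_without = median(without_extreme)
print(mean_without, median_without)
```

Removing the single extreme value halves the mean (70000 to 35000) but leaves the median at 35000, which is what resistance means in practice.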
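Let us take a look at how the mean compares to the median and mode in the various distributions.
- Skewed right: Mean > Median (order along the axis: Mode, Median, Mean)
- Skewed left: Mean < Median (order along the axis: Mean, Median, Mode)
- Symmetric: Mean = Median (Mode, Median, and Mean coincide)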
When do we use the mean, median, and mode?
- Mean: when the data are quantitative and the frequency distribution is roughly symmetric
- Median: when the data are quantitative and the frequency distribution is skewed left or skewed right
- Mode: when the most frequent observation is the desired measure of central tendency or the data are qualitative
Section 3.2 Measures of Dispersion
- Measures of Dispersion/Spread
- Range
- Interquartile Range (Sec 3.4, p.46)
- Variance
- Standard Deviation
To describe a distribution completely, we need more than just its center. Measures of dispersion indicate how much variability exists in a data set.
(10, 10, 10, 10, 10) vs. (1, 6, 10, 14, 19): both have a mean and a median of 10, but do they have the same distribution?
1) Range The range, R, of a variable is the difference between the largest data value and the smallest data value. That is,
Range = R = Largest Data Value - Smallest Data Value
2) Interquartile Range The interquartile range, IQR, is the difference between the third and first quartiles.
IQR = Q3 - Q1
3) Variance The variance is based upon the
difference between each observation and the mean.
It is calculated as a mean of the squared
deviations. The divisor used in the calculation
of this mean is dependent upon whether we are
calculating the population variance or the sample
variance.
Deviation about the mean: $x_i - \mu$ for a population, or $x_i - \bar{x}$ for a sample.
What would be the value if I were to add all the deviations? Why?
Equal to zero. The mean is the arithmetic center, so the observations above the mean are offset exactly by the observations below the mean.
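The claim that the deviations sum to zero can be checked on the data set (1, 6, 10, 14, 19) from earlier in this section. The sketch below uses exact fractions so that no rounding error enters; the function name `deviations` is an illustrative choice, and the input is assumed to be a list of integers.

```python
from fractions import Fraction


def deviations(values):
    """Deviation of each observation about the mean, computed exactly."""
    n = len(values)
    mean = Fraction(sum(values), n)   # exact mean, no rounding
    return [x - mean for x in values]


devs = deviations([1, 6, 10, 14, 19])
total = sum(devs)
print([int(d) for d in devs], total)
```

The negative deviations (-9 and -4) cancel the positive ones (4 and 9), so the total is exactly 0.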
The population variance of a variable is the sum of the squared deviations about the population mean divided by the number of observations in the population, N. That is, it is the mean of the squared deviations about the population mean. The population variance is symbolically represented by $\sigma^2$ (sigma squared):
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$
where $x_1, x_2, \ldots, x_N$ are the N observations in the population and $\mu$ is the population mean.
Notice that the population variance, $\sigma^2$, is a parameter.
An algebraically equivalent formula for computing the population variance is
$\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N}$
where $\sum x_i^2$ means to square each observation and then sum these squared values, and $(\sum x_i)^2$ means to add all the observations and then square the sum.
The sample variance, $s^2$, is computed by determining the sum of the squared deviations about the sample mean and dividing this result by n - 1. The formula for the sample variance from a sample of size n is
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Notice that the sample variance, $s^2$, is a statistic.
An algebraically equivalent formula for computing the sample variance is
$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$
where $\sum x_i^2$ means to square each observation and then sum these squared values, and $(\sum x_i)^2$ means to add all the observations and then square the sum.
Notice that $s^2$ describes the variance of a sample and is therefore a statistic.
Notice that the sample variance is obtained by dividing by n - 1. If we divided by n, as we might expect, the sample variance would consistently underestimate the population variance.
Whenever a statistic consistently overestimates or underestimates a parameter, it is called biased. To obtain an unbiased estimate of the population variance, divide the sum of the squared deviations about the mean by n - 1.
The first n - 1 observations are free to take any value, but the nth observation must be the value that forces the sum of the deviations about the mean to equal 0. For this reason n - 1 is called the degrees of freedom.
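The bias from dividing by n can be seen exactly on a toy case. The sketch below assumes a hypothetical three-value population (2, 4, 9), not taken from the slides, and lists every ordered sample of size 2 drawn with replacement. It then averages the two candidate variance formulas over all samples using exact fractions.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor):
    """Sum of squared deviations about the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((x - mean) ** 2 for x in values) / divisor


population = [2, 4, 9]                      # hypothetical toy population
sigma_sq = variance(population, len(population))

n = 2                                       # sample size
samples = list(product(population, repeat=n))   # all 9 ordered samples, with replacement

avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(sigma_sq, avg_div_n, avg_div_n_minus_1)
```

Averaged over all possible samples, the n - 1 version equals the population variance (26/3), while the n version comes out at half of it (13/3), an underestimate.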
4) Standard Deviation The population standard deviation, $\sigma$, is obtained by taking the square root of the population variance. That is,
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Notice that $\sigma$ is a parameter since it is describing a measure of the population.
The sample standard deviation, s, is obtained by taking the square root of the sample variance. That is,
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Notice that s is a statistic since it is describing a measure of the sample.
If the standard deviation is just the square root of the variance, what information is the standard deviation giving us that the variance is not? Can you think of a reason why we use the standard deviation over the variance?
Variance is measured in squared units. Standard deviation is measured in the original units of the data, so it is easier to interpret.
Understanding Standard Deviation: Recall the data set from Chapter 2 on the three-year rate of return on a mutual fund. Suppose we are comparing two mutual funds that have the same mean, but one has a standard deviation of 8 and the other a standard deviation of 20. Which would you invest your money in? Why?
The fund with a standard deviation of 8: its returns are clustered more closely about the mean, so they are less variable.
Example: Nine randomly selected students from a section of STAT 104 measured their pulse. The following data were obtained: 76, 60, 60, 81, 72, 80, 80, 68, 73. Find the range, variance, and standard deviation.
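Range
Largest - Smallest = 81 - 60 = 21
Variance
With n = 9 and a total of 650, the sample mean is 650/9, which rounds to 72.2.

          x_i    x_i - 72.2    (x_i - 72.2)^2    x_i^2
           76        3.8           14.44          5776
           60      -12.2          148.84          3600
           60      -12.2          148.84          3600
           81        8.8           77.44          6561
           72       -0.2            0.04          5184
           80        7.8           60.84          6400
           80        7.8           60.84          6400
           68       -4.2           17.64          4624
           73        0.8            0.64          5329
Totals    650        0.2          529.56         47474

Why is the sum of the deviations 0.2 and not zero? Rounding error: the mean was rounded to 72.2.
We have n = 9, $\sum x_i = 650$, and $\sum x_i^2 = 47474$.
Variance
$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{47474 - \frac{650^2}{9}}{8} \approx 66.19$
Standard Deviation
$s = \sqrt{s^2} \approx \sqrt{66.19} \approx 8.14$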
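The pulse example can be verified with a short script. It is a sketch that treats the nine pulses as a sample (divisor n - 1) and computes the variance two ways, from the definition and from the computational formula, to confirm that they agree. Variable names are illustrative.

```python
import math

pulses = [76, 60, 60, 81, 72, 80, 80, 68, 73]
n = len(pulses)

data_range = max(pulses) - min(pulses)

# Definition: sum of squared deviations about the sample mean, over n - 1
mean = sum(pulses) / n
s2_definition = sum((x - mean) ** 2 for x in pulses) / (n - 1)

# Computational formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)
sum_x = sum(pulses)
sum_x2 = sum(x * x for x in pulses)
s2_computational = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)
print(data_range, round(s2_definition, 2), round(s2_computational, 2), round(s, 2))
```

Because the script keeps the unrounded mean, its sum of squared deviations is slightly smaller than the 529.56 obtained with the mean rounded to 72.2; both routes give a variance of about 66.19.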
The Empirical Rule
If a distribution is roughly bell shaped, then
- Approximately 68% of the data will lie within one standard deviation of the mean.
- Approximately 95% of the data will lie within two standard deviations of the mean.
- Approximately 99.7% of the data will lie within three standard deviations of the mean.
[Figures: bell-shaped curves with areas of 0.68, 0.95, and 0.997 within one, two, and three standard deviations of the mean]
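The three percentages can be reproduced under the assumption that "roughly bell shaped" is modeled by an exact normal distribution. That modeling choice is made here and is not stated in the slides. The sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist


def share_within(k, dist=NormalDist()):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(k, round(share, 4))
```

For real data that are only approximately bell shaped, the observed proportions will differ somewhat from these values, which is why the rule says "approximately".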
Section 3.4 Measures of Position
Measures of position are used to describe the relative position of a certain data value within the entire set of data. Can you think of an example where you have been given a measure of position?
The median: it marks the middle of the data.
The z-score represents the number of standard deviations that a data value is from the mean. It is obtained by subtracting the mean from the data value and dividing this result by the standard deviation. There is both a population z-score and a sample z-score; their formulas are as follows:
Population z-score: $z = \frac{x - \mu}{\sigma}$
Sample z-score: $z = \frac{x - \bar{x}}{s}$
The z-score is unitless; it has a mean of 0 and a standard deviation of 1.
Z-scores provide a way of comparing apples to
oranges by converting variables with different
centers and/or spreads to variables with the same
center and spread.
Example: The average 20-29 year old man is 69.9
inches tall, with a standard deviation of 3.0
inches. The average 20-29 year old woman is 64.6
inches tall with a standard deviation of 2.8
inches. Who is relatively taller, a 75-inch man
or a 70-inch woman?
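One way to answer the height question is to convert each height to a z-score and compare. The sketch below assumes the stated means and standard deviations are the ones to use for each group; the function and variable names are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


z_man = z_score(75, mean=69.9, sd=3.0)
z_woman = z_score(70, mean=64.6, sd=2.8)
relatively_taller = "woman" if z_woman > z_man else "man"

print(round(z_man, 2), round(z_woman, 2), relatively_taller)
```

The man is about 1.70 standard deviations above the mean for men and the woman about 1.93 standard deviations above the mean for women, so the 70-inch woman is relatively taller.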
Percentiles are the values of the variable that divide a set of ranked data into 100 equal subsets. Each set of data has 99 percentiles. The kth percentile, denoted $P_k$, is a value such that at most k percent of the data are smaller in value than $P_k$ and at most (100 - k) percent of the data are larger in value.
Often, we are interested in knowing the percentile to which a specific data value corresponds.
Finding the Percentile that Corresponds to a Data Value
- Arrange the data in ascending order.
- Use the following formula to determine the percentile of the score, x:
  Percentile of x = (number of data values less than x / n) × 100
- Round this number to the nearest integer.
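The procedure above can be sketched as a small function. It assumes the percentile of x is the share of data values strictly less than x, times 100; the function name `percentile_of` is illustrative. It is applied to the sorted Chicago rainfall data used in the example later in this section.

```python
def percentile_of(x, values):
    """Percentile rank of x: percent of data values strictly below x, rounded."""
    data = sorted(values)                       # Step 1: ascending order
    below = sum(1 for v in data if v < x)       # count of values less than x
    # Steps 2 and 3. Note: Python's round() sends exact halves to the even integer.
    return round(100 * below / len(data))


rain = [0.97, 1.14, 1.85, 2.34, 2.47, 2.78, 3.41, 3.48, 3.94, 3.97,
        4.00, 4.02, 4.11, 4.77, 5.22, 5.50, 5.79, 6.14, 6.28, 7.69]

rank_628 = percentile_of(6.28, rain)
print(rank_628)
```

Eighteen of the twenty values are below 6.28, so the function returns 90, in line with the 90th percentile reported for that example.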
Quartiles are the most common percentiles. They divide the data into four equal parts.
- Q1 represents the 1st quartile. It is also the 25th percentile.
- Q2 represents the 2nd quartile. It is also the 50th percentile. Note: This is also the median.
- Q3 represents the 3rd quartile. It is also the 75th percentile.
Outliers are extreme observations in a data set.
Outliers distort both the mean and the standard
deviation since neither is a resistant measure.
Because these measures often form the basis for
most statistical inference, any conclusions drawn
from a data set that contains outliers can be
flawed.
We check for outliers using the interquartile
range.
Checking for Outliers by Using Quartiles
- Determine the first and third quartiles of the data.
- Compute the interquartile range. The interquartile range or IQR is the difference between the third and first quartiles.
  IQR = Q3 - Q1
- Determine the fences. Fences serve as cutoff points for determining outliers.
  Lower Fence = Q1 - 1.5(IQR)
  Upper Fence = Q3 + 1.5(IQR)
- If a data value is less than the lower fence or greater than the upper fence, then it is considered an outlier.
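Example: The following data represent the number of inches of rain in Chicago during the month of April for 20 randomly selected years.
Sorted data (n = 20):
0.97 1.14 1.85 2.34 2.47 2.78 3.41 3.48 3.94 3.97
4.00 4.02 4.11 4.77 5.22 5.50 5.79 6.14 6.28 7.69
Find the quartiles.
Q1 = (2.47 + 2.78)/2 = 2.625 (mean of the 5th and 6th values)
M = Q2 = (3.97 + 4.00)/2 = 3.985 (mean of the 10th and 11th values)
Q3 = (5.22 + 5.50)/2 = 5.36 (mean of the 15th and 16th values)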
Find the 67th percentile.
The 67th percentile is the 14th value in the sorted data, 4.77.
What percentile is represented by 6.28 inches of rain?
18 of the 20 values are less than 6.28, so 6.28 is at the (18/20)(100) = 90th percentile.
Are there any outliers in this data set?
IQR = Q3 - Q1 = 5.36 - 2.625 = 2.735
Lower Fence = Q1 - 1.5(IQR) = 2.625 - 4.1025 = -1.4775
Upper Fence = Q3 + 1.5(IQR) = 5.36 + 4.1025 = 9.4625
No data value is below -1.4775 or above 9.4625, so there are no outliers.
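The rainfall example can be reproduced in code. The percentile rule used below is an assumption: compute i = (k/100)n, average the ith and (i+1)th values if i is a whole number, otherwise round i up. It is one convention that matches the quartiles shown above; other textbooks and software use different rules and can give slightly different quartiles.

```python
def percentile(sorted_data, k):
    """k-th percentile (integer k, 1 to 99) of sorted data, assumed convention."""
    n = len(sorted_data)
    if (k * n) % 100 == 0:
        i = k * n // 100                       # i is a whole number
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    i = k * n // 100 + 1                       # otherwise round i up
    return sorted_data[i - 1]


rain = [0.97, 1.14, 1.85, 2.34, 2.47, 2.78, 3.41, 3.48, 3.94, 3.97,
        4.00, 4.02, 4.11, 4.77, 5.22, 5.50, 5.79, 6.14, 6.28, 7.69]

q1 = percentile(rain, 25)
q3 = percentile(rain, 75)
p67 = percentile(rain, 67)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in rain if x < lower_fence or x > upper_fence]

print(q1, q3, p67, iqr, lower_fence, upper_fence, outliers)
```

Every rainfall value lies between the two fences, so the list of outliers is empty.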
Section 3.5 The Five-Number Summary and Boxplots
Exploratory Data Analysis: This is the area of
statistics that looks at data in order to spot
any interesting results that might be concluded
from the data. The idea here is to draw graphs
of data and obtain measures of central tendency
and spread in order to form some conjectures
regarding the data.
Rather than numerically describing a distribution
via the mean and standard deviation, exploratory
data analysis summarizes a distribution by using
measures that are resistant to extreme
observations.
Such a measure would be the five number summary.
- Five Number Summary
- Minimum
- Q1
- M
- Q3
- Maximum
M is the median, NOT the mean
We use the five number summary to construct a
boxplot.
Drawing a Boxplot
- Determine the upper and lower fences.
- Draw vertical lines at each quartile. Enclose these vertical lines in a box.
- Label the lower and upper fences.
- Draw a line from the first quartile to the smallest data value that is larger than the lower fence. Draw a line from the third quartile to the largest data value that is smaller than the upper fence.
- Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk.
Distribution Shape Based upon Boxplot
- If the median is near the center of the box and each of the horizontal lines is of approximately equal length, then the distribution is roughly symmetric.
- If the median is to the left of the center of the box or the right line is substantially longer than the left line, the distribution is skewed right.
- If the median is to the right of the center of the box or the left line is substantially longer than the right line, the distribution is skewed left.
Example: The following data represent the number of grams of fat in breakfast meals offered at McDonald's: 12, 23, 28, 2, 31, 37, 34, 15, 23, 38, 31, 16, 11, 8, 8, 17, 20. Find the five number summary.
Sort
2 8 8 11 12 15 16 17 20 23 23 28 31 31 34 37 38
n = 17
- Five Number Summary
- Minimum = 2
- Q1 = 12
- M = 20
- Q3 = 31
- Maximum = 38
Construct a boxplot
IQR = 31 - 12 = 19
Lower Fence = 12 - 1.5(19) = -16.5
Upper Fence = 31 + 1.5(19) = 59.5
[Figure: boxplot of the fat data with Min = 2, Q1 = 12, M = 20, Q3 = 31, and Max = 38 labeled]
Comment on the shape of the distribution
Roughly symmetric
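The numbers needed for the boxplot can be computed as follows. The percentile rule is an assumption (i = (k/100)n, averaging when i is whole, rounding up otherwise); it is chosen because it reproduces Q1 = 12, M = 20, and Q3 = 31 for this data set. Other quartile conventions can give a different Q1. The sketch computes values only and does not draw the plot.

```python
def percentile(sorted_data, k):
    """k-th percentile (integer k, 1 to 99) of sorted data, assumed convention."""
    n = len(sorted_data)
    if (k * n) % 100 == 0:
        i = k * n // 100
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    i = k * n // 100 + 1                       # round i up
    return sorted_data[i - 1]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    return (data[0], percentile(data, 25), percentile(data, 50),
            percentile(data, 75), data[-1])


fat = [12, 23, 28, 2, 31, 37, 34, 15, 23, 38, 31, 16, 11, 8, 8, 17, 20]

minimum, q1, m, q3, maximum = five_number_summary(fat)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers fall outside the fences; the whiskers end at the most extreme
# values that are not outliers.
outliers = [x for x in fat if x < lower_fence or x > upper_fence]
inside = [x for x in fat if x not in outliers]
whisker_low, whisker_high = min(inside), max(inside)

print((minimum, q1, m, q3, maximum), lower_fence, upper_fence,
      (whisker_low, whisker_high), outliers)
```

With no outliers, the whiskers run all the way to the minimum (2) and the maximum (38). The median of 20 sits close to the center of the box (21.5), consistent with the "roughly symmetric" reading.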