Title: Summarizing Data
1Summarizing Data
2Overview
- To frame our discussion, consider
3Outline
- Distribution
- Measures of Central Tendency
- Measures of Variability
4Distribution
- The distribution of a variable tells us the
values it takes, and how often it takes the
values. The distribution is a summary of the
frequency of individual values or ranges of
values for a variable.
5Description
- A distribution is described by its shape, its
center, and its spread.
6Histogram Shapes
symmetric unimodal
bimodal
positively skewed
negatively skewed
7Measures of Central Tendency
8Mode
- The mode of a set of measurements is the
measurement that occurs most often, that is, the
measurement with the highest frequency.
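
As a minimal sketch of this definition, the mode can be found by counting how often each measurement occurs. The function name `modes` and the sample lists are illustrative only; returning every tied value is an assumption made here, since the slide does not say how ties are handled.

```python
from collections import Counter


def modes(measurements):
    """Return every value tied for the highest frequency, in first-seen order."""
    if not measurements:
        raise ValueError("the mode of an empty data set is undefined")
    counts = Counter(measurements)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Quantitative data: 3 occurs three times, more than any other value.
print(modes([2, 3, 3, 5, 7, 3, 5]))
# Qualitative data works too; here two colours are tied.
print(modes(["red", "blue", "red", "blue", "green"]))
```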
9Median
The sample median is the middle value in a
set of data that is arranged in ascending order.
For an even number of data points the median is
the average of the middle two.
Population median
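
The rule for the sample median (sort, then take the middle value or the average of the middle two) can be sketched as follows. The name `sample_median` and the data are made up for illustration.

```python
def sample_median(data):
    """Middle value of the sorted data; average of the middle two if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([7, 1, 5, 3, 9]))  # odd n: sorted is 1 3 5 7 9, middle is 5
print(sample_median([7, 1, 5, 3]))     # even n: sorted is 1 3 5 7, (3 + 5) / 2 = 4.0
```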
10Mean
- The mean of a set of measurements is the sum of
the measurements divided by the total number of
measurements.
11The Mean
The average (mean) of the n numbers $x_1, x_2, \ldots, x_n$ is
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
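
A short sketch of the mean as "sum divided by count", checked against Python's standard `statistics.mean`. The helper name `mean_of` and the numbers are illustrative.

```python
import statistics


def mean_of(values):
    """Sum of the measurements divided by the number of measurements."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / n


data = [2, 4, 4, 10]
print(mean_of(data))          # (2 + 4 + 4 + 10) / 4 = 5.0
print(statistics.mean(data))  # the standard library gives the same value
```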
12Three Different Shapes for a Population
Distribution
symmetric
positive skew
negative skew
13Summary
- Mode and median are not influenced by outliers.
- Outliers influence the mean.
- Mode is useful for qualitative and quantitative
data.
- Median and mean are useful for quantitative data
only.
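
The points in this summary can be checked on a small made-up data set. The values below are hypothetical; the second list simply appends one extreme value to the first.

```python
import statistics

values = [30, 32, 35, 35, 38]    # hypothetical measurements
with_outlier = values + [400]    # same data plus one extreme value

for label, data in (("original", values), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )

# The mode also works on qualitative data, where mean and median do not apply.
print(statistics.mode(["yes", "no", "yes", "yes", "no"]))
```

In this example the mean moves from 34 to 95 when the outlier is added, while the median and mode both stay at 35.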
14Measures of Variability
- Range
- Percentile
- Interquartile Range
- Variance
- Standard Deviation
15Range
- The range of a set of measurements is the
difference between the largest and the smallest
measurement.
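
As a quick illustration, the range is a single subtraction. The name `data_range` and the data are illustrative.

```python
def data_range(measurements):
    """Largest measurement minus smallest measurement."""
    if not measurements:
        raise ValueError("the range of an empty data set is undefined")
    return max(measurements) - min(measurements)


print(data_range([4, 9, 1, 7]))  # 9 - 1 = 8
```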
16Percentile
- The pth percentile of a set of ordered
measurements is the value that has p% of the
measurements below it and (100-p)% above it.
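
The definition above does not say what to do when no single measurement sits exactly at the pth percentile, and several conventions are in use. The sketch below assumes one common choice, linear interpolation between the two nearest ordered values; the function name `percentile` and the data are illustrative, and other conventions can give slightly different answers.

```python
def percentile(data, p):
    """pth percentile using linear interpolation between ordered values.

    Assumption: position p/100 * (n - 1) in the sorted data, which is only
    one of several percentile conventions.
    """
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(data)
    if not ordered:
        raise ValueError("percentile of an empty data set is undefined")
    position = (p / 100) * (len(ordered) - 1)
    lower = int(position)                        # index just below the position
    upper = min(lower + 1, len(ordered) - 1)     # index just above, clamped
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


data = [10, 20, 30, 40, 50]
print(percentile(data, 25))  # 20.0
print(percentile(data, 50))  # 30.0, the median
```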
17Interquartile Range
- The interquartile range of a set of measurements
is the difference between the upper and lower
quartiles.
- IQR = 75th percentile - 25th percentile
18Upper and Lower Fourths
After the n observations in a data set are
ordered from smallest to largest, the lower
(upper) fourth is the median of the smallest
(largest) half of the data, where the median
is included in both halves if n is odd. A
measure of the spread that is resistant to
outliers is the fourth spread, $f_s$ = upper
fourth - lower fourth (IQR).
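
The definition of the fourths translates fairly directly into code: sort, split into halves (sharing the median when n is odd), and take the median of each half. The function names below are illustrative.

```python
import statistics


def fourths(data):
    """Return (lower fourth, upper fourth) of the data."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("fourths of an empty data set are undefined")
    half = (n + 1) // 2  # when n is odd, the median lands in both halves
    lower = statistics.median(ordered[:half])
    upper = statistics.median(ordered[n - half:])
    return lower, upper


def fourth_spread(data):
    """Upper fourth minus lower fourth."""
    lower, upper = fourths(data)
    return upper - lower


print(fourths([1, 2, 3, 4, 5, 6, 7, 8]))     # halves 1..4 and 5..8
print(fourth_spread([1, 2, 3, 4, 5, 6, 7]))  # halves 1..4 and 4..7 share the median 4
```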
19Outliers
Any observation farther than $1.5 f_s$ from the
closest fourth is an outlier. An outlier is
extreme if it is more than $3 f_s$ from the nearest
fourth, and it is mild otherwise.
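
The outlier rule can be sketched as below. The fourths are recomputed inside the function so that the example is self-contained; the function name `classify_outliers` and the data are hypothetical.

```python
import statistics


def classify_outliers(data):
    """Return (mild, extreme) outliers using the 1.5 and 3 fourth-spread rule."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("cannot classify outliers in an empty data set")
    half = (n + 1) // 2  # the median is shared by both halves when n is odd
    lower = statistics.median(ordered[:half])
    upper = statistics.median(ordered[n - half:])
    fs = upper - lower

    mild, extreme = [], []
    for x in ordered:
        # Distance beyond the closest fourth; 0 for values between the fourths.
        distance = max(lower - x, x - upper, 0)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return mild, extreme


# Lower fourth is 13, upper fourth is 18, so the fourth spread is 5.
data = [10, 12, 13, 14, 15, 16, 18, 27, 40]
print(classify_outliers(data))  # 27 is 9 above the upper fourth, 40 is 22 above
```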
20Boxplots
[Figure: boxplot drawn against a scale, with labels for the lower fourth, median, upper fourth, mild outliers, and extreme outliers]
21Variance
- The variance of a set of n measurements is the
sum of the squared deviations from the mean divided
by either n or n-1.
- The choice of n or n-1 depends on whether we are
dealing with a population (divide by n) or a sample
from that population (divide by n-1).
22Sample Variance
Variance is a measure of the spread of the data.
The sample variance of the sample $x_1, x_2, \ldots, x_n$ of
n values of X is given by
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
We refer to $s^2$ as being based on n - 1 degrees of
freedom.
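
A minimal sketch of the sample variance, dividing the sum of squared deviations from the mean by n - 1, compared with the standard library. `statistics.variance` uses the n - 1 divisor and `statistics.pvariance` uses n; the helper name `sample_variance` and the data are illustrative.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("the sample variance needs at least two values")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]    # mean 5, squared deviations sum to 32
print(sample_variance(data))       # 32 / 7
print(statistics.variance(data))   # same n - 1 divisor
print(statistics.pvariance(data))  # population version: 32 / 8
```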
23Standard Deviation
Standard deviation is a measure of the spread of
the data using the same units as the data.
The sample standard deviation is the square root
of the sample variance: $s = \sqrt{s^2}$
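
To illustrate the point about units, the sketch below uses hypothetical heights in centimetres: the variance comes out in squared centimetres, and the standard deviation is back in centimetres. The name `sample_std` is illustrative.

```python
import math
import statistics


def sample_std(xs):
    """Square root of the sample variance (n - 1 divisor)."""
    return math.sqrt(statistics.variance(xs))


heights_cm = [150, 160, 170, 180, 190]   # hypothetical data, in cm
print(statistics.variance(heights_cm))   # 250, in cm squared
print(sample_std(heights_cm))            # square root of 250, in cm
```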
24Formula for $s^2$
An alternative expression for the numerator of $s^2$
is
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$
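
The two forms of the numerator, the sum of squared deviations and the shortcut based on the sum of squares and the squared sum, can be compared numerically. The function names are illustrative. With floating-point data whose mean is large relative to its spread, the shortcut form can lose precision through cancellation, so this is a check of the algebra rather than a recommendation.

```python
import math


def sxx_definition(xs):
    """Sum of squared deviations from the sample mean."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)


def sxx_shortcut(xs):
    """Sum of squares minus (sum squared) / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sxx_definition(data))  # 32.0
print(sxx_shortcut(data))    # 232 - 1600 / 8 = 32.0
```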
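25Properties of $s^2$
Let $x_1, x_2, \ldots, x_n$ be any sample and c be any
nonzero constant.
1. If $y_i = x_i + c$ for each i, then $s_y^2 = s_x^2$.
2. If $y_i = c x_i$ for each i, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$,
where $s_x^2$ is the sample variance of the xs and
$s_y^2$ is the sample variance of the ys.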
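
The standard shift and scale properties of the sample variance (adding a constant leaves it unchanged, multiplying by c multiplies it by c squared) can be checked numerically. The data and the constant c = -3 are arbitrary choices for illustration.

```python
import math
import statistics

xs = [2, 4, 4, 4, 5, 5, 7, 9]
c = -3  # any nonzero constant

shifted = [x + c for x in xs]  # y = x + c
scaled = [c * x for x in xs]   # y = c * x

var_x = statistics.variance(xs)

# Adding a constant moves the data but does not change its spread.
print(math.isclose(statistics.variance(shifted), var_x))
# Scaling by c multiplies the variance by c squared ...
print(math.isclose(statistics.variance(scaled), c ** 2 * var_x))
# ... and the standard deviation by the absolute value of c.
print(math.isclose(statistics.stdev(scaled), abs(c) * statistics.stdev(xs)))
```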
26Standard Deviation
- The standard deviation of a set of measurements
is the positive square root of the variance.
27Example
28Distribution
29Distribution (3)
30Distribution (4)