Title: Describing Distributions with Numbers
1Chapter 2
- Describing Distributions with Numbers
2Numerical Summaries
- Center of distribution
- mean
- median
- Spread of distribution
- five-point summary (and interquartile range)
- standard deviation (and variance)
3Mean (Arithmetic Average)
- Traditional measure of center
- Notation: x̄ ("x-bar")
- Sum the values and divide by the sample size (n)
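Written out as a formula, the rule above is the usual definition of the sample mean. The symbols x_1, ..., x_n for the individual observations are introduced here for illustration and do not appear on the slide:

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
```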
4Mean Illustrative Example: Metabolic Rate
- Data: Metabolic rates, 7 men (cal/day)
- 1792 1666 1362 1614 1460 1867 1439
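As a minimal sketch of the calculation for these seven values, the mean can be computed by summing and dividing by n. The variable names are illustrative only:

```python
# Metabolic rates (cal/day) for the 7 men on this slide.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

# Mean = sum of the values divided by the sample size n.
n = len(rates)
total = sum(rates)
mean = total / n

print(f"n = {n}, sum = {total}, mean = {mean}")  # n = 7, sum = 11200, mean = 1600.0
```

The result, 1600 cal/day, is the value of the mean used in the later slides.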
5Median (M)
- Half of the ordered values are less than or equal to the median value
- Half of the ordered values are greater than or equal to the median value
- If n is odd, the median is the middle ordered value
- If n is even, the median is the average of the two middle ordered values
6Median
- Example 1 data: 2 4 6
- Median (M) = 4
- Example 2 data: 2 4 6 8
- Median = 5 (average of 4 and 6)
- Example 3 data: 6 2 4
- Median ≠ 2
- (order the values first: 2 4 6, so Median = 4)
7Location of the Median L(M)
- Location of the median: L(M) = (n + 1)/2, where n = sample size.
- Example: If 25 data values are recorded, the median is located at position (25 + 1)/2 = 13 in the ordered array.
8Median Illustrative Example
- Data: Metabolic rates, n = 7
- 1792 1666 1362 1614 1460 1867 1439
- L(M) = (7 + 1)/2 = 4
- Ordered array: 1362 1439 1460 1614 1666 1792 1867
- The median is the 4th ordered value: 1614
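The location rule can be sketched in a few lines of Python. This is one possible implementation, assuming the 1-based position L(M) = (n + 1)/2 from the previous slide. The function name `median` is a choice made here, not something from the slides:

```python
def median(values):
    """Median via the location rule L(M) = (n + 1) / 2 on the ordered data."""
    ordered = sorted(values)            # always order the values first
    n = len(ordered)
    location = (n + 1) / 2              # 1-based position in the ordered array
    if n % 2 == 1:
        return ordered[int(location) - 1]   # odd n: the middle ordered value
    lower = ordered[int(location) - 1]      # even n: location ends in .5,
    upper = ordered[int(location)]          # so average the two middle values
    return (lower + upper) / 2

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
print(median(rates))         # 1614
print(median([2, 4, 6, 8]))  # 5.0
print(median([6, 2, 4]))     # 4
```

The last call repeats Example 3 from the earlier slide: without the sort, the middle value of 6 2 4 would be read off as 2.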
9Comparing the Mean & Median
- Mean = median when data are symmetrical
- Mean ≠ median when data are skewed or have an outlier (the mean is pulled toward the tail), while the median is more resistant
- If we switch this: 1362 1439 1460 1614 1666 1792 1867
- to this: 1362 1439 1460 1614 1666 1792 9867
- the median is still 1614, but the mean goes from 1600 to 2742.9
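The effect of the single outlier can be checked directly. The sketch below uses Python's standard `statistics` module. The list names are illustrative:

```python
from statistics import mean, median

original = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_outlier = [1362, 1439, 1460, 1614, 1666, 1792, 9867]  # 1867 replaced by 9867

for label, data in [("original", original), ("with outlier", with_outlier)]:
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
# original: mean = 1600.0, median = 1614
# with outlier: mean = 2742.9, median = 1614
```

Only the largest value changed, so the middle ordered value stays put while the sum, and therefore the mean, shifts.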
10Question
- The average salary at a high tech company is 250K / year
- The median salary is 60K.
- How can this be?
- Answer: There are some very highly paid executives, but most of the workers make modest salaries
11Spread = Variability
- Variability: the amount by which values spread above and below the center
- Can be measured in several ways
- range (rarely used)
- 5-point summary and inter-quartile range
- variance and standard deviation
12Range
- Based on smallest (minimum) and largest (maximum) values in the data set
- Range = max − min
- The range is not a reliable measure of spread (affected by outliers, biased)
13Quartiles
- Three numbers which divide the ordered data into four equal-sized groups.
- Q1 has 25% of the data below it.
- Q2 has 50% of the data below it. (Median)
- Q3 has 75% of the data below it.
14Obtaining the Quartiles
- Order the data.
- Find the median
- This is Q2
- Look at the lower half of the data (those below the median)
- The median of this lower half = Q1
- Look at the upper half of the data
- The median of this upper half = Q3
15Illustrative example: 10 ages
- AGE (years) values, ordered array (n = 10)
- 05 11 21 24 27 28 30 42 50 52
- Q1 = 21
- Q2 = average of 27 and 28 = 27.5
- Q3 = 42
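The median-of-halves procedure can be sketched as follows. This assumes the convention used on these slides (when n is odd, the median itself is left out of both halves). Statistical software often uses other quartile rules and may give slightly different values. The function names are illustrative:

```python
def median(ordered):
    """Median of an already-ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of each half of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median's location
    upper = ordered[half + (n % 2):]   # values above it (median skipped if n is odd)
    return median(lower), median(ordered), median(upper)

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
print(quartiles(ages))  # (21, 27.5, 42)
```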
16Weight Data Sorted: n = 53
- Median: L(M) = (53 + 1)/2 = 27, placing it at 165
- L(Q1) = (26 + 1)/2 = 13.5, placing it between 127 and 128 (127.5)
- L(Q3) = 13.5 from the top, placing it between 185 and 185
- Q1 = 127.5, Q2 = 165, Q3 = 185
17Weight Data: Quartiles
Stemplot (stem = hundreds and tens, leaf = ones):
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
Q1 = 127.5
Q2 = 165
Q3 = 185
18Five-Number Summary
- minimum 100
- Q1 127.5
- M 165
- Q3 185
- maximum 260
IQR = Q3 − Q1 = 185 − 127.5 = 57.5; the IQR gives the spread of the middle 50% of the data
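As a rough check, the five-number summary can be recomputed from the 53 weights read off the stemplot on the Weight Data slide. The sketch assumes the same median-of-halves quartile rule as the slides. The names `stems`, `weights`, and `five_number_summary` are introduced here for illustration:

```python
def median(ordered):
    """Median of an already-ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])              # median of the lower half
    q3 = median(ordered[half + (n % 2):])    # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Stem (hundreds and tens) -> leaves (ones), transcribed from the stemplot.
stems = {
    10: "0166", 11: "009", 12: "0034578", 13: "00359", 14: "08",
    15: "00257", 16: "555", 17: "000255", 18: "000055567", 19: "245",
    20: "3", 21: "025", 22: "0", 26: "0",
}
weights = [stem * 10 + int(leaf) for stem, leaves in stems.items() for leaf in leaves]

summary = five_number_summary(weights)
iqr = summary[3] - summary[1]    # IQR = Q3 - Q1
print(len(weights))  # 53
print(summary)       # (100, 127.5, 165, 185.0, 260)
print(iqr)           # 57.5
```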
19Boxplot
- Central box spans Q1 and Q3.
- A line in the box marks the median M.
- Lines extend from the box out to the minimum and
maximum.
20Weight Data Boxplot
21Quartile extrapolation
- Quartiles divide the data set into 4 segments: bottom, bottom middle, top middle, upper
- With small data sets, extrapolate values
- Illustrative data: 2, 4, 6, 8
- Q1 = average of 2 and 4, which is 3
- Q2 = average of 4 and 6, which is 5
- Q3 = average of 6 and 8, which is 7
22Boxplots: useful for comparing two groups (text p. 39)
23Variance & Standard Deviation
- The most common measures of spread
- Based on deviations around the mean
- Each data value has a deviation, defined as: deviation = x_i − x̄ (the value minus the mean)
24Fig 2.3 Metabolic Rate for 7 men, with their mean (x̄) and two deviations shown
25Variance
- Find the mean
- Find the deviation of each value
- Square the deviations
- Sum the squared deviations; we call this the sum of squares, or SS
- Divide the SS by n − 1
- (gives typical squared deviation from mean)
26Variance Formula
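The steps on the Variance slide correspond to the standard sample-variance formula below. The index notation x_i is an assumption added here for illustration:

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{SS}{n-1}
```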
27Standard Deviation: Square root of the variance
28Variance and Standard Deviation: Illustrative Example
- Data: Metabolic rates, 7 men (cal/day)
- 1792 1666 1362 1614 1460 1867 1439
29Variance and Standard Deviation: Illustrative Example (cont.)
Observations   Deviations             Squared deviations
1792           1792 − 1600 = 192      (192)² = 36,864
1666           1666 − 1600 = 66       (66)² = 4,356
1362           1362 − 1600 = −238     (−238)² = 56,644
1614           1614 − 1600 = 14       (14)² = 196
1460           1460 − 1600 = −140     (−140)² = 19,600
1867           1867 − 1600 = 267      (267)² = 71,289
1439           1439 − 1600 = −161     (−161)² = 25,921
               sum = 0                SS = 214,870
30Variance and Standard Deviation: Illustrative Example (cont.)
Notes:
(1) Use standard deviation s for descriptive purposes
(2) Variance and standard deviation are calculated by calculator or computer in practice
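To illustrate note (2), here is a minimal sketch that carries the table's sum of squares through to the variance and standard deviation, then cross-checks the result against Python's standard `statistics` module. The variable names are illustrative:

```python
from math import sqrt
from statistics import stdev, variance

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]  # cal/day
n = len(rates)
xbar = sum(rates) / n                        # mean = 1600.0

deviations = [x - xbar for x in rates]       # each value minus the mean
ss = sum(d ** 2 for d in deviations)         # sum of squares, SS
var = ss / (n - 1)                           # divide SS by n - 1
sd = sqrt(var)                               # standard deviation

print(f"SS = {ss:.0f}, variance = {var:.2f}, s = {sd:.2f}")
# SS = 214870, variance = 35811.67, s = 189.24

# The library functions use the same n - 1 divisor.
print(f"{variance(rates):.2f} {stdev(rates):.2f}")  # 35811.67 189.24
```

The standard deviation comes out in the same units as the data (cal/day), which is why it is preferred for descriptive purposes.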
31Summary Statistics
- Two main measures of central location
- Mean (x̄)
- Median (M)
- Two main measures of spread
- Standard deviation (s)
- 5-point summary (interquartile range)
32Choosing Summary Statistics
- Use the mean and standard deviation for reasonably symmetric distributions that are free of outliers.
- Use the median and IQR (or 5-point summary) when data are skewed or when outliers are present.
33Example: Number of Books Read
- L(M) = (52 + 1)/2 = 26.5
34Illustrative example: Books read
- 5-point summary: 0, 1, 3, 5.5, 99
- Note the highly asymmetric distribution
- x̄ = 7.06, s = 14.43
- The mean and standard deviation give a false impression with asymmetric data