Title: Numerical Methods for Describing Data
Chapter 4
- Numerical Methods for Describing Data
4.1 Describing the Center of a Data Set
- Sample Mean
- x = the variable for which we have sample data
- n = the number of observations in the sample (sample size)
- x1, x2, . . ., xn = the observations in the sample
- The sample mean, denoted by $\bar{x}$, is $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum x}{n}$
- Population Mean
- The population mean, denoted by µ, is the average of all x values in the entire population
4.1 Describing the Center of a Data Set
- Example: Traumatic knee dislocation often requires surgery to repair ruptured ligaments. One measure of recovery is range of motion (measured as the angle formed when, starting with the leg straight, the knee is bent as far as possible). The following are postsurgical ranges of motion (in degrees) for a sample of 13 patients:
- 154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122
- Find the sample mean range of motion.
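The sample mean for the range-of-motion data can be checked with a few lines of Python. This is a minimal sketch: the function name `sample_mean` is an illustrative choice, not something from the slides, and the data are the 13 values listed above.

```python
# Postsurgical range of motion (degrees) for the 13 patients in the example.
range_of_motion = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]


def sample_mean(observations):
    """Return the sample mean: the sum of the observations divided by n."""
    n = len(observations)
    if n == 0:
        raise ValueError("the sample must contain at least one observation")
    return sum(observations) / n


x_bar = sample_mean(range_of_motion)
print(f"n = {len(range_of_motion)}, sum = {sum(range_of_motion)}, mean = {x_bar:.2f}")
```

The 13 observations sum to 1695, so the sample mean is 1695/13, or about 130.38 degrees.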
Example: Three Samples from the Population of All US Counties (x = number of residents)
- Population
- The population of the US is 293,655,404 (2004 census), and there are 3137 counties in the US.
- Population Mean
- µ = ?
- Mean = ?
The mean of Sample 3 is 23,643.6. The population mean is 293,655,404 / 3137 = 93,610.27.
Example: Number of Visits to a Class Web Site (The Mean)
- Forty students were enrolled in a section of a statistics class. One month after the course began, the instructor requested a report that indicated how many times each student had accessed the class web site. The 40 observations were listed on the right.
- 1. Find the mean.
- 2. Does the mean represent the center of the data set well?
Answer to 2: The mean, 23.10, is not a very representative value for the center of this sample, because it is larger than most of the observations in the data. The outlying values 331 and, to a lesser extent, 84 have a substantial impact on the mean.
4.1 Describing the Center of a Data Set
- The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then
- sample median = the single middle value if n is odd, OR
- sample median = the average of the middle two values if n is even.
- Examples: (a) The median of the data set 5, 3, 7, 9, 11, 4, 16 is 7, which is the single middle value of the ordered observations 3, 4, 5, 7, 9, 11, 16. (In this sample, n = 7.)
- (b) The median of the data set 7, 3, 9, 5, 11, 4, 16, 10 is 8, which is the average of the two middle values of the ordered observations 3, 4, 5, 7, 9, 10, 11, 16. (½ (7 + 9) = 8)
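The ordering-then-middle-value procedure can be sketched in Python as follows. The function name `sample_median` is illustrative only; the two calls reuse examples (a) and (b) from this slide.

```python
def sample_median(observations):
    """Return the sample median using the order-then-pick-the-middle rule."""
    ordered = sorted(observations)  # repeated values are kept in the list
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample must contain at least one observation")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single middle value
        return ordered[mid]
    # n even: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


median_a = sample_median([5, 3, 7, 9, 11, 4, 16])      # example (a), n = 7
median_b = sample_median([7, 3, 9, 5, 11, 4, 16, 10])  # example (b), n = 8
print(median_a, median_b)
```

The two results are 7 and 8, matching the worked examples.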
Example: Number of Visits to a Class Web Site (The Median)
- Forty students were enrolled in a section of a statistics class. One month after the course began, the instructor requested a report that indicated how many times each student had accessed the class web site. The 40 observations were listed on the right.
- Find the median.
- Does the median represent the center of the data set better than the mean?
Answer on the next slide
Solution to the Example: Number of Visits to a Class Web Site (The Median)
- First arrange the data from the smallest to the largest.
The median is the average of the middle two values: (13 + 13)/2 = 13. The median represents the center of this data set better than the mean (23.10).
Comparing the Mean and the Median
- The mean can be sensitive to even a single value that lies far above or below the rest of the data (an outlier). This is why the mean number of website visits is not a very representative value of that sample: it is pulled upward by the outlying value 331.
- The median is quite insensitive to outliers.
- If the histogram of the data set is symmetric, mean = median.
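A small experiment makes the contrast concrete. The numbers below are hypothetical, not the class web site data (which is not reproduced in these notes). The largest value of a small sample is swapped for 331 to see how each measure of center reacts.

```python
from statistics import mean, median

# Hypothetical sample, used only for illustration.
sample = [3, 4, 5, 7, 9, 11, 16]

# Same sample with the largest value replaced by one far-away value.
sample_with_outlier = sample[:-1] + [331]

mean_before, mean_after = mean(sample), mean(sample_with_outlier)
median_before, median_after = median(sample), median(sample_with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

In this sketch a single changed observation moves the mean from about 7.86 to about 52.86, while the median stays at 7.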
4.2 Describing Variability in a Data Set
- The range of a data set is defined as
- range = largest observation - smallest observation
- The n deviations from the sample mean are the differences $x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_n - \bar{x}$
- Except for the effects of rounding in computing the deviations, it is always true that $\sum (x - \bar{x}) = 0$
4.2 Describing Variability in a Data Set
- The sample variance, denoted by $s^2$, is defined by $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
- The sample standard deviation is the positive square root of the sample variance and is denoted by s: $s = \sqrt{s^2}$
The mean, median, sample variance, and standard deviation are all available in Excel by clicking Data → Data Analysis → Descriptive Statistics. In the dialog box, choose Summary statistics.
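Outside Excel, the same quantities can be computed directly from the definitions. The sketch below reuses the 13 range-of-motion values from Section 4.1 as sample data. The function names are illustrative, not from the slides.

```python
import math

# Range-of-motion data (degrees) from the Section 4.1 example.
data = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]


def deviations_from_mean(observations):
    """Return the n deviations x - x_bar."""
    x_bar = sum(observations) / len(observations)
    return [x - x_bar for x in observations]


def sample_variance(observations):
    """Sum of squared deviations divided by n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are needed")
    return sum(d * d for d in deviations_from_mean(observations)) / (n - 1)


def sample_std_dev(observations):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(observations))


data_range = max(data) - min(data)
deviation_sum = sum(deviations_from_mean(data))  # zero up to rounding
s2 = sample_variance(data)
s = sample_std_dev(data)
print(f"range = {data_range}, variance = {s2:.2f}, sd = {s:.2f}")
```

Dividing by n - 1 rather than n follows the definition of the sample variance. The sum of the deviations comes out as zero apart from floating-point rounding.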
4.2 Describing Variability in a Data Set
- Example: Research by the FDA shows that acrylamide (a possible cancer-causing substance) forms in high-carbohydrate foods cooked at high temperatures and that acrylamide levels can vary widely even within the same brand of food. FDA scientists analyzed McDonald's French fries purchased at seven different locations and found the acrylamide levels shown on the right.
- 1. What is the range?
- 2. What are the deviations?
- 3. What is the sample variance?
- 4. What is the standard deviation?
Answers on the next slide.
Answers to the example of acrylamide data
1. Range = 497 - 155 = 342
(See the table on the right.)
Number of Visits to a Class Web Site
- Forty students were enrolled in a section of a statistics class. One month after the course began, the instructor requested a report that indicated how many times each student had accessed the class web site. The 40 observations were listed on the right.
- Find the sample variance and standard deviation.
- We use Excel to solve the problem on the next three slides.
- Input data.
- Click Data.
- Click Data Analysis.
- Choose Descriptive Statistics.
- Click OK.
- A dialog box will open.
- Enter A1:A40 in the Input Range.
- Check Summary statistics.
- Click OK.
- Excel gives the results for Mean, Median, Sample Variance, Standard Deviation, and many others. (You may need to adjust the width of the columns.)
Describing Variability in a Data Set: Interquartile Range
- Lower quartile = median of the lower half of the sample
- Upper quartile = median of the upper half of the sample
- (If n is odd, the median of the entire sample is excluded from both halves.)
- The interquartile range (iqr), a measure of variability that is not as sensitive to the presence of outliers as the standard deviation, is given by
- iqr = upper quartile - lower quartile
Example: Hospital Cost-to-Charge Ratios
- The cost-to-charge ratio is computed as the ratio of the actual cost of care to what the hospital actually bills, and the ratio is usually expressed as a percentage. A cost-to-charge ratio of 60% means that the actual cost is 60% of what was billed. The ratios for inpatient services at 31 hospitals in Oregon, published by the Oregon Department of Health in 2002, were
- 68, 76, 60, 88, 69, 80, 75, 67, 71, 100, 63, 62
- 71, 74, 64, 48, 100, 72, 65, 50, 72, 100, 63, 45
- 54, 60, 75, 57, 74, 84, 83
- Find the lower quartile, upper quartile, and interquartile range (iqr).
Solution to the Example: Hospital Cost-to-Charge Ratios
- Ordered data
- Lower half: 45 48 50 54 57 60 60 62 63 63 64 65 67 68 69
- Median: 71
- Upper half: 71 72 72 74 74 75 75 76 80 83 84 88 100 100 100
- Lower quartile = 62
- Upper quartile = 76
- iqr = 76 - 62 = 14
- The mean = 70.65
- The standard deviation = 14.11
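The quartile rule used here (split the ordered data into halves, dropping the overall median when n is odd) can be sketched in Python. The function names are illustrative. Statistical software often uses other quartile conventions, so library results may differ slightly from this rule.

```python
ratios = [68, 76, 60, 88, 69, 80, 75, 67, 71, 100, 63, 62,
          71, 74, 64, 48, 100, 72, 65, 50, 72, 100, 63, 45,
          54, 60, 75, 57, 74, 84, 83]


def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(observations):
    """Lower and upper quartiles as medians of the lower and upper halves.

    If n is odd, the overall median is excluded from both halves.
    """
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower_half), median_of_sorted(upper_half)


lower_q, upper_q = quartiles(ratios)
iqr = upper_q - lower_q
print(f"lower quartile = {lower_q}, upper quartile = {upper_q}, iqr = {iqr}")
```

For the 31 hospital ratios this gives a lower quartile of 62, an upper quartile of 76, and an iqr of 14.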
4.3 Summarizing a Data Set: Boxplots
- Construction of a Skeletal Boxplot
- 1. Draw a horizontal (or vertical) measurement scale.
- 2. Construct a rectangular box with a left (or lower) edge at the lower quartile and a right (or upper) edge at the upper quartile. The box width is then equal to the iqr.
- 3. Draw a vertical (or horizontal) line segment inside the box at the location of the median.
- 4. Extend line segments, called whiskers, from each end of the box to the smallest and largest observations in the data set.
Boxplot for Hospital Cost-to-Charge Ratios
- Example: Revisiting Hospital Cost-to-Charge Ratios
- Lower half: 45 48 50 54 57 60 60 62 63 63 64 65 67 68 69
- Median: 71
- Upper half: 71 72 72 74 74 75 75 76 80 83 84 88 100 100 100
- Construct a modified boxplot of the data.
- smallest observation = 45, largest observation = 100
- lower quartile = 62, upper quartile = 76
- median = 71
Modified Boxplot
- An observation is an outlier if it is more than 1.5 iqr away from the nearest quartile (the nearest end of the box).
- An outlier is extreme if it is more than 3 iqr from the nearest end of the box; otherwise it is mild.
- A modified boxplot:
- Extend whiskers from each end of the box to the most extreme observation that is not an outlier.
- Draw a solid circle to mark the location of any mild outlier.
- Draw an open circle to mark the location of any extreme outlier in the data set.
Example: Summarizing a Data Set
- The 51 observations for gross state product (in billions of dollars) for the 50 states and D.C. are as follows:
- 16, 17, 18, 20, 21, 24, 30, 31, 32, 34, 40, 40, 41, 48, 52, 54, 60, 62, 62, 63, 77, 82, 85, 100, 105,
- 107,
- 110, 129, 134, 142, 142, 158, 160, 161, 163, 165, 174, 193, 231, 236, 239, 254, 295, 319, 341, 364, 419, 426, 646, 707, 1119
- Construct a modified boxplot.
Lower quartile = 41, upper quartile = 231
iqr = 231 - 41 = 190; 1.5 iqr = 285; 3 iqr = 570
upper quartile + 1.5 iqr = 516; upper quartile + 3 iqr = 801
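The outlier cutoffs and whisker ends for the gross state product data can be worked out with a short Python sketch. It assumes the quartile rule from Section 4.2 (halves of the ordered data, overall median excluded when n is odd). The function and key names are illustrative.

```python
gsp = [16, 17, 18, 20, 21, 24, 30, 31, 32, 34, 40, 40, 41, 48, 52,
       54, 60, 62, 62, 63, 77, 82, 85, 100, 105, 107,
       110, 129, 134, 142, 142, 158, 160, 161, 163, 165, 174, 193,
       231, 236, 239, 254, 295, 319, 341, 364, 419, 426, 646, 707, 1119]


def median_of_sorted(ordered):
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def modified_boxplot_summary(observations):
    """Quartiles, outlier lists, and whisker ends for a modified boxplot."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    lower_q = median_of_sorted(ordered[:half])
    upper_q = median_of_sorted(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = upper_q - lower_q

    def distance_from_box(x):
        # Distance to the nearest end of the box (0 if inside the box).
        return max(lower_q - x, x - upper_q, 0)

    extreme = [x for x in ordered if distance_from_box(x) > 3 * iqr]
    mild = [x for x in ordered if 1.5 * iqr < distance_from_box(x) <= 3 * iqr]
    not_outliers = [x for x in ordered if distance_from_box(x) <= 1.5 * iqr]
    return {
        "lower_quartile": lower_q,
        "median": median_of_sorted(ordered),
        "upper_quartile": upper_q,
        "iqr": iqr,
        "whiskers": (not_outliers[0], not_outliers[-1]),
        "mild_outliers": mild,
        "extreme_outliers": extreme,
    }


summary = modified_boxplot_summary(gsp)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With cutoffs at 516 and 801, the values 646 and 707 are mild outliers and 1119 is an extreme outlier, so the upper whisker stops at 426. There are no outliers on the low side, so the lower whisker ends at 16.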
4.4 Interpreting Center and Variability: Chebyshev's Rule, the Empirical Rule, and z Scores
- Combine the mean and standard deviation to obtain informative statements about
- how the values in a data set are distributed
- the relative position of a particular value in a data set.
- It is useful to describe how far away a particular observation is from the mean in terms of the standard deviation.
- Consider a data set of scores on a standardized test with a mean and standard deviation of 100 and 15, respectively:
    70           85           100    115          130
    2 sd below   1 sd below   Mean   1 sd above   2 sd above
4.4 Interpreting Center and Variability: Chebyshev's Rule
- Consider any number k, where k > 1. Then the percentage of observations that are within k standard deviations of the mean is at least $100\left(1 - \dfrac{1}{k^2}\right)\%$
Example: Child Care Time for Preschool Kids
- An article examined various modes of care for preschool children. For a sample of families with one child, the mean and standard deviation of child care time per week were approximately 36 hours and 12 hours, respectively. Use Chebyshev's Rule to
- 1. display values that are 1, 2, and 3 standard deviations from the mean.
- 2. determine what percentage of observations are between 12 and 60 hours.
- 3. determine what percentage of observations exceed 72.
- 4. determine what percentage of observations are less than 18.
Answer to the Child Care Example
- 1. Within 1 standard deviation of the mean: (24, 48). (36 ± 12)
- Within 2 standard deviations of the mean: (12, 60). (36 ± 12 × 2)
- Within 3 standard deviations of the mean: (0, 72). (36 ± 12 × 3)
- 2. The observations between 12 and 60 are within 2 standard deviations of the mean, and therefore k = 2. By Chebyshev's Rule, at least 100(1 - 1/2²)% = 75% must be between 12 and 60 hours.
- 3. By Chebyshev's Rule (k = 3), at least 89% of the observations must be between 0 and 72, and therefore at most 11% are outside this interval. Time cannot be negative, so at most 11% of the observations exceed 72.
- 4. The values 18 and 54 are 1.5 standard deviations from the mean, so using k = 1.5 in Chebyshev's Rule implies that at least 55.6% of the observations must be between these two values. Thus at most 44.4% of the observations are less than 18. If the distribution of values is symmetric, then at most 22.2% are less than 18.
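The percentages in these answers all come from the bound 100(1 - 1/k²)%. A minimal Python sketch, with an illustrative function name, reproduces them for the child care example (mean 36 hours, standard deviation 12 hours).

```python
def chebyshev_at_least(k):
    """Lower bound (in percent) on the share of observations within k sd of the mean."""
    if k < 1:
        raise ValueError("Chebyshev's Rule is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)


mean_hours, sd_hours = 36, 12

# Question 2: the interval (12, 60) is mean +/- 2 sd.
k_q2 = (60 - mean_hours) / sd_hours
# Question 3: the interval (0, 72) is mean +/- 3 sd.
k_q3 = (72 - mean_hours) / sd_hours
# Question 4: the interval (18, 54) is mean +/- 1.5 sd.
k_q4 = (mean_hours - 18) / sd_hours

for k in (k_q2, k_q3, k_q4):
    inside = chebyshev_at_least(k)
    print(f"k = {k}: at least {inside:.1f}% inside, at most {100 - inside:.1f}% outside")
```

The bound is a guarantee that holds for any data set, which is why it is often far below the percentage actually observed.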
Example: IQ Score
- The following is a stem-and-leaf display of IQ scores of 112 children. (Summary quantities: mean = 104.5, standard deviation = 16.3.) (All observations within two standard deviations of the mean are shown in blue.)
     6 | 1
     7 | 25679
     8 | 0000124555668
     9 | 0000112333446666778889
    10 | 0001122222333566677778899999
    11 | 00001122333344444477899
    12 | 01111123445669
    13 | 006
    14 | 26
    15 | 2
    Stem: Tens
    Leaf: Ones
- Show how Chebyshev's Rule considerably understates the actual percentages.
Solution on next slide
Solution to the Example: IQ Score
- Chebyshev's Rule considerably underestimates the actual percentages. The stem-and-leaf display of IQ scores shows a symmetric shape, and in this case the Empirical Rule (next slide) gives a better estimate.
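The gap between Chebyshev's bound and the actual percentages can be counted directly. In the sketch below the leaves are a reading of the stem-and-leaf display from the previous slide, which is an assumption worth checking against the original figure. The mean 104.5 and standard deviation 16.3 are the summary quantities quoted with the display, not values recomputed from the leaves.

```python
# Stem (tens) -> leaves (ones), as read from the stem-and-leaf display (assumed reading).
stem_and_leaf = {
    6: "1",
    7: "25679",
    8: "0000124555668",
    9: "0000112333446666778889",
    10: "0001122222333566677778899999",
    11: "00001122333344444477899",
    12: "01111123445669",
    13: "006",
    14: "26",
    15: "2",
}
scores = [10 * stem + int(leaf)
          for stem, leaves in stem_and_leaf.items()
          for leaf in leaves]

MEAN, SD = 104.5, 16.3  # summary quantities reported with the display


def count_within(values, k):
    """Number of values within k standard deviations of the mean."""
    return sum(1 for x in values if abs(x - MEAN) <= k * SD)


comparison = {}
for k in (1, 2, 3):
    actual_pct = 100 * count_within(scores, k) / len(scores)
    chebyshev_pct = 100 * (1 - 1 / k ** 2)
    comparison[k] = (actual_pct, chebyshev_pct)
    print(f"k = {k}: actual {actual_pct:.1f}%, Chebyshev at least {chebyshev_pct:.1f}%")
```

Under this reading, 108 of the 112 scores (about 96.4%) lie within 2 standard deviations of the mean, compared with Chebyshev's guarantee of at least 75%.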
The Empirical Rule
- If the histogram of values in a data set can be reasonably well approximated by a normal curve, then
- Approximately 68% of the observations are within 1 standard deviation of the mean.
- Approximately 95% of the observations are within 2 standard deviations of the mean.
- Approximately 99.7% of the observations are within 3 standard deviations of the mean.
Example: Heights of Mothers
- A data set consists of 1052 measurements of heights of mothers. The mean and standard deviation were
- Mean = 62.484 in.
- Standard deviation = 2.390 in.
- A normal curve did provide a good fit to the data.
- Use Chebyshev's Rule and the Empirical Rule to summarize the distribution of mothers' heights.
A summary table on next slide
Answer to the Example: Heights of Mothers
- The data distribution was approximately normal. Therefore, the Empirical Rule is much more successful and informative in this case than Chebyshev's Rule.
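The intervals behind such a summary can be generated from the reported mean and standard deviation alone. The sketch below is illustrative (the function and key names are assumptions). It lists each interval with the Empirical Rule percentage and the Chebyshev bound. It does not reproduce the actual observed percentages, which would require the 1052 measurements.

```python
def interval_summary(mean, sd, ks=(1, 2, 3)):
    """Intervals mean +/- k*sd with Empirical Rule and Chebyshev percentages."""
    empirical = {1: 68.0, 2: 95.0, 3: 99.7}  # approximate, normal-shaped data only
    rows = []
    for k in ks:
        rows.append({
            "k": k,
            "low": mean - k * sd,
            "high": mean + k * sd,
            "empirical_approx": empirical[k],
            "chebyshev_at_least": 100 * (1 - 1 / k ** 2),
        })
    return rows


rows = interval_summary(62.484, 2.390)
for row in rows:
    print(f"within {row['k']} sd: ({row['low']:.3f}, {row['high']:.3f}) in., "
          f"Empirical Rule about {row['empirical_approx']}%, "
          f"Chebyshev at least {row['chebyshev_at_least']:.1f}%")
```

For example, the 2-standard-deviation interval runs from 57.704 in. to 67.264 in. The Empirical Rule suggests about 95% of heights fall there, while Chebyshev's Rule only guarantees at least 75%.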
Measure of Relative Standing: The z Score
- The z score corresponding to a particular value is $z = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$
- The z score tells us how many standard deviations the value is from the mean. It is positive or negative according to whether the value lies above or below the mean.
Example: Relatively Speaking, Which Is the Better Offer?
- Suppose that two graduating seniors, one a marketing major and one an accounting major, are comparing job offers. The accounting major has an offer for $45,000 per year and the marketing student has an offer for $42,500 per year. Which is the better offer relative to their own majors?
- For accounting majors: mean salary = $46,000 with standard deviation = $1,500.
- For marketing majors: mean salary = $43,000 with standard deviation = $1,000.
Answer on the next slide.
Solution to the Example: Relatively Speaking, Which Is the Better Offer?
The accounting major's salary (45,000) is 0.67 standard deviations below the mean salary for accounting majors (z = (45,000 - 46,000)/1,500 = -0.67), while the marketing major's salary (42,500) is 0.5 standard deviations below the mean (z = (42,500 - 43,000)/1,000 = -0.5). Relative to their own majors, the marketing offer is actually a little more attractive.
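The comparison can be reproduced with a small z-score helper. This is a sketch; the function name `z_score` is illustrative, and the figures are the ones given in the example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (value - mean) / sd


z_accounting = z_score(45_000, 46_000, 1_500)
z_marketing = z_score(42_500, 43_000, 1_000)

print(f"accounting offer: z = {z_accounting:.2f}")
print(f"marketing offer:  z = {z_marketing:.2f}")
better = "marketing" if z_marketing > z_accounting else "accounting"
print(f"relatively better offer: {better}")
```

Both offers are below their group means, but the marketing offer has the larger (less negative) z score, so it sits higher within its own salary distribution.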