Numerical Methods for Describing Data - PowerPoint PPT Presentation

Provided by: shish

1
Chapter 4
  • Numerical Methods for Describing Data

2
4.1 Describing the Center of a Data Set
  • Sample Mean
  • x = the variable for which we have sample data
  • n = the number of observations in the sample
    (sample size)
  • x1, x2, . . ., xn = the observations in the
    sample
  • The sample mean, denoted by x̄, is
    $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$
  • Population Mean
  • The population mean, denoted by µ, is the
    average of all x values in the entire population

3
4.1 Describing the Center of a Data Set
  • Example: Traumatic knee dislocation often
    requires surgery to repair ruptured ligaments.
    One measure of recovery is range of motion
    (measured as the angle formed when, starting with
    the leg straight, the knee is bent as far as
    possible). The following are postsurgical ranges
    of motion (in degrees) for a sample of 13
    patients:
  • 154, 142, 137, 133, 122, 126, 135, 135, 108,
    120, 127, 134, 122
  • Find the sample mean range of motion.
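
The requested calculation can be sketched in a few lines of Python. This is only an illustration of the sample-mean formula applied to the 13 values listed above; the variable names are chosen here and are not part of the slides.

```python
# Postsurgical range of motion (degrees) for the 13 patients in the example.
range_of_motion = [154, 142, 137, 133, 122, 126, 135, 135, 108,
                   120, 127, 134, 122]

n = len(range_of_motion)        # sample size
total = sum(range_of_motion)    # x1 + x2 + ... + xn
sample_mean = total / n         # x-bar = (sum of observations) / n

print(f"n = {n}, sum = {total}, sample mean = {sample_mean:.2f} degrees")
```

With these values the sum is 1695, so the sample mean is 1695/13, or about 130.38 degrees.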

4
Example: Three Samples from the Population of All
US Counties (x = number of residents)
  • Mean = 41,639.2
  • Mean = 202,769.0

5
Example: Three Samples from the Population of All
US Counties (x = number of residents)
  • Population
  • The population of the US is 293,655,404 (2004
    census), and there are 3137 counties in the US.
  • Population Mean
  • µ = ?

Mean = ?
The mean of Sample 3 is 23,643.6. The population
mean is µ = 293,655,404 / 3137 ≈ 93,610.27.
6
Example: Number of Visits to a Class Web Site:
The Mean
  • Forty students were enrolled in a section of
    statistics class. One month after the course
    began, the instructor requested a report that
    indicated how many times each student had
    accessed the class web site. The 40 observations
    were listed on the right.
  • Find the mean.
  • Does the mean well represent the center of the
    data set?


Answers: 1. The mean is 23.10. 2. This mean is not
a very representative value for the center of this
sample, because it is larger than most of the
observations in the data. The outlying values 331
and, to a lesser extent, 84 have a substantial
impact on the mean.
7
4.1 Describing the Center of a Data Set
  • The sample median is obtained by
  • first ordering the n observations from smallest
    to largest (with any repeated values included, so
    that every sample observation appears in the
    ordered list).
  • Then
  • sample median = the single middle value if n is
    odd, OR
  • the average of the middle two values if n is
    even.
  • Examples: (a) The median of the data set 5, 3,
    7, 9, 11, 4, 16 is 7, which is the single middle
    value of the ordered observations 3, 4, 5, 7, 9,
    11, 16. (In this sample, n = 7.)
  • (b) The median of the data set 7, 3, 9, 5, 11,
    4, 16, 10 is 8, which is the average of the two
    middle values of the ordered observations 3, 4,
    5, 7, 9, 10, 11, 16: ½ (7 + 9) = 8.
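
The two-step definition above (order the data, then take the middle value or the average of the two middle values) can be sketched as follows. The function name `sample_median` is an illustrative choice, not something from the slides.

```python
def sample_median(observations):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(observations)   # step 1: order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                   # n odd: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: average of middle two

median_a = sample_median([5, 3, 7, 9, 11, 4, 16])      # example (a), n = 7
median_b = sample_median([7, 3, 9, 5, 11, 4, 16, 10])  # example (b), n = 8
print(median_a, median_b)
```

Applied to examples (a) and (b), this reproduces the medians 7 and 8 given above.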

8
Example: Number of Visits to a Class Web Site:
The Median
  • Forty students were enrolled in a section of
    statistics class. One month after the course
    began, the instructor requested a report that
    indicated how many times each student had
    accessed the class web site. The 40 observations
    were listed on the right.
  • Find the median.
  • Does the median better represent the center of
    the data set than the mean?

Answer on the next slide
9
Solution to the Example: Number of Visits to a
Class Web Site: The Median
  • First arrange the data from the smallest to the
    largest.

The median is the average of the middle two
values: (13 + 13)/2 = 13. The median better
represents the center of this data set than the
mean (23.10).
10
Comparing the Mean and the Median
  • The mean can be sensitive to even a single value
    that lies far above or below the rest of the data
    (an outlier). Therefore, the mean number of
    website visits is not a very representative value
    of that sample because of the outlying value 331.
  • The median is quite insensitive to outliers.
  • If the histogram of the data set is symmetric,
    mean = median.
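
The website-visit data are not reproduced in this transcript, so the sketch below illustrates the same point with the gross state product data that appear in the boxplot example later in this chapter. It compares the mean and the median with and without the largest value (1119). Using that data set here is a choice made for illustration, not something the slides do.

```python
from statistics import mean, median

# Gross state product (billions of dollars), in ascending order.
gsp = [16, 17, 18, 20, 21, 24, 30, 31, 32, 34, 40, 40, 41, 48, 52,
       54, 60, 62, 62, 63, 77, 82, 85, 100, 105, 107,
       110, 129, 134, 142, 142, 158, 160, 161, 163,
       165, 174, 193, 231, 236, 239, 254, 295, 319, 341,
       364, 419, 426, 646, 707, 1119]

without_largest = gsp[:-1]   # drop the single largest observation (1119)

mean_shift = mean(gsp) - mean(without_largest)
median_shift = median(gsp) - median(without_largest)

print(f"mean:   {mean(gsp):.1f} -> {mean(without_largest):.1f}")
print(f"median: {median(gsp)} -> {median(without_largest)}")
```

Removing one observation moves the mean by roughly 19 units but the median by only 1, which is the sensitivity difference the slide describes.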

11
4.2 Describing Variability in a Data Set
  • The range of a data set is defined as
  • range = largest observation − smallest
    observation
  • The n deviations from the sample mean are the
    differences
    $x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x}$
  • Except for the effects of rounding in computing
    the deviations, it is always true that
    $\sum (x - \bar{x}) = 0$

12
4.2 Describing Variability in a Data Set
  • The sample variance, denoted by s², is defined by
    $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
  • The sample standard deviation is the positive
    square root of the sample variance and is denoted
    by s.

The mean, median, sample variance, and standard
deviation are all available in Excel by
clicking Data → Data Analysis → Descriptive
Statistics. In the dialog box, choose Summary
statistics.
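
For readers working outside Excel, the same quantities can be computed directly from the definitions. The sketch below reuses the 13 knee range-of-motion values from Section 4.1 (the acrylamide and website data are not reproduced in this transcript) and cross-checks the hand computation against Python's standard `statistics` module.

```python
from math import sqrt
from statistics import variance, stdev

# Knee range-of-motion data (degrees) from the Section 4.1 example.
x = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]

n = len(x)
x_bar = sum(x) / n

deviations = [xi - x_bar for xi in x]   # x_i - x_bar for each observation
sum_of_deviations = sum(deviations)     # zero, up to floating-point rounding

s2 = sum(d ** 2 for d in deviations) / (n - 1)   # sample variance (divide by n - 1)
s = sqrt(s2)                                     # sample standard deviation

print(f"range = {max(x) - min(x)}")
print(f"sum of deviations = {sum_of_deviations:.10f}")
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
print(f"statistics module: s^2 = {variance(x):.2f}, s = {stdev(x):.2f}")
```

Note the divisor n − 1 rather than n; `statistics.variance` and `statistics.stdev` use the same sample (n − 1) convention.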
13
4.2 Describing Variability in a Data Set
  • Example: Research by the FDA shows that acrylamide
    (a possible cancer-causing substance) forms in
    high-carbohydrate foods cooked at high temperatures
    and that acrylamide levels can vary widely even
    within the same brand of food. FDA scientists
    analyzed McDonald's French fries purchased at seven
    different locations and found the acrylamide
    levels shown on the right.
  • What is the range?
  • What are the deviations?
  • What is the sample variance?
  • What is the standard deviation?

Answers on the next slide.
14
Answers to the example of acrylamide data
1. Range = 497 − 155 = 342
(See the table on the right.)
15
Number of Visits to a Class Web Site
  • Forty students were enrolled in a section of
    statistics class. One month after the course
    began, the instructor requested a report that
    indicated how many times each student had
    accessed the class web site. The 40 observations
    were listed on the right.
  • Find the sample variance and standard
    deviation.
  • We use Excel to solve the problem on the next
    three slides.

16
  • Input data
  • Click Data
  • Click Data Analysis
  • Choose Descriptive Statistics
  • Click OK

17
  • A dialog box will open.
  • Enter A1:A40 in the Input Range.
  • Check Summary statistics.
  • Click OK.

18
  • Excel gives the results for Mean, Median, Sample
    Variance, Standard Deviation, and many others.
    (You may need to adjust the width of the columns.)

19
Describing Variability in a Data Set
Interquartile Range
  • Lower quartile = median of the lower half of the
    sample
  • Upper quartile = median of the upper half of the
    sample
  • (If n is odd, the median of the entire sample is
    excluded from both halves.)
  • The interquartile range (iqr), a measure of
    variability that is not as sensitive to the
    presence of outliers as the standard deviation,
    is given by
  • iqr = upper quartile − lower quartile

20
Example: Hospital Cost-to-Charge Ratios
  • The cost-to-charge ratio is computed as the ratio
    of the actual cost of care to what the hospital
    actually bills, and the ratio is usually
    expressed as a percentage. A cost-to-charge ratio
    of 60 means that the actual cost is 60% of what
    was billed. The ratios for 31 hospitals in Oregon
    for inpatient services, published by the Oregon
    Department of Health in 2002, were
  • 68, 76, 60, 88, 69, 80, 75, 67, 71, 100, 63,
    62
  • 71, 74, 64, 48, 100, 72, 65, 50, 72, 100, 63,
    45
  • 54, 60, 75, 57, 74, 84, 83
  • Find the lower quartile, upper quartile, and
    interquartile range (iqr).

21
Solution to the Example: Hospital Cost-to-Charge
Ratios
  • Ordered data
  • Lower half: 45 48 50 54 57 60 60 62 63 63 64 65
    67 68 69
  • Median = 71
  • Upper half: 71 72 72 74 74 75 75 76 80 83 84 88
    100 100 100
  • Lower quartile = 62
  • Upper quartile = 76
  • iqr = 76 − 62 = 14
  • The mean = 70.65
  • The standard deviation = 14.11
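
The quartile definition used here (medians of the lower and upper halves, with the overall median excluded when n is odd) differs from the interpolation rules many software packages use by default. A minimal sketch that follows the slide's definition is shown below; the helper name `quartiles` is hypothetical.

```python
from statistics import mean, median, stdev

ratios = [68, 76, 60, 88, 69, 80, 75, 67, 71, 100, 63, 62,
          71, 74, 64, 48, 100, 72, 65, 50, 72, 100, 63, 45,
          54, 60, 75, 57, 74, 84, 83]

def quartiles(observations):
    """Lower/upper quartile = median of the lower/upper half of the ordered data.

    When n is odd, the middle observation belongs to neither half.
    """
    ordered = sorted(observations)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(upper_half)

lower_q, upper_q = quartiles(ratios)
iqr = upper_q - lower_q

print(f"lower quartile = {lower_q}, upper quartile = {upper_q}, iqr = {iqr}")
print(f"mean = {mean(ratios):.2f}, standard deviation = {stdev(ratios):.2f}")
```

This reproduces the values on the slide: quartiles 62 and 76, iqr 14, mean 70.65, and standard deviation 14.11.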

22
4.3 Summarizing a Data Set: Boxplots
  • Construction of a Skeletal Boxplot
  • Draw a horizontal measurement scale
  • Construct a rectangular box with a left (or
    lower) edge at the lower quartile and a right (or
    upper) edge at the upper quartile. The box width
    is then equal to the iqr.
  • Draw a vertical line segment inside the box at
    the location of the median.
  • Extend horizontal line segments, called whiskers,
    from each end of the box to the smallest and
    largest observations in the data set.

23
Boxplot for Hospital Cost-to-Charge Ratio
  • Example: Revisiting Hospital Cost-to-Charge
    Ratios
  • Lower half: 45 48 50 54 57 60 60 62 63 63 64 65
    67 68 69
  • Median = 71
  • Upper half: 71 72 72 74 74 75 75 76 80 83 84 88
    100 100 100
  • Construct a modified boxplot of the data.
  • smallest observation = 45, largest observation =
    100
  • lower quartile = 62, upper quartile = 76
  • median = 71

24
Modified Boxplot
  • An observation is an outlier if it is more than
    1.5 iqr away from the nearest quartile (the
    nearest end of the box).
  • An outlier is extreme if it is more than 3 iqr
    from the nearest end of the box.
  • A modified boxplot
  • Extend whiskers from each end of the box to the
    most extreme observation that is not an outlier.
  • Draw a solid circle to mark the location of any
    mild outlier.
  • Draw an open circle to mark the location of any
    extreme outliers in the data set.

25
Example: Summarizing a Data Set
  • The 51 observations for gross state product (in
    billions of dollars) for the 50 states and D.C.
    are as follows:
  • 16, 17, 18, 20, 21, 24, 30, 31, 32, 34, 40, 40,
    41, 48, 52, 54, 60, 62, 62, 63, 77, 82, 85, 100,
    105, 107, 110, 129, 134, 142, 142, 158, 160, 161,
    163, 165, 174, 193, 231, 236, 239, 254, 295, 319,
    341, 364, 419, 426, 646, 707, 1119
  • Construct a modified boxplot.

Lower quartile = 41, upper quartile = 231
iqr = 231 − 41 = 190; 1.5 iqr = 285, 3 iqr = 570
upper quartile + 1.5 iqr = 516, upper quartile +
3 iqr = 801
So 646 and 707 are mild outliers, 1119 is an
extreme outlier, and the upper whisker ends at 426.
Since lower quartile − 1.5 iqr is negative, there
are no outliers at the low end.
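
The outlier rules for a modified boxplot can be checked numerically. The sketch below applies the 1.5 iqr and 3 iqr cutoffs to the gross state product data. The function and variable names are illustrative, and the quartiles follow the "median of each half" definition used in this chapter.

```python
from statistics import median

# Gross state product (billions of dollars) for the 50 states and D.C.
gsp = [16, 17, 18, 20, 21, 24, 30, 31, 32, 34, 40, 40, 41, 48, 52,
       54, 60, 62, 62, 63, 77, 82, 85, 100, 105, 107,
       110, 129, 134, 142, 142, 158, 160, 161, 163,
       165, 174, 193, 231, 236, 239, 254, 295, 319, 341,
       364, 419, 426, 646, 707, 1119]

ordered = sorted(gsp)
half = len(ordered) // 2
q1 = median(ordered[:half])                    # lower quartile
q3 = median(ordered[len(ordered) - half:])     # upper quartile
iqr = q3 - q1

def classify(x):
    """Label an observation by its distance from the nearest end of the box."""
    distance = max(q1 - x, x - q3, 0)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "mild"
    return "not an outlier"

mild = [x for x in ordered if classify(x) == "mild"]
extreme = [x for x in ordered if classify(x) == "extreme"]
inside = [x for x in ordered if classify(x) == "not an outlier"]
whiskers = (min(inside), max(inside))   # whiskers stop at the last non-outliers

print(f"q1 = {q1}, q3 = {q3}, iqr = {iqr}")
print(f"mild outliers: {mild}, extreme outliers: {extreme}")
print(f"whiskers run from {whiskers[0]} to {whiskers[1]}")
```

The whiskers end at the most extreme observations that are not outliers, not at the cutoff values 516 and 801 themselves.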
26
4.4 Interpreting Center and Variability:
Chebyshev's Rule, the Empirical Rule, and z Scores
  • Combine the mean and standard deviation to obtain
    informative statements about
  • how the values in a data set are distributed
  • the relative position of a particular value in a
    data set.
  • It is useful to describe how far away a
    particular observation is from the mean in terms
    of the standard deviation.
  • Example: a data set of scores on a standardized
    test with a mean of 100 and a standard deviation
    of 15:

    Score      70      85      100     115     130
    Position   2 sd    1 sd    Mean    1 sd    2 sd
               below   below           above   above

27
4.4 Interpreting Center and Variability:
Chebyshev's Rule
  • Consider any number k, where k > 1. Then the
    percentage of observations that are within k
    standard deviations of the mean is at least
    $100\left(1 - \frac{1}{k^2}\right)\%$
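
As a quick illustration, Chebyshev's lower bound, 100(1 − 1/k²)%, can be tabulated for a few values of k. The function name below is hypothetical, and the guard reflects the rule as stated here, for k greater than 1.

```python
def chebyshev_min_percent(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule as stated here needs k > 1")
    return 100 * (1 - 1 / k ** 2)

for k in (1.5, 2, 2.5, 3):
    print(f"k = {k}: at least {chebyshev_min_percent(k):.1f}% within {k} sd of the mean")
```

The bound holds for any data set regardless of shape, which is why it is often conservative.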

28
Example: Child Care Time for Preschool Kids
  • An article examined various modes of care for
    preschool children. For a sample of families with
    one child, the mean and standard deviation of
    child care time per week were approximately 36
    hours and 12 hours, respectively. Use Chebyshev's
    Rule to
  • 1. display values that are 1, 2, and 3 standard
    deviations from the mean.
  • 2. determine what percentage of observations are
    between 12 and 60 hours.
  • 3. determine what percentage of observations exceed
    72.
  • 4. determine what percentage of observations are
    less than 18.

29
Answer to the Child Care Example
  • 1. 1 standard deviation from the mean is (24, 48)
    (36 ± 12).
  • 2 standard deviations from the mean is (12, 60)
    (36 ± 12 × 2).
  • 3 standard deviations from the mean is (0, 72)
    (36 ± 12 × 3).
  • 2. The observations between 12 and 60 are within 2
    standard deviations of the mean, so k = 2. By
    Chebyshev's Rule, at least
    $100\left(1 - \frac{1}{2^2}\right)\% = 75\%$
    of the observations must be between 12 and 60
    hours.
  • 3. By Chebyshev's Rule, at least 89% of the
    observations must be between 0 and 72, and
    therefore at most 11% are outside this interval.
    Time cannot be negative, so at most 11% of the
    observations exceed 72.
  • 4. The values 18 and 54 are 1.5 standard
    deviations from the mean, so using k = 1.5 in
    Chebyshev's Rule implies that at least 55.6% of
    the observations must be between these two values.
    Thus at most 44.4% of the observations are less
    than 18. If the distribution of values is
    symmetric, then at most 22.2% are less than 18.
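
The arithmetic in these answers can be retraced with a short sketch. It converts each endpoint into a number of standard deviations k and then applies Chebyshev's bound. The helper names are illustrative only.

```python
mean_hours, sd_hours = 36, 12   # summary values given in the example

def chebyshev_min_percent(k):
    """At least this percentage lies within k standard deviations of the mean."""
    return 100 * (1 - 1 / k ** 2)

def sds_from_mean(value):
    """How many standard deviations a value lies from the mean."""
    return abs(value - mean_hours) / sd_hours

# Part 1: intervals 1, 2, and 3 standard deviations around the mean.
intervals = {k: (mean_hours - k * sd_hours, mean_hours + k * sd_hours)
             for k in (1, 2, 3)}

# Part 2: 12 and 60 are k = 2 standard deviations from the mean.
between_12_and_60 = chebyshev_min_percent(sds_from_mean(60))

# Part 3: 72 is k = 3 away; at most the remainder can fall outside (0, 72).
at_most_above_72 = 100 - chebyshev_min_percent(sds_from_mean(72))

# Part 4: 18 is k = 1.5 away; at most the remainder falls outside (18, 54).
at_most_below_18 = 100 - chebyshev_min_percent(sds_from_mean(18))

print(intervals)
print(f"at least {between_12_and_60:.1f}% between 12 and 60")
print(f"at most {at_most_above_72:.1f}% above 72")
print(f"at most {at_most_below_18:.1f}% below 18")
```

The unrounded bounds are 88.9% and 11.1% for k = 3; the slide rounds these to 89% and 11%.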

30
Example: IQ Score
  • The following is a stem-and-leaf display of IQ
    scores of 112 children. (Summary quantities: mean
    = 104.5, standard deviation = 16.3.) (All
    observations within two standard deviations of
    the mean are shown in blue.)

     6 | 1
     7 | 25679
     8 | 0000124555668
     9 | 0000112333446666778889
    10 | 0001122222333566677778899999
    11 | 00001122333344444477899
    12 | 01111123445669
    13 | 006
    14 | 26
    15 | 2
    Stem: Tens   Leaf: Ones
  • Show how Chebyshev's Rule considerably
    understates the actual percentages.

Solution on next slide
31
Solution to the Example: IQ Score
  • Chebyshev's Rule considerably underestimates the
    actual percentages. The stem-and-leaf display of
    IQ scores shows a symmetric shape, and in this
    case the Empirical Rule (next slide) gives a
    better estimate.
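
The comparison can be made concrete by counting observations directly. The sketch below assumes one reading of the flattened stem-and-leaf display: ten stems from 6 to 15, with leaf rows split so that they total 112 scores. That split is an assumption, although the resulting scores do average about 104.5, in line with the stated mean.

```python
# Leaves per stem, as read from the stem-and-leaf display (assumed row split).
stem_and_leaf = {
    6: "1",
    7: "25679",
    8: "0000124555668",
    9: "0000112333446666778889",
    10: "0001122222333566677778899999",
    11: "00001122333344444477899",
    12: "01111123445669",
    13: "006",
    14: "26",
    15: "2",
}
iq_scores = [10 * stem + int(leaf)
             for stem, leaves in stem_and_leaf.items()
             for leaf in leaves]

mean_iq, sd_iq = 104.5, 16.3   # summary quantities given in the example

within = {}   # k -> number of scores within k standard deviations of the mean
for k in (2, 3):
    low, high = mean_iq - k * sd_iq, mean_iq + k * sd_iq
    within[k] = sum(low <= score <= high for score in iq_scores)
    actual = 100 * within[k] / len(iq_scores)
    chebyshev = 100 * (1 - 1 / k ** 2)
    print(f"k = {k}: actual {actual:.1f}% vs. Chebyshev's bound of at least {chebyshev:.1f}%")
```

Under this reading, the observed percentages sit well above Chebyshev's guaranteed minimums, which is the sense in which the rule understates them.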

32
The Empirical Rule
  • If the histogram of values in a data set can be
    reasonably well approximated by a normal curve,
    then
  • Approximately 68% of the observations are within
    1 standard deviation of the mean.
  • Approximately 95% of the observations are within
    2 standard deviations of the mean.
  • Approximately 99.7% of the observations are
    within 3 standard deviations of the mean.

33
Example: Heights of Mothers
  • A data set consists of 1052 measurements of
    heights of mothers. The mean and standard
    deviation were
  • Mean = 62.484 in.
  • Standard deviation = 2.390 in.
  • A normal curve did provide a good fit to the
    data.
  • Use Chebyshev's Rule and the Empirical Rule to
    summarize the distribution of mothers' heights.

A summary table on next slide
34
Answer to the Example: Heights of Mothers
  • The data distribution was approximately normal.
    Therefore, the Empirical Rule is much more
    successful and informative in this case than
    Chebyshev's Rule.
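
The summary table from the slide is not reproduced in this transcript. The sketch below rebuilds only what follows from the stated mean and standard deviation: the intervals and what each rule says about them. The observed percentages from the original table are not available here and are not guessed.

```python
mean_height, sd_height = 62.484, 2.390   # inches, as given in the example

empirical_percent = {1: 68, 2: 95, 3: 99.7}   # Empirical Rule (approximate)

rows = []
for k in (1, 2, 3):
    low = mean_height - k * sd_height
    high = mean_height + k * sd_height
    # Chebyshev's Rule is stated for k > 1, so it says nothing useful at k = 1.
    chebyshev = 100 * (1 - 1 / k ** 2) if k > 1 else None
    rows.append((k, round(low, 3), round(high, 3), chebyshev, empirical_percent[k]))

for k, low, high, chebyshev, empirical in rows:
    cheb_text = "no bound" if chebyshev is None else f"at least {chebyshev:.1f}%"
    print(f"within {k} sd: {low} to {high} in. | Chebyshev: {cheb_text} "
          f"| Empirical: about {empirical}%")
```

Because a normal curve fits these data well, the Empirical Rule's approximate percentages are far more informative than Chebyshev's lower bounds.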

35
Measure of Relative Standing: The z Score
  • The z score corresponding to a particular value
    is
    $z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$
  • The z score tells us how many standard deviations
    the value is from the mean. It is positive or
    negative according to whether the value lies
    above or below the mean.

36
Example: Relatively Speaking, Which Is the Better
Offer?
  • Suppose that two graduating seniors, one a
    marketing major and one an accounting major, are
    comparing job offers. The accounting major has an
    offer for $45,000 per year and the marketing
    student has an offer for $42,500 per year. Which
    is the better offer relative to their own majors?
  • For accounting majors: mean salary = $46,000
    with standard deviation = $1500.
  • For marketing majors: mean salary = $43,000
    with standard deviation = $1000.

Answer on the next slide.
37
Solution to the Example: Relatively Speaking,
Which Is the Better Offer?
The accounting major's offer (45,000) is 0.67
standard deviations below the mean salary for
accounting majors, while the marketing major's
offer (42,500) is 0.5 standard deviations below
the mean for marketing majors. Relative to their
own majors, the marketing offer is actually a
little more attractive.
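
The two z scores behind this comparison can be computed as follows. This is a minimal sketch with the function name `z_score` chosen for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z_accounting = z_score(45_000, 46_000, 1_500)   # accounting offer vs. accounting majors
z_marketing = z_score(42_500, 43_000, 1_000)    # marketing offer vs. marketing majors

print(f"accounting z = {z_accounting:.2f}, marketing z = {z_marketing:.2f}")
better = "marketing" if z_marketing > z_accounting else "accounting"
print(f"relatively better offer: {better}")
```

Both z scores are negative (about −0.67 and −0.50). The marketing offer is closer to its own mean in standard deviation units, so it is the relatively better offer.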