Describing, Exploring, and Comparing Data - PowerPoint PPT Presentation

Slides: 82
Provided by: wild6

Transcript and Presenter's Notes

Title: Describing, Exploring, and Comparing Data


1
Describing, Exploring, and Comparing Data
"There are way too many numbers. The world would
be a better place if we lost half of them --
starting with 8. I've always hated 8."
-- Homer J. Simpson
2
Overview
3
Planning and Conducting a Study
  • Understand the Nature of the Problem
  • Decide What to Measure and How to Measure It
  • Data Collection
  • Data Summarization and Preliminary Analysis
  • Formal Data Analysis
  • Interpretation of Results

Statistics: The Exploration and Analysis of Data,
4th ed., Devore/Peck
4
Important Characteristics of Data
  • Center: A representative or average value that
    indicates where the middle of the data set is
    located.
  • Variation: A measure of the amount that the data
    values vary among themselves.
  • Distribution: The nature or shape of the
    distribution of the data.
  • Outliers: Sample values that lie very far away
    from the vast majority of the other sample
    values.
  • Time: Changing characteristics of the data over
    time.

5
Two Branches of Statistics
  • Descriptive statistics: methods used to
    summarize or describe the important
    characteristics of a set of data.
  • Inferential statistics: methods used with sample
    data to make inferences (or generalizations)
    about a population.

6
Frequency Distributions
7
Definition
  • A frequency distribution (or frequency table)
    lists data values (either individually or by
    groups of intervals), along with their
    corresponding frequencies (or counts).

8
More Definitions
  • Lower class limits are the smallest numbers that
    can belong to the different classes.
  • Upper class limits are the largest numbers that
    can belong to the different classes.
  • Class boundaries are the numbers used to separate
    classes, but without the gaps created by class
    limits.
  • Class midpoints are the values in the middle of
    the classes. Each class midpoint can be found by
    adding the lower class limit to the upper class
    limit and dividing the sum by 2.
  • Class width is the difference between two
    consecutive lower class limits or two consecutive
    lower class boundaries.

9
Procedure for Constructing a Frequency
Distribution
  • Decide on the number of classes needed.
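  • Calculate the class width:
    $\text{class width} \approx \frac{\text{maximum value} - \text{minimum value}}{\text{number of classes}}$.
    Round this result to get a convenient number.
  • Starting point: Begin by choosing a number for
    the lower limit of the first class.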
  • Using the lower limit of the first class and the
    class width, proceed to list the other lower
    class limits.
  • List the lower class limits in a vertical column
    and proceed to enter the upper class limits.
  • Go through the data set putting a tally in the
    appropriate class for each data value. Use the
    tally marks to find the total frequency for each
    class.
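The procedure above can be sketched in Python. This is only an illustration under stated assumptions: the data are whole numbers (so class limits step by 1), the first lower class limit is taken to be the minimum value, and the class width is rounded up. The function name `frequency_distribution` and the sample `lengths` are hypothetical; the bear data set itself is not reproduced in this transcript.

```python
import math

def frequency_distribution(values, num_classes):
    """Return a list of (lower_limit, upper_limit, frequency) tuples.

    Assumes whole-number data and starts the first class at the minimum.
    """
    lowest, highest = min(values), max(values)
    # Class width: (max - min) / number of classes, rounded up.
    width = math.ceil((highest - lowest) / num_classes)
    # If the classes would stop just short of the maximum, widen them by 1.
    if lowest + width * num_classes <= highest:
        width += 1
    table = []
    for i in range(num_classes):
        lower = lowest + i * width          # consecutive lower class limits
        upper = lower + width - 1           # matching upper class limit
        freq = sum(1 for v in values if lower <= v <= upper)  # the "tally"
        table.append((lower, upper, freq))
    return table

# Hypothetical sample, not the bear data.
lengths = [36, 41, 43, 46, 47, 48, 50, 52, 53, 53,
           57, 58, 59, 60, 61, 63, 64, 65, 67, 72]
table = frequency_distribution(lengths, 4)
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```

With this sample the four classes are 36-45, 46-55, 56-65, and 66-75, and the frequencies add up to the number of data values.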

10
Example
  • Use Data Set 6: Bears, and construct a frequency
    distribution for the lengths of bears using 11
    classes.

11
Example (continued)
12
Example (continued)
13
Relative Frequency Distribution
  • A relative frequency distribution includes the
    same class limits as a frequency distribution,
    but relative frequencies (a relative frequency is
    found by dividing a class frequency by the total
    frequency) are used instead of actual frequencies.

14
Example
  • Use Data Set 6: Bears, and construct a relative
    frequency distribution for the lengths of bears
    using 11 classes.

15
Example (continued)
16
Cumulative Frequency Distribution
  • A cumulative frequency distribution includes the
    same class limits as a frequency distribution,
    but cumulative frequencies (a cumulative
    frequency for a class is the sum of the
    frequencies for that class and all previous
    classes) are used instead of actual frequencies.
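As a small sketch of how relative and cumulative frequencies follow from a frequency distribution, the Python below starts from a list of class frequencies. The class labels and counts are hypothetical values chosen for illustration.

```python
from itertools import accumulate

# Hypothetical frequency distribution (class limits and counts are made up).
classes = ["36-45", "46-55", "56-65", "66-75"]
frequencies = [3, 7, 8, 2]

total = sum(frequencies)
# Relative frequency: class frequency divided by the total frequency.
relative = [f / total for f in frequencies]
# Cumulative frequency: running sum of this class and all previous classes.
cumulative = list(accumulate(frequencies))

for label, f, r, c in zip(classes, frequencies, relative, cumulative):
    print(f"{label}: freq={f}, relative={r:.2f}, cumulative={c}")
```

The relative frequencies sum to 1, and the last cumulative frequency equals the total number of values.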

17
Interpreting Frequency Distributions
  • Is the distribution normal?
  • Do the frequencies start low, then increase to
    some maximum frequency, then decrease to a low
    frequency?
  • Is the distribution approximately symmetric? That
    is, are the frequencies evenly distributed on
    both sides of the maximum frequency?

18
Visualizing Data
19
Histogram
  • A histogram is a bar graph in which the
    horizontal scale represents classes of data
    values and the vertical scale represents
    frequencies. The heights of the bars correspond
    to the frequency values, and the bars are drawn
    adjacent to each other (without gaps).

20
Example
  • Use Data Set 6: Bears, and construct a histogram
    for the lengths of bears using 11 classes.

21
Example (continued)
22
Interpreting Histograms
  • Is the distribution normal?
  • Do the frequencies start low, then increase to
    some maximum frequency, then decrease to a low
    frequency?
  • Is the distribution approximately symmetric? That
    is, are the frequencies evenly distributed on
    both sides of the maximum frequency?

23
Relative Frequency Histogram
  • A relative frequency histogram has the same shape
    and horizontal scale as a histogram, but the
    vertical scale is marked with relative
    frequencies instead of actual frequencies.

24
Example
  • Use Data Set 6: Bears, and construct a relative
    frequency histogram for the lengths of bears
    using 11 classes.

25
Example (continued)
26
Frequency Polygons
  • A frequency polygon uses line segments connected
    to points located directly above class midpoint
    values.
  • A cumulative frequency polygon (or ogive) uses
    line segments connected to points located
    directly above upper class boundaries, with
    heights equal to the cumulative frequencies.

27
Dotplot
  • A dotplot consists of a graph in which each data
    value is plotted as a point (or dot) along a
    scale of values. Dots representing equal values
    are stacked.

28
Stemplot
  • A stemplot (or stem-and-leaf plot) represents
    data by separating each value into two parts:
  • the stem (such as the leftmost digit), and
  • the leaf (such as the rightmost digit).
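A stemplot is easy to build by hand or in code. The sketch below assumes non-negative whole numbers, takes the last digit as the leaf and the remaining digits as the stem, and uses a hypothetical helper `stemplot` with made-up data.

```python
from collections import defaultdict

def stemplot(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 53 -> stem 5, leaf 3
        stems[stem].append(leaf)
    return dict(stems)

data = [36, 41, 43, 46, 52, 53, 53, 60, 61, 72]  # hypothetical values
plot = stemplot(data)
# Print every stem in range, including stems that have no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Repeated values such as 53 appear as repeated leaves, so the original data can be read back from the plot.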

29
Pareto Charts
  • A Pareto chart is a bar graph for qualitative
    data, with the bars arranged in order according
    to frequency. Vertical scales in Pareto charts
    can represent frequencies or relative frequencies.

30
Pie Charts
  • A pie chart is a graph depicting qualitative data
    as slices of a pie.

31
Scatterplots
  • A scatterplot (or scatter diagram) is a plot of
    paired (x, y) data with a horizontal x-axis and a
    vertical y-axis. The data are paired in a way
    that matches each value from one data set with a
    corresponding value from a second data set.

32
Time-Series Graph
  • A time-series graph is a graph of time-series data,
    which are data that have been collected at
    different points in time.

33
Presenting Data Graphically
  • Some important principles:
  • For small data sets of 20 values or fewer, use a
    table instead of a graph.
  • A graph of data should make the viewer focus on
    the true nature of the data, not on other
    elements, such as eye-catching but distracting
    design features.
  • Do not distort the data; construct a graph to
    reveal the true nature of the data.
  • Almost all of the ink in a graph should be used
    for the data, not for other design elements.

The Visual Display of Quantitative Information,
2nd ed. Tufte
34
Presenting Data Graphically
  • Some important principles:
  • Don't use screening consisting of features such
    as slanted lines, dots, or cross-hatching,
    because they create the uncomfortable illusion of
    movement.
  • Don't use areas or volumes for data that are
    actually one-dimensional in nature.
  • Never publish pie charts, because they waste ink
    on non-data components, and they lack an
    appropriate scale.

The Visual Display of Quantitative Information,
2nd ed. Tufte
35
Measures of Center
36
Definition
  • A measure of center is a value at the center or
    middle of a data set.

37
Mean
  • The arithmetic mean of a set of values is the
    measure of center found by adding the values and
    dividing the total by the number of values. It is
    referred to simply as the mean.

38
Notation
  • $\sum$ denotes the sum of a set of values.
  • $x$ is the variable usually used to
    represent the individual data
    values.
  • $n$ represents the number of values in a
    sample.
  • $N$ represents the number of values in a
    population.
  • $\bar{x} = \frac{\sum x}{n}$ is the mean of a set of sample
    values.
  • $\mu = \frac{\sum x}{N}$ is the mean of all values in a
    population.

39
Median
  • The median of a data set is the measure of center
    that is the middle value when the original data
    values are arranged in order of increasing (or
    decreasing) magnitude. The median is often
    denoted by $\tilde{x}$.

40
Mode
  • The mode of a data set is the value that occurs
    most frequently.
  • When two values occur with the same greatest
    frequency, each one is a mode and the data set is
    bimodal.
  • When more than two values occur with the same
    greatest frequency, each is a mode and the data
    set is said to be multimodal.
  • When no value is repeated, we say that there is
    no mode.
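The four cases above (one mode, bimodal, multimodal, no mode) can be sketched as follows. The helper names `modes` and `describe_modes` are hypothetical, and the function assumes a non-empty data set.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the greatest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # no value is repeated: no mode
    return sorted(v for v, c in counts.items() if c == highest)

def describe_modes(values):
    """Label a data set using the terminology from the slide."""
    found = modes(values)
    if not found:
        return "no mode"
    if len(found) == 1:
        return "one mode"
    return "bimodal" if len(found) == 2 else "multimodal"

print(modes([1, 2, 2, 3]), describe_modes([1, 2, 2, 3]))
print(modes([1, 1, 2, 2, 3]), describe_modes([1, 1, 2, 2, 3]))
print(modes([1, 2, 3]), describe_modes([1, 2, 3]))
```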

41
Midrange
  • The midrange is the measure of center that is the
    value midway between the maximum and minimum
    values in the original data set. It is found by
    adding the maximum data value to the minimum data
    value and then dividing the sum by 2, that is,
    $\text{midrange} = \frac{\text{maximum value} + \text{minimum value}}{2}$.
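For a quick check of the mean, median, and midrange on a small data set, a sketch using Python's standard `statistics` module is shown here. The data values are hypothetical, not the bear lengths.

```python
from statistics import mean, median

data = [36, 41, 43, 46, 52, 53, 53, 60, 61, 72]  # hypothetical values

x_bar = mean(data)                        # sum of the values divided by n
x_tilde = median(data)                    # middle value of the sorted data
midrange = (max(data) + min(data)) / 2    # midway between maximum and minimum

print(f"mean = {x_bar:.1f}, median = {x_tilde:.1f}, midrange = {midrange:.1f}")
```

Because the sample has an even number of values, the median is the mean of the two middle values (52 and 53). The results are printed with one more decimal place than the original whole-number data.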

42
Round-Off Rule
  • A simple rule for rounding answers is this:
  • Carry one more decimal place than is present in
    the original set of values.

43
Example
  • Using Data Set 6: Bears, find the four measures of
    center for the lengths of the bears in the sample.

44
The Best Measure of Center

45
Weighted Mean
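  • A weighted mean of x values is computed with the
    different values assigned different weights,
    denoted by w, as given in
    $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$.
    In particular, the mean of a frequency
    distribution can be found using
    $\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$,
    where x is a class midpoint and f is its class
    frequency.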
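A minimal sketch of the weighted mean, applied to a frequency distribution by treating class midpoints as the x values and class frequencies as the weights. The function name `weighted_mean` and all numbers are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Mean of a frequency distribution: midpoints weighted by frequencies.
midpoints = [40.5, 50.5, 60.5, 70.5]   # hypothetical class midpoints
frequencies = [3, 7, 8, 2]             # hypothetical class frequencies
grouped_mean = weighted_mean(midpoints, frequencies)
print(grouped_mean)

# A course grade where the final exam counts three times as much as a quiz.
print(weighted_mean([80, 90], [1, 3]))
```

A mean computed from class midpoints only approximates the mean of the raw data, because every value in a class is treated as if it sat at the midpoint.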

46
Skewness and Symmetry
  • A distribution of data is skewed if it is not
    symmetric and extends more to one side than the
    other. A distribution of data is symmetric if the
    left half of its histogram is roughly a mirror
    image of its right half.

47
Measures of Variation
48
Range
  • The range of a set of data is the difference
    between the maximum value and the minimum value.

49
Measuring Deviation
  • Find the mean for each of the following two sets
    of data

50
Measuring Deviation
  • Calculate the total deviation for each of the two
    sets of data

51
Measuring Deviation
  • Calculate the total deviation for each of the two
    sets of data

52
Standard Deviation
  • The standard deviation of a set of sample values
    is a measure of variation of values about the
    mean. It is a type of average deviation of values
    from the mean that is calculated by
    $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$.

53
Standard Deviation of a Population
  • The standard deviation of a population is
    calculated by
    $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$.

54
Variance of a Sample and Population
  • The variance of a set of values is a measure of
    variation equal to the square of the standard
    deviation.
  • Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
  • Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
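The only difference between the sample and population formulas is the divisor (n - 1 versus N). The sketch below makes that explicit; the function names and the data are hypothetical.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values with mean 5

s = math.sqrt(sample_variance(data))          # sample standard deviation
sigma = math.sqrt(population_variance(data))  # population standard deviation
print(f"s = {s:.3f}, sigma = {sigma:.3f}")
```

The squared deviations here sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7. Python's standard `statistics` module offers `stdev`, `pstdev`, `variance`, and `pvariance` for the same calculations.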

55
Notation
  • Sample standard deviation: $s$
  • Sample variance: $s^2$
  • Population standard deviation: $\sigma$
  • Population variance: $\sigma^2$

Note: Articles in professional journals and
reports often use SD for standard deviation and
VAR for variance.
56
Round-Off Rule
  • We use the same round-off rule given in the
    previous section:
  • Carry one more decimal place than is present in
    the original set of values.
  • Round only the final answer, not in the middle of
    a calculation. (If it becomes absolutely
    necessary to round in the middle, carry at least
    twice as many decimal places as will be used in
    the final answer.)

57
Example
  • Using Data Set 6: Bears, find the three measures of
    variation for the lengths of the bears in the
    sample.

58
Coefficient of Variation
  • The coefficient of variation (or CV) for a set of
    nonnegative sample or population data, expressed
    as a percent, describes the standard deviation
    relative to the mean, and is given by the
    following:
    Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
    Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
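Because the coefficient of variation divides the standard deviation by the mean, it has no units, which is what makes it useful for comparing data sets measured on different scales. A sketch for sample data follows; the function name and values are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample CV as a percent: (s / x-bar) * 100."""
    return stdev(values) / mean(values) * 100

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical measurements
rescaled = [x * 10 for x in data]         # same data in a unit 10 times smaller

cv = coefficient_of_variation(data)
cv_rescaled = coefficient_of_variation(rescaled)
print(f"CV = {cv:.1f}%, CV after rescaling = {cv_rescaled:.1f}%")
```

Rescaling the data changes the mean and the standard deviation by the same factor, so the CV is unchanged.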

59
Range Rule of Thumb
  • For Estimating a Value of the Standard Deviation
    s: To roughly estimate the standard deviation
    from a collection of known sample data, use
    $s \approx \frac{\text{range}}{4}$,
    where range = (maximum value) - (minimum value).

60
Range Rule of Thumb
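  • For Interpreting a Known Value of the Standard
    Deviation: If the standard deviation is known,
    use it to find rough estimates of the minimum and
    maximum usual sample values by using the
    following:
    minimum usual value = (mean) - 2 × (standard deviation)
    maximum usual value = (mean) + 2 × (standard deviation)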
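Both uses of the range rule of thumb are short calculations. The sketch below assumes the usual form of the rule (standard deviation roughly range / 4, and usual values within 2 standard deviations of the mean); the function names and numbers are hypothetical.

```python
def estimate_sd(values):
    """Rough estimate of the standard deviation: range divided by 4."""
    return (max(values) - min(values)) / 4

def usual_values(mean, sd):
    """Rough minimum and maximum 'usual' values: mean -/+ 2 standard deviations."""
    return mean - 2 * sd, mean + 2 * sd

data = [36, 41, 43, 46, 52, 53, 53, 60, 61, 72]  # hypothetical values
rough_sd = estimate_sd(data)
low, high = usual_values(50, 5)                  # hypothetical mean 50, sd 5
print(f"estimated s = {rough_sd}, usual values from {low} to {high}")
```

The estimate is deliberately crude: it uses only the two extreme values, so a single outlier can distort it badly.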

61
Example
  • Use the Range Rule of Thumb to find the maximum
    and minimum usual values for the lengths of our
    bears.

62
Empirical (or 68-95-99.7) Rule for Data with a
Bell-Shaped Distribution
  • Another rule that is helpful in interpreting
    values for a standard deviation is the empirical
    rule. This rule states that for data sets having
    a distribution that is approximately bell-shaped,
    the following properties apply.
  • About 68% of all values fall within 1 standard
    deviation of the mean.
  • About 95% of all values fall within 2 standard
    deviations of the mean.
  • About 99.7% of all values fall within 3 standard
    deviations of the mean.

63
Empirical (or 68-95-99.7) Rule for Data with a
Bell-Shaped Distribution

64
Chebyshev's Theorem
  • The proportion (or fraction) of any set of data
    lying within K standard deviations of the mean
    is always at least $1 - \frac{1}{K^2}$, where K is any
    positive number greater than 1.
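To see how Chebyshev's bound compares with the empirical rule, the sketch below evaluates 1 - 1/K^2 for a few values of K. The function name is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires K > 1")
    return 1 - 1 / k ** 2

for k in (1.5, 2, 3):
    print(f"K = {k}: at least {chebyshev_lower_bound(k):.1%} of the values")
```

For K = 2 the bound is 75%, and for K = 3 it is about 88.9%. These are lower than the 95% and 99.7% of the empirical rule because Chebyshev's theorem holds for any distribution, not only bell-shaped ones.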

65
Measures of Relative Standing
66
z Scores
  • A z score (or standardized value) is the number
    of standard deviations that a given value x is
    above or below the mean. It is found by using the
    following expressions:
    Sample: $z = \frac{x - \bar{x}}{s}$
    Population: $z = \frac{x - \mu}{\sigma}$
    (Round z to two decimal places.)

67
Interpreting z Scores
  • Ordinary values: $-2 \le z \le 2$
  • Unusual values: $z < -2$ or $z > 2$
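A z score is a one-line calculation. The sketch below also flags unusual values, assuming the common cutoff of 2 standard deviations (consistent with the range rule of thumb earlier in the deck); the function names and numbers are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return round((x - mean) / sd, 2)   # rounded to two decimal places

def is_unusual(z):
    """Assumed cutoff: a value is unusual if its z score is beyond -2 or 2."""
    return z < -2 or z > 2

# Hypothetical mean 50 and standard deviation 8.
for x in (46, 50, 70):
    z = z_score(x, 50, 8)
    print(f"x = {x}: z = {z}, unusual = {is_unusual(z)}")
```

The same function works for sample or population data; only the mean and standard deviation passed in differ.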

68
Percentiles
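  • The percentile that corresponds to a particular
    value x is given by
    $\text{percentile of } x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$.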
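A sketch of finding the percentile of a given value, assuming the usual definition (the number of values less than x, divided by the total number of values, times 100, rounded to a whole number). The function name and data are hypothetical.

```python
def percentile_of_value(values, x):
    """Percent of the data values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    # Python's round() sends exact .5 ties to the nearest even integer.
    return round(100 * below / len(values))

data = [36, 41, 43, 46, 52, 53, 53, 60, 61, 72]  # hypothetical values
print(percentile_of_value(data, 57))
```

Seven of the ten values are below 57, so 57 sits at the 70th percentile of this sample.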

69
Percentiles
  • Notation:
  • n: total number of values in the data set
  • k: percentile being used
  • L: locator that gives the position of a value
  • $P_k$: kth percentile
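The transcript does not preserve the procedure that uses this notation, so the following is only a sketch of one common textbook convention, stated here as an assumption: sort the data, compute L = (k / 100) · n, and if L is a whole number take the value midway between the Lth and (L+1)th values; otherwise round L up and take the Lth value. Other software uses other conventions and can give slightly different answers.

```python
def percentile_value(values, k):
    """Value of the kth percentile (0 < k < 100, k an integer).

    Assumed convention: locator L = k * n / 100; whole L -> average of the
    Lth and (L+1)th sorted values, otherwise round L up.
    """
    if not 0 < k < 100:
        raise ValueError("k must be between 0 and 100 (exclusive)")
    data = sorted(values)
    n = len(data)
    whole, remainder = divmod(k * n, 100)   # L = whole + remainder / 100
    if remainder == 0:
        # L is a whole number: midway between the Lth and (L+1)th values.
        return (data[whole - 1] + data[whole]) / 2
    # L is not whole: round up, i.e. the (whole + 1)th value (index `whole`).
    return data[whole]

data = [36, 41, 43, 46, 52, 53, 53, 60, 61, 72]  # hypothetical values
p25 = percentile_value(data, 25)   # L = 2.5, rounded up to the 3rd value
p50 = percentile_value(data, 50)   # L = 5, midway between 5th and 6th values
print(p25, p50)
```

Under this convention the 50th percentile agrees with the median, and the 25th percentile is the first quartile.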

70
Percentiles

71
Quartiles
  • Q1 (First Quartile): Separates the bottom 25%
    from the top 75%.
  • Q2 (Second Quartile): Same as the median;
    separates the bottom 50% from the top 50%.
  • Q3 (Third Quartile): Separates the bottom 75%
    from the top 25%.

72
Example
  • Use the bear data to find:
  • the percentile corresponding to a length of 57.0
    in,
  • the length corresponding to the 25th percentile,
  • the length corresponding to the first quartile.

73
Interquartile Range (IQR)
  • The interquartile range (IQR) is given by $\text{IQR} = Q_3 - Q_1$.

74
Exploratory Data Analysis (EDA)
75
Exploratory Data Analysis (EDA)
  • Exploratory data analysis is the process of using
    statistical tools (such as graphs, measures of
    center, measures of variation) to investigate
    data sets in order to understand their important
    characteristics.

76
Outliers
  • Informally, an outlier is a value that is located
    very far away from almost all other values.
  • An outlier can have a dramatic effect on the
    mean.
  • An outlier can have a dramatic effect on the
    standard deviation.
  • An outlier can have a dramatic effect on the
    scale of the histogram so the true nature of the
    distribution is totally obscured.

77
Boxplots
  • For a set of data, the 5-number summary consists
    of the minimum value, the first quartile Q1, the
    median (or second quartile Q2), the third
    quartile Q3, and the maximum value.
  • A boxplot (or box-and-whisker diagram) is a graph
    of a data set that consists of a line extending
    from the minimum value to the maximum value, and
    a box with lines drawn at the first quartile Q1,
    the median, and the third quartile Q3.
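A 5-number summary can be sketched with Python's standard `statistics.quantiles`. Quartile conventions differ between textbooks and software; this example assumes the `"inclusive"` interpolation method, so its quartiles may not match a hand calculation that uses a different rule. The function name and data are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # n=4 yields the three cut points Q1, Q2 (median), Q3.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [36, 41, 43, 46, 52, 53, 53, 60, 61, 72]  # hypothetical values
summary = five_number_summary(data)
print(summary)
```

These five numbers are all that is needed to draw a basic boxplot: the box spans Q1 to Q3 with a line at the median, and the line extends from the minimum to the maximum.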

78
Example
  • Find a five number summary for the bear data, and
    use this information to draw a boxplot of the
    lengths of the bears.

79
Outliers
  • More formally, a data value is an outlier if it
    is:
  • above Q3 by an amount greater than 1.5 × IQR, or
  • below Q1 by an amount greater than 1.5 × IQR.

80
Modified Boxplot
  • A modified boxplot is a boxplot constructed with
    these modifications:
  • A special symbol (such as an asterisk) is used to
    identify outliers as defined here, and
  • the solid horizontal line extends only as far as
    the minimum data value that is not an outlier and
    the maximum data value that is not an outlier.
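The 1.5 × IQR criterion and the modified-boxplot whiskers can be sketched together. As an assumption, quartiles are computed with `statistics.quantiles` using the `"inclusive"` method; a different quartile convention can move the fences slightly. The function name and data are hypothetical.

```python
from statistics import quantiles

def modified_boxplot_parts(values):
    """Return the outliers and the whisker ends for a modified boxplot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr     # values below this are outliers
    high_fence = q3 + 1.5 * iqr    # values above this are outliers
    outliers = [v for v in values if v < low_fence or v > high_fence]
    inside = [v for v in values if low_fence <= v <= high_fence]
    # Whiskers stop at the most extreme values that are not outliers.
    return {"outliers": outliers, "whiskers": (min(inside), max(inside))}

data = [36, 41, 43, 46, 52, 53, 53, 60, 61, 110]  # hypothetical, one extreme value
parts = modified_boxplot_parts(data)
print(parts)
```

Here Q1 = 43.75 and Q3 = 58.25, so IQR = 14.5 and the fences are 22 and 80. The value 110 is flagged as an outlier, and the upper whisker stops at 61.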

81
Example
  • Determine if the bear data contains any
    outliers, and if necessary, draw a modified
    boxplot of the lengths of the bears.