DESCRIPTIVE STATISTICS - PowerPoint PPT Presentation

DESCRIPTIVE STATISTICS

Description:

An insurance company researcher conducted a survey on the number of car thefts ... back (mixture) stem and leaf plot and compare the distribution of the two groups. ... – PowerPoint PPT presentation

Slides: 58
Provided by: notesU

Transcript and Presenter's Notes


1
DESCRIPTIVE STATISTICS
  • BCT2053
  • CHAPTER TWO

2
CONTENT
  • 2.1 Statistical Tables and Charts
  • 2.2 Summary Statistics (Data Description)
  • Measures of Central Tendency
  • Measures of Variation
  • Measures of Position
  • 2.3 Exploratory Data Analysis

3
OBJECTIVES
At the end of this chapter, you should be able to
  • Organize data using frequency distributions.
  • Represent data in frequency distributions
    graphically using histograms, frequency polygons,
    and ogives.
  • Represent data using Pareto charts, time series
    graphs, and pie graphs.
  • Draw and interpret a stem and leaf plot.
  • Summarize data using measures of central
    tendency, such as the mean, median, mode, and
    midrange.
  • Describe data using measures of variation, such
    as the range, variance, and standard deviation.
  • Identify the position of a data value in a data
    set, using various measures of position, such as
    percentiles, deciles, and quartiles.
  • Use the techniques of exploratory data analysis,
    including boxplots and five-number summaries, to
    discover various aspects of data.

4
2.1 Statistical Tables and Charts
  • A. The raw data
  • Newly collected data, from any source, that have
    not yet been organized
  • B. The data array
  • An arrangement of data items in either ascending
    (lowest-highest) or descending (highest-lowest)
    order.
  • Frequency Tables (Frequency Distribution)
  • Groups data items into classes and then records
    the number of items that appear in each class.
  • Charts and Graphs
  • Represent data in frequency distributions
    graphically using histograms, frequency polygons,
    and ogives.
  • Represent data using Pareto charts, time series
    graphs, and pie graphs

5
Example of a frequency distribution (general)
[Table with the lower class limit and the upper class limit labeled]
6
  • The lower class limit represents the smallest
    data value that can be included in the class.
  • The upper class limit represents the largest
    value that can be included in the class.
  • The class boundaries are used to separate the
    classes so that there are no gaps in the
    frequency distribution.
  • Rule of thumb: class limits should have the same
    decimal place value as the data, but the class
    boundaries have one additional place value and
    end in a 5.
  • The class width for a class in a frequency
    distribution is found by subtracting the lower
    (or upper) class limit of one class from the
    lower (or upper) class limit of the next class.
  • The class midpoint is found by adding the lower
    and upper boundaries (or limits) and dividing by
    2.

7
  • Frequency distribution (frequency table): how to
    construct one (in general)
  • 1. Determine the number of classes that will be
    used to group the data.
  • a. Minimum 5, maximum 20
  • b. The actual number depends on such factors as
  • i. The number of observations being grouped
  • ii. The purpose of the distribution
  • iii. The arbitrary preferences of the analyst
  • c. Use classes that give you a good view of the
    data pattern and enable you to gain insights
    into the information that is there
  • d. All data items from the smallest to the
    largest must be included
  • e. Each item must be assigned to one and only one
    class
  • 2. Determine the width (class interval) of
    these classes
  • a. The widths should be equal
  • b. Width = range / number of classes
  • c. Whenever possible an open-ended class interval
    (one with an unspecified upper or lower

8
Types of Frequency Distribution
  • Ungrouped Frequency Distribution
  • Used for numerical data
  • The range of data is small
  • Categorical Frequency Distribution
  • Used for data that can be placed in specific
    categories such as nominal or ordinal level data
  • Grouped Frequency Distribution
  • Used for numerical data too
  • The range of the data is large

9
Example Categorical Frequency Distribution
Twenty-five army inductees were given a blood
test to determine their blood type. The data set
is
A B B AB O O O B AB B B B O A O A O O O AB AB A O B A
Construct a frequency distribution for the data.
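The tally for this example can be sketched in a few lines of Python. This is only an illustration of the counting step; the variable names are ours, and the percent column is an optional extra that the slide does not ask for.

```python
from collections import Counter

# Blood types of the 25 inductees, as listed on the slide.
blood_types = ("A B B AB O O O B AB B B B O A O "
               "A O O O AB AB A O B A").split()

counts = Counter(blood_types)      # class -> frequency
total = sum(counts.values())

# One row per class: class, frequency, percent of the total.
for blood_type in ("A", "B", "O", "AB"):
    freq = counts[blood_type]
    print(f"{blood_type:<3}{freq:>3}{100 * freq / total:>6.0f}%")
```

Each inductee falls into exactly one class, so the frequencies must add up to 25.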
10
Constructing an Ungrouped or Grouped Frequency
Distribution
  • STEP 1 Determine the classes.
  • Find the highest and lowest value.
  • Find the range.
  • Select the number of classes desired.
  • Find the width by dividing the range by the
    number of classes and rounding up.
  • Select a starting point (usually the lowest
    value or any convenient number less than the
    lowest value); add the width to get the lower
    limits.
  • Find the upper class limits.
  • Find the boundaries.
  • STEP 2 Tally the data.
  • STEP 3 Find the numerical frequencies from the
    tallies.
  • STEP 4 Find the cumulative frequencies.
11
Example Ungrouped Frequency Distribution
The data shown here represent the number of miles
per gallon that 30 selected four-wheel-drive
sports utility vehicles obtained in city driving.
Construct a frequency distribution.
12 17 12 14 16 18 16 18 12 16
17 15 15 16 12 15 16 16 12 14
15 12 15 15 19 13 16 18 16 14
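Because the range of these data is small, each distinct value can serve as its own class. A minimal Python sketch of the tally, with a cumulative frequency column added (names are illustrative):

```python
from collections import Counter
from itertools import accumulate

# Miles per gallon for the 30 vehicles, as listed on the slide.
mpg = [12, 17, 12, 14, 16, 18, 16, 18, 12, 16,
       17, 15, 15, 16, 12, 15, 16, 16, 12, 14,
       15, 12, 15, 15, 19, 13, 16, 18, 16, 14]

counts = Counter(mpg)
classes = list(range(min(mpg), max(mpg) + 1))   # one class per value
frequencies = [counts[c] for c in classes]
cumulative = list(accumulate(frequencies))      # running total of frequencies

for c, f, cf in zip(classes, frequencies, cumulative):
    print(f"{c:>3} {f:>3} {cf:>3}")
```

The last cumulative frequency should equal the number of observations, which is a quick check that no value was missed.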
12
Example Grouped Frequency Distribution
  • These data represent the record high
    temperatures for each of the 50 states.
  • 112 100 127 120 134 118 105 110 109 112
  • 110 118 117 116 118 122 114 114 105 109
  • 107 112 114 115 118 117 118 122 106 110
  • 116 108 110 121 113 120 119 111 104 111
  • 120 113 120 117 105 110 118 112 114 114
  • Construct a grouped frequency distribution for
    the data using 7 classes.
  • Organize the data into a frequency distribution
    table with 5 classes. Use 100 to less than 107
    for the first class.
  • Construct a frequency distribution for these
    data.
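The 7-class version of this example can be sketched as follows. The function `grouped_frequency` is a hypothetical helper, not part of the course material; it assumes whole-number data, starts the first class at the lowest value, and always rounds the width up.

```python
# Record high temperatures for the 50 states, as listed on the slide.
temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

def grouped_frequency(data, num_classes):
    """Return (lower limit, upper limit, frequency) for each class."""
    low, high = min(data), max(data)
    # Width = range / number of classes, rounded up (even if it divides evenly).
    width = (high - low) // num_classes + 1
    table = []
    for i in range(num_classes):
        lower = low + i * width
        upper = lower + width - 1          # limits use the data's place value
        freq = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, freq))
    return table

table = grouped_frequency(temps, 7)
for lower, upper, freq in table:
    # Boundaries sit half a unit outside the limits, so classes have no gaps.
    print(f"{lower}-{upper}  ({lower - 0.5}-{upper + 0.5})  {freq}")
```

With these data the range is 134 - 100 = 34, so the width is 34 / 7 rounded up to 5 and the first class is 100-104.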

13
Why Construct Frequency Distributions?
To organize the data in a meaningful,
intelligible way.
To enable the reader to make comparisons among
different data sets.
To facilitate computational procedures for
measures of average and spread.
To enable the reader to determine the nature or
shape of the distribution.
To enable the researcher to draw charts and
graphs for the presentation of data.
14
Types of Graphs and Charts
  • The purpose of graphs in statistics is to convey
    the data to the viewer in pictorial form.
  • Graphs are useful in getting the audience's
    attention in a publication or a presentation.

16
A. Histogram, Frequency Polygon, Ogive
  • Histogram
  • A graph that displays the data by using vertical
    bars of various heights to represent the
    frequencies

17
  • Frequency Polygon
  • A graph that displays the data by using lines
    that connect points plotted for the frequencies
    at the midpoints of the classes. The frequencies
    represent the heights of the midpoints.

18
  • Ogive (Cumulative Frequency Graph)
  • A graph that represents the cumulative
    frequencies for the classes in a frequency
    distribution

19
Procedure to Construct a Histogram, Frequency
Polygon, and Ogive
  • STEP 1 Draw and label the x and y axes.
  • STEP 2 Choose a suitable scale for the
    frequencies or cumulative frequencies, and label
    it on the y axis.
  • STEP 3 Represent the class boundaries for the
    histogram or ogive, or the midpoint for the
    frequency polygon, on the x axis.
  • STEP 4 Plot the points and then draw the bars or
    lines.

20
Example
These data represent the record high temperatures
for each of the 50 states. Construct a grouped
frequency distribution for the data using 7
classes. Then, construct a histogram, frequency
polygon and ogive for these data.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
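Before drawing, it can help to compute the coordinates each graph needs. The sketch below assumes the seven classes 100-104, 105-109, ..., 130-134 (width 5, starting at the lowest value) and follows the common convention of plotting the ogive at the upper class boundaries. It only prepares the numbers; the drawing itself is left to paper or any plotting tool.

```python
from itertools import accumulate

temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

# Assumed class limits: 100-104, 105-109, ..., 130-134.
classes = [(lower, lower + 4) for lower in range(100, 135, 5)]
frequencies = [sum(lo <= t <= hi for t in temps) for lo, hi in classes]

# Histogram: one bar per class, spanning its class boundaries.
histogram_bars = [(lo - 0.5, hi + 0.5, f)
                  for (lo, hi), f in zip(classes, frequencies)]

# Frequency polygon: the frequency plotted at each class midpoint.
polygon_points = [((lo + hi) / 2, f)
                  for (lo, hi), f in zip(classes, frequencies)]

# Ogive: the cumulative frequency plotted at each upper class boundary.
ogive_points = [(hi + 0.5, cf)
                for (lo, hi), cf in zip(classes, accumulate(frequencies))]

print(histogram_bars)
print(polygon_points)
print(ogive_points)
```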
21
Distribution Shapes
  • Bell Shaped
  • Has a single peak and tapers off at either end
  • Approximately symmetric
  • It is roughly the same on both sides of a
    line running through the center
  • J-Shaped
  • Has a few data values on the left side, and the
    frequencies increase as one moves to the right
  • Uniform
  • Basically flat/rectangular
  • Reverse J-Shaped
  • Opposite of J-shaped
  • Has a few data values on the right side, and the
    frequencies increase as one moves to the left

22
Distribution Shapes
  • Right Skewed
  • The peak is to the left
  • The data values taper off to the right
  • Bimodal
  • Has two peaks of the same height
  • Left Skewed
  • The peak is to the right
  • The data values taper off to the left
  • U-Shaped
  • The shape is a U

23
B. Pareto Chart
  • Used to represent a frequency distribution for a
    categorical variable. The frequencies are
    displayed by the heights of vertical bars, which
    are arranged in order from highest to lowest.

24
Example
Twenty-five army inductees were given a blood
test to determine their blood type. The data set
is
A B B AB O O O B AB B B B O A O A O O O AB AB A O B A
Construct a Pareto chart for the data.
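The key step for a Pareto chart is ordering the categories from the highest frequency to the lowest. A short sketch that does this and prints a rough text version of the bars (the text rendering is only a stand-in for a drawn chart):

```python
from collections import Counter

blood_types = ("A B B AB O O O B AB B B B O A O "
               "A O O O AB AB A O B A").split()

# most_common() returns (category, frequency) pairs, highest frequency first.
pareto_order = Counter(blood_types).most_common()

for category, freq in pareto_order:
    print(f"{category:<3}{'#' * freq} {freq}")
```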
25
C. Time Series Graph
  • Represents data that occur over a specified
    period of time
  • STEP 1 Draw and label the x and y axes.
  • STEP 2 Label the x axis for years and the y
    axis for the number of theaters.
  • STEP 3 Plot each point according to the
    table.
  • STEP 4 Draw line segments connecting
    adjacent points. Do not try to fit a
    smooth curve through the data points.
  • We look for a trend or pattern that occurs over
    the time period (ascending or descending) and at
    the slope or steepness of the line (how fast
    the values increase or decrease).

Two time series can be drawn on the same axes for
comparison (compound time series graph).
26
Example
In 1958, there were more than 4000 outdoor
drive-in theaters. The number of these theaters
has changed over the years. Draw a time series
graph for the data and summarize the findings.
Year  Number
1988  1497
1990  910
1992  870
1994  859
1996  826
1998  750
2000  637
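To summarize the findings, one option is to compute the change between adjacent points, which corresponds to the slope of each line segment in the graph. A small sketch (variable names are ours):

```python
# Number of drive-in theaters by year, from the table on the slide.
theaters = {1988: 1497, 1990: 910, 1992: 870, 1994: 859,
            1996: 826, 1998: 750, 2000: 637}

years = sorted(theaters)

# Change from each year to the next one listed.
changes = [(later, theaters[later] - theaters[earlier])
           for earlier, later in zip(years, years[1:])]

for year, change in changes:
    print(f"{year}: {change:+d}")

# True when every segment of the graph slopes downward.
always_decreasing = all(change < 0 for _, change in changes)
print("Decreasing over the whole period:", always_decreasing)
```

The largest drop is between 1988 and 1990; after that the decline continues but is much less steep.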
27
D. Pie Chart
A pie graph is a circle that is divided into
sections or wedges according to the percentage of
frequencies in each category of the
distribution. The purpose of the pie graph is to
show the relationship of the parts to the whole
by visually comparing the sizes of the sectors.
Percentages or proportions can be used. The
variable is nominal or categorical.
28
Example
Twenty-five army inductees were given a blood
test to determine their blood type. The data set
is
A B B AB O O O B AB B B B O A O A O O O AB AB A O B A
Construct a pie chart for the data.
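Each sector of the pie is proportional to its class frequency: the percentage is f / n × 100 and the central angle is f / n × 360 degrees. A minimal sketch of that arithmetic:

```python
from collections import Counter

blood_types = ("A B B AB O O O B AB B B B O A O "
               "A O O O AB AB A O B A").split()

counts = Counter(blood_types)
total = sum(counts.values())

# For each category: (percent of the whole, central angle in degrees).
sectors = {category: (100 * freq / total, 360 * freq / total)
           for category, freq in counts.items()}

for category, (percent, degrees) in sectors.items():
    print(f"{category:<3}{percent:>5.0f}%{degrees:>7.1f} degrees")
```

The percentages should sum to 100 and the angles to 360, which is a useful check before drawing.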
29
Stem-and-Leaf Plots
[Figure: plot layout with the columns Stem | Leaf]
  • A stem-and-leaf plot is a data plot that uses
    part of a data value as the stem (the leading
    digit) and part of the data value as the leaf
    (the trailing digit) to form groups or classes.
  • It has the advantage over a grouped frequency
    distribution of retaining the actual data while
    showing them in graphic form.
  • Sometimes we can construct a back-to-back
    (mixture) plot to compare two groups.

[Figure: back-to-back layout with the columns Leaf | Stem | Leaf]
30
Example
An insurance company researcher conducted a
survey on the number of car thefts in a large
city for a period of 30 days last summer. The raw
data are shown below. Construct a stem and leaf
plot. 52 62 51 50 69 58 77 66
53 57 75 56 55 67 73 79 59 68
65 72 57 51 63 69 75 65 53 78
66 55
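For two-digit data like these, the stem is the tens digit and the leaf is the units digit. A short sketch of the construction (the helper name `stem_and_leaf` is ours):

```python
from collections import defaultdict

# Daily car thefts over 30 days, as listed on the slide.
thefts = [52, 62, 51, 50, 69, 58, 77, 66, 53, 57,
          75, 56, 55, 67, 73, 79, 59, 68, 65, 72,
          57, 51, 63, 69, 75, 65, 53, 78, 66, 55]

def stem_and_leaf(values):
    """Group two-digit values by tens digit; leaves come out in order."""
    plot = defaultdict(list)
    for value in sorted(values):
        plot[value // 10].append(value % 10)
    return dict(plot)

plot = stem_and_leaf(thefts)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(map(str, plot[stem]))}")
```

The row lengths show the shape of the distribution: most days fall in the 50s, and the counts taper off through the 60s and 70s.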
31
Example
  • The data shown represents the percentage of
    unemployed males and females in 1995 for a sample
    of countries of the world. Using the whole
    numbers as stems and the decimals as leaves,
    construct a back-to-back (mixture) stem and leaf
    plot and compare the distribution of the two
    groups.
  • Females               Males
  • 8.0 3.7 8.6 5.0       8.8 1.9 5.6 4.6
  • 7.0 3.3 8.6 3.2       1.5 6.6 5.6 0.3
  • 8.8 6.8 9.2 5.9       2.2 5.6 3.1 5.9
  • 7.2 4.6 5.6 5.3       9.8 8.7 6.0 5.2
  • 7.7 8.0 8.7 0.5       4.4 9.6 6.6 6.0
  • 6.5 3.4 3.0 9.4       4.6 3.1 4.1 7.7
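A back-to-back plot shares one column of stems, with one group's leaves running to the left and the other's to the right. The sketch below assumes that the first four values in each row of the table belong to the females and the last four to the males; that split is our reading of the flattened table, not something the transcript states.

```python
# Unemployment percentages, split by the assumed column layout.
females = [8.0, 3.7, 8.6, 5.0, 7.0, 3.3, 8.6, 3.2, 8.8, 6.8, 9.2, 5.9,
           7.2, 4.6, 5.6, 5.3, 7.7, 8.0, 8.7, 0.5, 6.5, 3.4, 3.0, 9.4]
males = [8.8, 1.9, 5.6, 4.6, 1.5, 6.6, 5.6, 0.3, 2.2, 5.6, 3.1, 5.9,
         9.8, 8.7, 6.0, 5.2, 4.4, 9.6, 6.6, 6.0, 4.6, 3.1, 4.1, 7.7]

def leaves_by_stem(values):
    """Whole number part is the stem, first decimal digit is the leaf."""
    plot = {}
    # Work in tenths (integers) to avoid floating-point digit errors.
    for tenths in sorted(round(v * 10) for v in values):
        plot.setdefault(tenths // 10, []).append(tenths % 10)
    return plot

f_plot, m_plot = leaves_by_stem(females), leaves_by_stem(males)

print(f"{'Females':>14} |   | Males")
for stem in range(10):
    # Female leaves are reversed so they grow outward from the stem.
    left = " ".join(str(d) for d in reversed(f_plot.get(stem, [])))
    right = " ".join(str(d) for d in m_plot.get(stem, []))
    print(f"{left:>14} | {stem} | {right}")
```

Reading the two sides against the shared stems makes it easy to compare where each group's values are concentrated.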

32
Conclusions (2.1)
  • Data can be organized in some meaningful way
    using frequency distributions. Once the frequency
    distribution is constructed, the representation
    of the data by graphs is a simple task.

33
2.2 Summary Statistics (Data Description)
  • Statistical methods can be used to summarize
    data.
  • Measures of average are also called measures of
    central tendency and include the mean, median,
    mode, and midrange.
  • Measures that determine the spread of data values
    are called measures of variation or measures of
    dispersion and include the range, variance, and
    standard deviation.
  • Measures of position tell where a specific data
    value falls within the data set or its relative
    position in comparison with other data values.
  • The most common measures of position are
    percentiles, deciles, and quartiles.
  • The measures of central tendency, variation, and
    position are part of what is called traditional
    statistics. These statistics are typically used
    to confirm conjectures about the data.

34
  • Measures of Central Tendency

Mean: the sum of the values divided by the total
number of values.
Population Mean: $\mu = \frac{\sum X}{N}$
Sample Mean: $\bar{X} = \frac{\sum X}{n}$
Example: 9 2 1 4 3 3 7 5 8 6
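As a quick check of the example, the mean is the sum of the ten values divided by ten. A minimal sketch:

```python
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

total = sum(data)          # sum of the values: 48
n = len(data)              # number of values: 10
mean = total / n

print(mean)                # 4.8
```

Note that 4.8 is not one of the data values, which illustrates that the mean need not appear in the data set.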
35
Mean
  • One computes the mean by using all the values of
    the data.
  • The mean varies less than the median or mode when
    samples are taken from the same population and
    all three measures are computed for these
    samples.
  • The mean is used in computing other statistics,
    such as variance.
  • The mean for the data set is unique, and not
    necessarily one of the data values.
  • The mean cannot be computed for an open-ended
    frequency distribution.
  • The mean is affected by extremely high or low
    values and may not be the appropriate average to
    use in these situations

36
  • Measures of Central Tendency

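Median: the middle number of n ordered data
(smallest to largest)
If n is odd: the median is the value in position $(n+1)/2$
If n is even: the median is the average of the values in positions $n/2$ and $n/2 + 1$
Example: 9 2 1 4 3 3 7 5 8 6
Example: 9 2 1 3 3 7 5 8 6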
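The odd and even cases can be sketched with one small function. It uses the usual textbook rule (middle value for odd n, average of the two middle values for even n); the function name is ours.

```python
def median(values):
    """Middle value of the ordered data (average of two middles if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # position (n + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n / 2 and n / 2 + 1

print(median([9, 2, 1, 4, 3, 3, 7, 5, 8, 6]))  # n = 10 (even): (4 + 5) / 2 = 4.5
print(median([9, 2, 1, 3, 3, 7, 5, 8, 6]))     # n = 9 (odd): 5
```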
37
Median
  • The median is used when one must find the center
    or middle value of a data set.
  • The median is used when one must determine
    whether the data values fall into the upper half
    or lower half of the distribution.
  • The median is used to find the average of an
    open-ended distribution.
  • The median is affected less than the mean by
    extremely high or extremely low values.

38
  • Measures of Central Tendency

Mode: the most commonly occurring value in a data
series
  • The mode is used when the most typical case is
    desired.
  • The mode is the easiest average to compute.
  • The mode can be used when the data are nominal,
    such as religious preference, gender, or
    political affiliation.
  • The mode is not always unique. A data set can
    have more than one mode, or the mode may not
    exist for a data set.

Example 9 2 1 4 3 3 7 5 8 6
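Since a data set can have one mode, several modes, or none, a sketch of the mode is easiest to write as a function that returns a list. The convention that "no mode" means every value occurs exactly once is an assumption here, and the second and third data sets are made up purely for illustration.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency, or [] if none repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []              # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([9, 2, 1, 4, 3, 3, 7, 5, 8, 6]))  # [3]
print(modes([1, 1, 2, 2, 3]))                 # [1, 2], two modes (illustrative data)
print(modes([1, 2, 3]))                       # [], no mode (illustrative data)
```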
39
  • Measures of Central Tendency

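Midrange: the sum of the lowest and highest values
divided by 2, $MR = \frac{\text{lowest value} + \text{highest value}}{2}$.
It is a rough estimate of the middle and a very
rough estimate of the average, and it can be
affected by one extremely high or low value.
Example: 9 2 1 4 3 3 7 5 8 6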
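A sketch of the midrange for the example, using the usual definition (lowest value plus highest value, divided by 2). The second data set, with the 9 replaced by 90, is hypothetical and only shows how strongly a single extreme value moves the result.

```python
def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
print(midrange(data))                   # (1 + 9) / 2 = 5.0

# Hypothetical variation: one extreme value changes the midrange a lot.
with_extreme = [90, 2, 1, 4, 3, 3, 7, 5, 8, 6]
print(midrange(with_extreme))           # (1 + 90) / 2 = 45.5
```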
40
Types of Distribution
Symmetric
Positively skewed or right-skewed
Negatively skewed or left-skewed
41
  • Measures of Variation / Dispersion
  • Used when a measure of central tendency alone
    is not informative or not needed (e.g. two data
    sets have the same mean)
  • Gauges the variability that exists in a data
    set
  • Helps to form a judgment about how well the
    average value depicts the data
  • Shows the extent of the scatter so that steps
    may be taken to control the existing variation

42
  • Measures of Variation / Dispersion

Range is the difference between the highest
value and the lowest value in a data set. The
symbol R is used for the range.
R = highest value - lowest value
Example: 9 2 1 4 3 3 7 5 8 6
43
  • Measures of Variation / Dispersion

Variance is the average of the squares of the
distance each value is from the mean.
Standard Deviation is the square root of the
variance.
Population Variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Sample Variance: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Example: 9 2 1 4 3 3 7 5 8 6
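The example can be worked through step by step. This sketch uses the standard definitions: the population variance divides the sum of squared deviations by N, and the sample variance divides it by n - 1.

```python
import math

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

n = len(data)
mean = sum(data) / n                                 # 4.8

# Sum of the squared distances of each value from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in data)      # about 63.6

population_variance = sum_sq_dev / n                 # divide by N
sample_variance = sum_sq_dev / (n - 1)               # divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(round(population_variance, 2), round(population_sd, 2))
print(round(sample_variance, 2), round(sample_sd, 2))
```

The sample variance is always a little larger than the population variance for the same numbers, because it divides by n - 1 instead of N.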
44
Variance and Standard Deviation
  • Variances and standard deviations can be used to
    determine the spread of the data. If the variance
    or standard deviation is large, the data are more
    dispersed. The information is useful in comparing
    two or more data sets to determine which is more
    variable.
  • The measures of variance and standard deviation
    are used to determine the consistency of a
    variable.
  • The variance and standard deviation are used to
    determine the number of data values that fall
    within a specified interval in a distribution.
  • The variance and standard deviation are used
    quite often in inferential statistics.

45
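Chebyshev's theorem
The proportion of values from a data set that
fall within k standard deviations of the mean is
at least $1 - 1/k^2$, where k is a number greater
than 1.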
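The slide content for this theorem did not survive in the transcript, so the following is a sketch based on the standard statement: at least 1 - 1/k² of the values lie within k standard deviations of the mean, for k > 1. The code computes that bound and compares it with the actual proportion in the chapter's running example (the function names are ours).

```python
import math

def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def proportion_within(values, k):
    """Actual proportion of values within k population standard deviations."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    inside = sum(mean - k * sd <= x <= mean + k * sd for x in values)
    return inside / n

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
print(chebyshev_bound(2))          # 0.75
print(proportion_within(data, 2))  # 1.0
```

The bound is a guaranteed minimum for any distribution; real data sets, as here, often have a larger share of values inside the interval.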
46
Describing the position of the data value
  • Measures of Position

Percentile
Quartile
Decile
Example 1: 9 2 1 4 3 3 7 5 8 6
Example 2: 19 2 1 4 3 3 7 5 8 6
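The formulas on this slide did not survive in the transcript. As a hedged illustration, the sketch below uses one common textbook definition of the percentile rank of a value X: (number of values below X + 0.5) / n × 100. Other definitions exist and give slightly different answers, so treat the formula as an assumption rather than the course's own.

```python
def percentile_rank(values, x):
    """Percentile rank of x: (count below x + 0.5) / n * 100 (assumed formula)."""
    below = sum(v < x for v in values)
    return 100 * (below + 0.5) / len(values)

example_1 = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
example_2 = [19, 2, 1, 4, 3, 3, 7, 5, 8, 6]

print(percentile_rank(example_1, 8))    # 8 values lie below 8: 85.0
print(percentile_rank(example_2, 19))   # 9 values lie below 19: 95.0
```

Deciles and quartiles are special cases of percentiles: the 10th, 20th, ... percentiles are the deciles, and the 25th, 50th and 75th percentiles are Q1, Q2 and Q3.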
47
Example
  • Given
  • 9 2 1 4 3 7 5 4 6
  • What percentile is the value of 8?
  • Given
  • 9 22 11 14 13 3 7 15 18 16
  • What percentile is the value of 20?

48
Outliers
  • An outlier is an extremely high or an extremely
    low data value when compared with the rest of the
    data values.
  • Outliers can be the result of measurement or
    observational error.
  • When a distribution is normal or bell-shaped,
    data values that are beyond three standard
    deviations of the mean can be considered
    suspected outliers.

Example 9 22 11 14 13 3
7 15 18 16
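The three-standard-deviation rule can be checked directly for this example. The sketch uses the sample standard deviation from Python's `statistics` module; whether the sample or population version is intended is not stated on the slide, so that choice is an assumption.

```python
import statistics

data = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]

mean = statistics.mean(data)       # 12.8
sd = statistics.stdev(data)        # sample standard deviation, about 5.53

# Values beyond three standard deviations of the mean are suspected outliers.
low, high = mean - 3 * sd, mean + 3 * sd
suspected = [x for x in data if x < low or x > high]

print(f"interval: {low:.1f} to {high:.1f}")
print("suspected outliers:", suspected)
```

For these ten values the interval runs from roughly -3.8 to 29.4, so no value is flagged. The rule is only meaningful when the distribution is roughly bell-shaped.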
49
The measures of central tendency, variation, and
position for Grouped data
measures of central tendency
Mean Class
Median class
Mode class
50
measures of Variation
Population variance
Sample variance
51
measures of position
Quartile
Decile
Percentile
52
Example
Find the mean, median class, mode class, population
and sample variance, quartile, decile and percentile.
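The data and formulas for this example are missing from the transcript. As a stand-in, the sketch below applies the usual grouped-data formulas to the 7-class temperature distribution from section 2.1 (frequencies tallied from that data): the mean is Σf·Xm / n, and the sample variance is [n·Σf·Xm² - (Σf·Xm)²] / [n(n - 1)], where Xm is the class midpoint. Both the choice of data and the formulas are assumptions.

```python
# (lower limit, upper limit, frequency) for the 7-class temperature table.
classes = [(100, 104, 2), (105, 109, 8), (110, 114, 18), (115, 119, 13),
           (120, 124, 7), (125, 129, 1), (130, 134, 1)]

n = sum(f for _, _, f in classes)
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)          # sum of f * Xm
sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)  # sum of f * Xm^2

grouped_mean = sum_fx / n
sample_variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
population_variance = (n * sum_fx2 - sum_fx ** 2) / n ** 2

# Modal class: the class with the highest frequency.
modal_class = max(classes, key=lambda c: c[2])[:2]

# Median class: the first class whose cumulative frequency reaches n / 2.
running = 0
for lo, hi, f in classes:
    running += f
    if running >= n / 2:
        median_class = (lo, hi)
        break

print(grouped_mean, round(sample_variance, 2), round(population_variance, 2))
print("modal class:", modal_class, "median class:", median_class)
```

Grouped results are approximations, because every value in a class is treated as if it sat at the class midpoint.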
53
2.3 Exploratory Data Analysis
  • The purpose of exploratory data analysis is to
    examine data in order to find out what
    information can be discovered.
    For example
  • Are there any gaps in the data?
  • Can any patterns be discerned?

54
Boxplots
  • Boxplots are graphical representations of a
    five-number summary of a data set and outliers.
    The five specific values that make up a
    five-number summary are
  • The lowest value of data set (minimum)
  • Q1 (or 25th percentile)
  • The median (or 50th percentile)
  • Q3 (or 75th percentile)
  • The highest value of data set (maximum)

55
Steps to Construct a Boxplot
  • STEP1 Arrange the data in order
  • STEP2 Find the median
  • STEP3 Find Q1 and Q3
  • STEP4 Find outliers
  • Points lying more than 1.5 times the
    interquartile range above Q3 or below Q1
  • STEP5 Draw a scale for the data on the x axis.
  • STEP6 Locate the lowest value, Q1, the median,
    Q3, the highest value and outliers on the scale.
  • STEP7 Draw a box around Q1 and Q3, draw a
    vertical line through the median, and connect the
    upper and lower values to the box

Example 1: 9 22 11 14 13 3 7 15 18 16
Example 2: 19 2 1 7 5 8 6
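The numeric part of these steps (everything before the drawing) can be sketched as follows. Quartiles are computed here as the medians of the lower and upper halves of the ordered data, leaving out the middle value when n is odd; other quartile conventions exist, so this choice is an assumption.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)                   # STEP1
    n = len(ordered)
    lower = ordered[: n // 2]                  # values below the median position
    upper = ordered[(n + 1) // 2:]             # values above the median position
    return (ordered[0],
            statistics.median(lower),          # Q1
            statistics.median(ordered),        # STEP2
            statistics.median(upper),          # Q3
            ordered[-1])

def outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3 (STEP4)."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [x for x in sorted(values) if x < q1 - fence or x > q3 + fence]

example_1 = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]
example_2 = [19, 2, 1, 7, 5, 8, 6]

print(five_number_summary(example_1), outliers(example_1))
print(five_number_summary(example_2), outliers(example_2))
```

With this convention Example 1 has no outliers, while in Example 2 the value 19 lies above Q3 + 1.5 × IQR = 8 + 9 = 17 and would be marked separately on the boxplot.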
57
CONCLUSIONS
  • By combining the techniques discussed in this
    chapter, the student is now able to collect,
    organize, summarize and present data.