DESCRIPTIVE STATISTICS presentation

1
DESCRIPTIVE STATISTICS

BCT2053
CHAPTER TWO

2
CONTENT

2.1 Statistical Tables and Charts
2.2 Summary Statistics (Data Description)
Measures of Central Tendency
Measures of Variation
Measures of Position
2.3 Exploratory Data Analysis

3
OBJECTIVES
At the end of this chapter, you should be able to

Organize data using frequency distributions.
Represent data in frequency distributions
graphically using histograms, frequency polygons,
and ogives.
Represent data using Pareto charts, time series
graphs, and pie graphs.
Draw and interpret a stem and leaf plot.
Summarize data using measures of central
tendency, such as the mean, median, mode, and
midrange.
Describe data using measures of variation, such
as the range, variance, and standard deviation.
Identify the position of a data value in a data
set, using various measures of position, such as
percentiles, deciles, and quartiles.
Use the techniques of exploratory data analysis,
including boxplots and five-number summaries, to
discover various aspects of data.

4
2.1 Statistical Tables and Charts

A. The raw data
Data as originally collected from any source,
before any organization or processing.
B. The data array
An arrangement of data items in either
ascending (lowest to highest) or descending
(highest to lowest) order.
C. Frequency Tables (Frequency Distribution)
Group data items into classes and then record
the number of items that appear in each class.
D. Charts and Graphs
Represent data in frequency distributions
graphically using histograms, frequency polygons,
and ogives.
Represent data using Pareto charts, time series
graphs, and pie graphs

5
Example of Frequency distribution (general)
lower class limit
upper class limit
6

The lower class limit represents the smallest
data value that can be included in the class.
The upper class limit represents the largest
value that can be included in the class.
The class boundaries are used to separate the
classes so that there are no gaps in the
frequency distribution.
Rule of Thumb: Class limits should have the same
decimal place value as the data, but the class
boundaries have one additional place value and
end in a 5.
The class width for a class in a frequency
distribution is found by subtracting the lower
(or upper) class limit of one class from the
lower (or upper) class limit of the next class.
The class midpoint is found by adding the lower
and upper boundaries (or limits) and dividing by
2.
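The definitions above can be sketched in a few lines of Python. The classes 100-104 and 105-109 and the function names are illustrative assumptions, not part of the original slides; the data are assumed to be whole numbers, so the boundaries sit half a unit outside the limits.

```python
def class_boundaries(lower_limit, upper_limit, unit=1):
    """Return (lower boundary, upper boundary) for one class.

    `unit` is the precision of the data (1 for whole numbers).
    Half a unit is subtracted from the lower limit and added to the
    upper limit so that adjacent classes meet with no gap.
    """
    half = unit / 2
    return lower_limit - half, upper_limit + half


def class_midpoint(lower_limit, upper_limit):
    """Midpoint: lower limit plus upper limit, divided by 2."""
    return (lower_limit + upper_limit) / 2


# Hypothetical neighbouring classes: 100-104 and 105-109
boundaries = class_boundaries(100, 104)   # (99.5, 104.5)
width = 105 - 100                         # lower limit of next class minus this one
midpoint = class_midpoint(100, 104)       # 102.0
print(boundaries, width, midpoint)
```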

Frequency Distribution (Frequency Table): How to
Construct One (in General)
1. Determine the number of classes that will be used
to group the data.
a. Minimum 5, maximum 20
b. The actual number depends on factors such as
i. The number of observations being grouped
ii. The purpose of the distribution
iii. The arbitrary preferences of the analyst
c. Use classes that give a good view of
the data pattern and enable you
to gain insights into the information that is
there
d. All data items, from the smallest to the
largest, must be included
e. Each item must be assigned to one and only one
class
2. Determine the width (class interval) of
these classes
a. The widths should be equal
b. Width = range / number of classes, rounded up
c. Whenever possible an open-ended class interval
(one with an unspecified upper or lower

8
Types of Frequency Distribution

Ungrouped Frequency Distribution
Used for numerical data
The range of data is small

Categorical Frequency Distribution
Used for data that can be placed in specific
categories such as nominal or ordinal level data

Grouped Frequency Distribution
Used for numerical data too
The range of the data is large

9
Example Categorical Frequency Distribution
Twenty-five army inductees were given a blood
test to determine their blood type. The data set
is
A B B AB O O O B AB B B B O A O
A O O O AB AB A O B A
Construct a frequency distribution for the data.
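One possible way to tally this example is sketched below using Python's `collections.Counter`. The variable names and the percentage column are illustrative additions, not part of the slide.

```python
from collections import Counter

# Blood types of the 25 inductees, as listed in the example
blood_types = ("A B B AB O O O B AB B B B O A O "
               "A O O O AB AB A O B A").split()

freq = Counter(blood_types)   # class -> frequency
n = len(blood_types)          # total number of observations

print("Class  Frequency  Percent")
for cls in ("A", "B", "O", "AB"):
    print(f"{cls:<6} {freq[cls]:>9}  {100 * freq[cls] / n:>6.0f}%")
```

The frequencies should add up to the number of observations, which is a quick check on the tally.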
10
Constructing an Ungrouped or Grouped Frequency
Distribution
STEP 1 Determine the classes.
- Find the highest and lowest value.
- Find the range.
- Select the number of classes desired.
- Find the width by dividing the range by the
number of classes and rounding up.
- Select a starting point (usually the lowest
value or any convenient number less than the
lowest value); add the width to get the lower
limits.
- Find the upper class limits.
- Find the boundaries.
STEP 2 Tally the data.
STEP 3 Find the numerical frequencies from the
tallies.
STEP 4 Find the cumulative frequencies.
11
Example Ungrouped Frequency Distribution
The data shown here represent the number of miles
per gallon that 30 selected four-wheel-drive
sports utility vehicles obtained in city driving.
Construct a frequency distribution.
12 17 12 14 16 18 16 18 12 16
17 15 15 16 12 15 16 16 12 14
15 12 15 15 19 13 16 18 16 14
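Because the range here is small (12 to 19), each distinct value can serve as its own class. The sketch below is one way to tally the frequencies and cumulative frequencies; the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Miles per gallon for the 30 vehicles in the example
mpg = [12, 17, 12, 14, 16, 18, 16, 18, 12, 16,
       17, 15, 15, 16, 12, 15, 16, 16, 12, 14,
       15, 12, 15, 15, 19, 13, 16, 18, 16, 14]

counts = Counter(mpg)
classes = list(range(min(mpg), max(mpg) + 1))   # one class per value, 12..19
frequencies = [counts[c] for c in classes]
cumulative = list(accumulate(frequencies))      # running totals

print("Class  Frequency  Cumulative")
for c, f, cf in zip(classes, frequencies, cumulative):
    print(f"{c:>5}  {f:>9}  {cf:>10}")
```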
12
Example Grouped Frequency Distribution

These data represent the record high
temperatures for each of the 50 states.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Construct a grouped frequency distribution for
the data using 7 classes.
Organize the data into a frequency distribution
table with 5 classes. Use 100 to <107 for the
first class.
Construct a frequency distribution for these
data.
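A minimal sketch of the 7-class version follows. It assumes whole-number data, starts the first class at the lowest value, and always rounds the width up to the next whole number (so an exact quotient is also bumped up by one, which guarantees the highest value fits). The function name `grouped_frequency` is illustrative.

```python
def grouped_frequency(data, num_classes):
    """Return (class limits, frequencies) for whole-number data."""
    low, high = min(data), max(data)
    # Range divided by number of classes, rounded up to the next whole number
    width = (high - low) // num_classes + 1
    freq = [0] * num_classes
    for x in data:
        freq[(x - low) // width] += 1          # index of the class containing x
    limits = [(low + i * width, low + (i + 1) * width - 1)
              for i in range(num_classes)]
    return limits, freq


temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

limits, freq = grouped_frequency(temps, 7)
for (lo, hi), f in zip(limits, freq):
    print(f"{lo}-{hi}: {f}")
```

With these assumptions the range is 134 - 100 = 34, and 34 / 7 rounds up to a width of 5, giving classes 100-104 through 130-134.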

13
Why Construct Frequency Distributions?
To organize the data in a meaningful,
intelligible way.
To enable the reader to make comparisons among
different data sets.
To facilitate computational procedures for
measures of average and spread.
To enable the reader to determine the nature or
shape of the distribution.
To enable the researcher to draw charts and
graphs for the presentation of data.
14
Types of Graphs and Charts

The purpose of graphs in statistics is to convey
the data to the viewer in pictorial form.
Graphs are useful in getting the audience's
attention in a publication or a presentation.

16
A. Histogram, Frequency Polygon, Ogive

Histogram
A graph that displays the data by using vertical
bars of various heights to represent the
frequencies

Frequency Polygon
A graph that displays the data by using lines
that connect points plotted for the frequencies
at the midpoints of the classes. The frequencies
are represented by the heights of the points.

Ogive (Cumulative Frequency Graph)
A graph that represents the cumulative
frequencies for the classes in a frequency
distribution

19
Procedure to Construct a Histogram, Frequency
Polygon, and Ogive

STEP 1 Draw and label the x and y axes.
STEP 2 Choose a suitable scale for the
frequencies or cumulative frequencies, and label
it on the y axis.
STEP 3 Represent the class boundaries for the
histogram or ogive, or the midpoint for the
frequency polygon, on the x axis.
STEP 4 Plot the points and then draw the bars or
lines.

20
Example
These data represent the record high temperatures
for each of the 50 states. Construct a grouped
frequency distribution for the data using 7
classes. Then, construct a histogram, frequency
polygon, and ogive for these data.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
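The sketch below computes the coordinates each of the three graphs would plot, without drawing them. It assumes seven classes of width 5 starting at 100 (100-104 through 130-134) and whole-number data, so boundaries lie 0.5 outside the limits; these choices and the variable names are assumptions for illustration.

```python
from itertools import accumulate

temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

low, width, k = 100, 5, 7                 # assumed start, class width, class count
freq = [0] * k
for t in temps:
    freq[(t - low) // width] += 1

lower_bounds = [low + i * width - 0.5 for i in range(k)]
upper_bounds = [b + width for b in lower_bounds]
midpoints = [(lo + up) / 2 for lo, up in zip(lower_bounds, upper_bounds)]
cumulative = list(accumulate(freq))

# Histogram: one bar per class, spanning its boundaries, height = frequency
histogram_bars = list(zip(lower_bounds, upper_bounds, freq))
# Frequency polygon: points at (class midpoint, frequency)
polygon_points = list(zip(midpoints, freq))
# Ogive: starts at zero, then (upper boundary, cumulative frequency)
ogive_points = [(lower_bounds[0], 0)] + list(zip(upper_bounds, cumulative))

for mid, f in polygon_points:
    print(f"{mid:>6}  {'*' * f}")         # rough text version of the histogram
```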
21
Distribution Shapes

Bell Shaped
Has a single peak and tapers off at either end
Approximately symmetric
It is roughly the same on both sides of a
line running through the center
J-Shaped
Has few data values on the left side, and the
frequency increases as one moves to the right

Uniform
Basically flat or rectangular
Reverse J-Shaped
Opposite of J-shaped
Has few data values on the right side, and the
frequency increases as one moves to the left

22
Distribution Shapes

Right Skewed
The peak is to the left
The data values taper off to the right
Bimodal
Has two peaks of the same height

Left Skewed
The peak is to the right
The data values taper off to the left
U-Shaped
The shape is a U

23
B. Pareto Chart

Used to represent a frequency distribution for a
categorical variable. The frequencies are
displayed by the heights of vertical bars, which
are arranged in order from highest to lowest.

24
Example
Twenty-five army inductees were given a blood
test to determine their blood type. The data set
is
A B B AB O O O B AB B B B O A O
A O O O AB AB A O B A
Construct a Pareto chart for the data.
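The key preparation step for a Pareto chart is ordering the categories from highest to lowest frequency. A minimal sketch, using `Counter.most_common` and a text rendering in place of a real chart (both are illustrative choices):

```python
from collections import Counter

blood_types = ("A B B AB O O O B AB B B B O A O "
               "A O O O AB AB A O B A").split()

# most_common() returns (category, frequency) pairs, highest frequency first
pareto_order = Counter(blood_types).most_common()

for category, f in pareto_order:
    print(f"{category:<3} {'#' * f} {f}")   # bar length stands in for bar height
```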
25
C. Time Series Graph

Represents data that occur over a specified
period of time
STEP 1 Draw and label the x and y axes.
STEP 2 Label the x axis for years and the y
axis for the number of theaters.
STEP 3 Plot each point according to the
table.
STEP 4 Draw line segments connecting
adjacent points. Do not try to fit a
smooth curve through the data points.
We look for a trend or pattern that occurs over
the time period (ascending or descending) and at
the slope or steepness of the line (how rapidly
the values increase or decrease).

Two time series can be drawn on the same graph
for comparison (compound time series graph).
26
Example
In 1958, there were more than 4000 outdoor
drive-in theaters. The number of these theaters
has changed over the years. Draw a time series
graph for the data and summarize the findings.
Year  Number
1988  1497
1990  910
1992  870
1994  859
1996  826
1998  750
2000  637
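Before drawing the graph, the trend and the steepness of each segment can be read off from the change between adjacent points. This is a small sketch under that idea; the variable names are illustrative.

```python
years = [1988, 1990, 1992, 1994, 1996, 1998, 2000]
theaters = [1497, 910, 870, 859, 826, 750, 637]

# Change in the number of theaters between adjacent points
changes = [b - a for a, b in zip(theaters, theaters[1:])]

for start, end, delta in zip(years, years[1:], changes):
    print(f"{start}-{end}: {delta:+d}")

# Every change is negative, so the series descends over the whole period
always_decreasing = all(delta < 0 for delta in changes)
steepest = min(changes)   # largest drop
print(always_decreasing, steepest)
```

The largest drop falls between 1988 and 1990, which is where the line would be steepest.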
27
D. Pie Chart
A pie graph is a circle that is divided into
sections or wedges according to the percentage of
frequencies in each category of the
distribution. The purpose of the pie graph is to
show the relationship of the parts to the whole
by visually comparing the sizes of the sectors.
Percentages or proportions can be used. The
variable is nominal or categorical.
28
Example
Twenty-five army inductees were given a blood
test to determine their blood type. The data set
is
A B B AB O O O B AB B B B O A O
A O O O AB AB A O B A
Construct a pie chart for the data.
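Each sector of the pie is proportional to its category's share of the total, so a sector's angle is that share of 360 degrees. The sketch below computes the percentage and angle per blood type; the names are illustrative.

```python
from collections import Counter

blood_types = ("A B B AB O O O B AB B B B O A O "
               "A O O O AB AB A O B A").split()

freq = Counter(blood_types)
n = sum(freq.values())

# category -> (percent of total, sector angle in degrees)
slices = {cat: (100 * f / n, round(360 * f / n, 1)) for cat, f in freq.items()}

for cat, (percent, degrees) in slices.items():
    print(f"{cat:<3} {percent:5.1f}%  {degrees:6.1f} degrees")
```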
29
Stem-and-Leaf Plots
Stem leaf

A stem-and-leaf plot is a data plot that uses
part of a data value as the stem (the leading
digit) and part of the data value as the leaf
(the trailing digit) to form groups or classes.
It has the advantage over grouped frequency
distribution of retaining the actual data while
showing them in graphic form.
Sometimes we can construct a back-to-back
(mixture) stem and leaf plot, laid out as

Leaf | Stem | Leaf
30
Example
An insurance company researcher conducted a
survey on the number of car thefts in a large
city for a period of 30 days last summer. The raw
data are shown below. Construct a stem and leaf
plot.
52 62 51 50 69 58 77 66 53 57
75 56 55 67 73 79 59 68 65 72
57 51 63 69 75 65 53 78 66 55
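For two-digit data like these, the tens digit is the stem and the units digit is the leaf. A minimal sketch of building the plot (variable names are illustrative):

```python
from collections import defaultdict

thefts = [52, 62, 51, 50, 69, 58, 77, 66, 53, 57,
          75, 56, 55, 67, 73, 79, 59, 68, 65, 72,
          57, 51, 63, 69, 75, 65, 53, 78, 66, 55]

plot = defaultdict(list)
for value in sorted(thefts):          # sorting first keeps the leaves in order
    stem, leaf = divmod(value, 10)    # tens digit, units digit
    plot[stem].append(leaf)

for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Every original value can be read back from the plot, which is the advantage over a grouped frequency distribution.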
31
Example

The data shown represents the percentage of
unemployed males and females in 1995 for a sample
of countries of the world. Using the whole
numbers as stems and the decimals as leaves,
construct a back-to-back (mixture) stem and leaf
plot and compare the distribution of the two
groups.
Females              Males
8.0 3.7 8.6 5.0      8.8 1.9 5.6 4.6
7.0 3.3 8.6 3.2      1.5 6.6 5.6 0.3
8.8 6.8 9.2 5.9      2.2 5.6 3.1 5.9
7.2 4.6 5.6 5.3      9.8 8.7 6.0 5.2
7.7 8.0 8.7 0.5      4.4 9.6 6.6 6.0
6.5 3.4 3.0 9.4      4.6 3.1 4.1 7.7

32
Conclusions (2.1)

Data can be organized in some meaningful way
using frequency distributions. Once the frequency
distribution is constructed, the representation
of the data by graphs is a simple task.

33
2.2 Summary Statistics (Data Description)

Statistical methods can be used to summarize
data.
Measures of average are also called measures of
central tendency and include the mean, median,
mode, and midrange.
Measures that determine the spread of data values
are called measures of variation or measures of
dispersion and include the range, variance, and
standard deviation.
Measures of position tell where a specific data
value falls within the data set or its relative
position in comparison with other data values.
The most common measures of position are
percentiles, deciles, and quartiles.
The measures of central tendency, variation, and
position are part of what is called traditional
statistics. This type of data is typically used
to confirm conjectures about the data

Measures of Central Tendency

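Mean: the sum of the values divided by the total
number of values.
Population Mean: μ = ΣX / N, where N is the number
of values in the population
Sample Mean: x̄ = ΣX / n, where n is the number
of values in the sample
Example: 9 2 1 4 3 3 7 5 8 6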
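As a quick check of the example, the mean can be computed directly from its definition. The sketch treats the ten values as the whole data set; the arithmetic is the same for a population or a sample mean.

```python
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

total = sum(data)        # sum of the values: 48
n = len(data)            # number of values: 10
mean = total / n         # 48 / 10

print(mean)  # 4.8
```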
35
Mean

One computes the mean by using all the values of
the data.
The mean varies less than the median or mode when
samples are taken from the same population and
all three measures are computed for these
samples.
The mean is used in computing other statistics,
such as variance.
The mean for the data set is unique, and not
necessarily one of the data values.
The mean cannot be computed for an open-ended
frequency distribution.
The mean is affected by extremely high or low
values and may not be the appropriate average to
use in these situations

Measures of Central Tendency

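Median: the middle number of n ordered data values
(smallest to largest)
If n is odd: the median is the value in position
(n + 1)/2
If n is even: the median is the average of the
values in positions n/2 and n/2 + 1
Example (n even): 9 2 1 4 3 3 7 5 8 6
Example (n odd): 9 2 1 3 3 7 5 8 6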
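The two cases can be sketched as follows. The function name `median_of` is illustrative, and the result is compared against Python's `statistics.median` as a sanity check.

```python
from statistics import median


def median_of(data):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    values = sorted(data)
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]                      # odd n: single middle value
    return (values[mid - 1] + values[mid]) / 2  # even n: average the middle pair


even_example = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]   # n = 10
odd_example = [9, 2, 1, 3, 3, 7, 5, 8, 6]       # n = 9

print(median_of(even_example))  # 4.5
print(median_of(odd_example))   # 5
assert median_of(even_example) == median(even_example)
assert median_of(odd_example) == median(odd_example)
```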
37
Median

The median is used when one must find the center
or middle value of a data set.
The median is used when one must determine
whether the data values fall into the upper half
or lower half of the distribution.
The median is used to find the average of an
open-ended distribution.
The median is affected less than the mean by
extremely high or extremely low values.

Measures of Central Tendency

Mode: the most commonly occurring value in a data
series

The mode is used when the most typical case is
desired.
The mode is the easiest average to compute.
The mode can be used when the data are nominal,
such as religious preference, gender, or
political affiliation.
The mode is not always unique. A data set can
have more than one mode, or the mode may not
exist for a data set.

Example 9 2 1 4 3 3 7 5 8 6
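A short sketch using `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest count. The second and third data sets are hypothetical, added only to show the "more than one mode" and "no mode" cases mentioned above.

```python
from statistics import multimode

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
modes = multimode(data)
print(modes)  # [3]

# Hypothetical data with two modes
two_modes = multimode([1, 1, 2, 2, 3])

# Hypothetical data where every value occurs once: multimode returns
# all of them, which corresponds to saying the data set has no mode
no_repeats = multimode([1, 2, 3])
print(two_modes, no_repeats)
```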
39

Measures of Central Tendency

Midrange: a rough estimate of the middle, found by
adding the lowest and highest values and dividing
by 2. It is also a very rough estimate of the
average and can be affected by one extremely high
or low value.
Example: 9 2 1 4 3 3 7 5 8 6
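The sketch below computes the midrange of the example and then shows its sensitivity to a single extreme value. The added value 100 is hypothetical and is not part of the slide's data.

```python
def midrange(data):
    """Lowest value plus highest value, divided by 2."""
    return (min(data) + max(data)) / 2


data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
mr = midrange(data)
print(mr)  # 5.0

# One hypothetical extreme value shifts the midrange drastically
mr_with_extreme = midrange(data + [100])
print(mr_with_extreme)  # 50.5
```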
40
Types of Distribution
Symmetric
Positively skewed or right-skewed
Negatively skewed or left-skewed
41

Measures of Variation / Dispersion

Used when a measure of central tendency alone is
not informative or not needed (e.g., the means of
two data sets are the same)
Gauges the variability that exists in a
data set
Helps form a judgment about how well the average
value depicts the data
Shows the extent of the scatter so that steps
may be taken to control the existing variation

Measures of Variation / Dispersion

Range: the difference between the highest
value and the lowest value in a data set. The
symbol R is used for the range.
R = highest value - lowest value
Example: 9 2 1 4 3 3 7 5 8 6
43

Measures of Variation / Dispersion

Variance: the average of the squares of the
distance each value is from the mean.
Population Variance: σ² = Σ(X - μ)² / N
Sample Variance: s² = Σ(X - x̄)² / (n - 1)
Standard Deviation: the square root of the
variance
Population standard deviation: σ = √σ²
Sample standard deviation: s = √s²
Example: 9 2 1 4 3 3 7 5 8 6
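The example can be worked through both ways, as a sketch: treating the ten values as a whole population (divide by N) and as a sample (divide by n - 1). The results are checked against Python's `statistics` module.

```python
import math
from statistics import pvariance, variance

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
n = len(data)
mean = sum(data) / n                                   # 4.8

# Sum of squared distances from the mean
sum_sq = sum((x - mean) ** 2 for x in data)            # about 63.6

pop_var = sum_sq / n                                   # population variance
samp_var = sum_sq / (n - 1)                            # sample variance
pop_sd = math.sqrt(pop_var)                            # population standard deviation
samp_sd = math.sqrt(samp_var)                          # sample standard deviation

assert math.isclose(pop_var, pvariance(data))
assert math.isclose(samp_var, variance(data))
print(round(pop_var, 2), round(samp_var, 2), round(pop_sd, 2), round(samp_sd, 2))
```

The sample values are slightly larger than the population values because the divisor is n - 1 rather than N.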
44
Variance and Standard Deviation

Variances and standard deviations can be used to
determine the spread of the data. If the variance
or standard deviation is large, the data are more
dispersed. The information is useful in comparing
two or more data sets to determine which is more
variable.
The measures of variance and standard deviation
are used to determine the consistency of a
variable.
The variance and standard deviation are used to
determine the number of data values that fall
within a specified interval in a distribution.
The variance and standard deviation are used
quite often in inferential statistics.

45
Chebychev theorem
46
Describing the position of the data value

Measures of Position

Percentile
Quartile
Deciles
Example 1: 9 2 1 4 3 3 7 5 8 6
Example 2: 19 2 1 4 3 3 7 5 8 6
47
Example

Given
9 2 1 4 3 7 5 4 6
What percentile is the value of 8?
Given
9 22 11 14 13 3 7 15 18 16
What percentile is the value of 20?
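The slides do not show the percentile formula, so the sketch below assumes a common textbook definition: percentile = (number of values below X + 0.5) / n × 100. Note also that 8 and 20 do not appear in the listed data; the sketch simply counts the values below them, so treat the results as illustrative rather than as the slide's intended answers.

```python
def percentile_rank(data, x):
    """Assumed formula: (number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in data if v < x)
    return 100 * (below + 0.5) / len(data)


rank_of_8 = percentile_rank([9, 2, 1, 4, 3, 7, 5, 4, 6], 8)
rank_of_20 = percentile_rank([9, 22, 11, 14, 13, 3, 7, 15, 18, 16], 20)

print(round(rank_of_8), round(rank_of_20))
```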

48
Outliers

An outlier is an extremely high or an extremely
low data value when compared with the rest of the
data values.
Outliers can be the result of measurement or
observational error.
When a distribution is normal or bell-shaped,
data values that are beyond three standard
deviations of the mean can be considered
suspected outliers.

Example: 9 22 11 14 13 3 7 15 18 16
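The three-standard-deviation check can be sketched as below. It assumes the data are a sample (so the sample standard deviation is used) and that the bell-shaped condition is reasonable, which ten values cannot really confirm.

```python
from statistics import mean, stdev

data = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]

center = mean(data)                 # 12.8
spread = stdev(data)                # sample standard deviation (assumption)
low = center - 3 * spread
high = center + 3 * spread

# Values beyond three standard deviations of the mean are suspected outliers
suspected = [x for x in data if x < low or x > high]

print(round(low, 1), round(high, 1), suspected)
```

For this example no value falls outside the interval, so nothing is flagged.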
49
The measures of central tendency, variation, and
position for Grouped data
measures of central tendency
Mean Class
Median class
Mode class
50
measures of Variation
Population variance
Sample variance
51
measures of position
Quartile
Decile
Percentile
52
Example
Find the mean, median class, mode class,
population and sample variance, quartile, decile,
and percentile.
53
2.3 Exploratory Data Analysis

The purpose of exploratory data analysis is to
examine data in order to find out what
information can be discovered.
For example
Are there any gaps in the data?
Can any patterns be discerned?

54
Boxplots

Boxplots are graphical representations of a
five-number summary of a data set and outliers.
The five specific values that make up a
five-number summary are
The lowest value of data set (minimum)
Q1 (or 25th percentile)
The median (or 50th percentile)
Q3 (or 75th percentile)
The highest value of data set (maximum)

55
Steps to Construct a Boxplot

STEP 1 Arrange the data in order.
STEP 2 Find the median.
STEP 3 Find Q1 and Q3.
STEP 4 Find outliers: points lying more than 1.5
times the interquartile range above Q3 or below Q1.
STEP 5 Draw a scale for the data on the x axis.
STEP 6 Locate the lowest value, Q1, the median,
Q3, the highest value, and outliers on the scale.
STEP 7 Draw a box around Q1 and Q3, draw a
vertical line through the median, and connect the
upper and lower values.

Example 1: 9 22 11 14 13 3 7 15 18 16
Example 2: 19 2 1 7 5 8 6
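Steps 1 to 4 can be sketched for Example 1 as follows. The slides do not say how Q1 and Q3 are computed, so this assumes the common classroom method: Q1 is the median of the lower half of the ordered data and Q3 is the median of the upper half (the overall median is left out of both halves when n is odd). Other quartile conventions give slightly different values.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    values = sorted(data)                       # STEP 1
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return values[0], median(lower), median(values), median(upper), values[-1]


def iqr_outliers(data):
    """Points more than 1.5 * IQR below Q1 or above Q3 (STEP 4)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]


example1 = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]
summary = five_number_summary(example1)
outliers = iqr_outliers(example1)
print(summary, outliers)
```

Here the interquartile range is 16 - 9 = 7, so the fences are -1.5 and 26.5 and no point is flagged as an outlier.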
57
CONCLUSIONS

By combining the techniques discussed in this
chapter, the student is now able to collect,
organize, summarize, and present data.
