Title: DESCRIPTIVE STATISTICS
CONTENT
- 2.1 Statistical Tables and Charts
- 2.2 Summary Statistics (Data Description)
- Measures of Central Tendency
- Measures of Variation
- Measures of Position
- 2.3 Exploratory Data Analysis
OBJECTIVES
At the end of this chapter, you should be able to:
- Organize data using frequency distributions.
- Represent data in frequency distributions graphically using histograms, frequency polygons, and ogives.
- Represent data using Pareto charts, time series graphs, and pie graphs.
- Draw and interpret a stem-and-leaf plot.
- Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
- Describe data using measures of variation, such as the range, variance, and standard deviation.
- Identify the position of a data value in a data set, using various measures of position, such as percentiles, deciles, and quartiles.
- Use the techniques of exploratory data analysis, including boxplots and five-number summaries, to discover various aspects of data.
2.1 Statistical Tables and Charts
- A. The raw data
- Freshly collected data from any source.
- B. The data array
- An arrangement of data items in either ascending (lowest to highest) or descending (highest to lowest) order.
- Frequency Tables (Frequency Distributions)
- Group data items into classes and then record the number of items that appear in each class.
- Charts and Graphs
- Represent data in frequency distributions graphically using histograms, frequency polygons, and ogives.
- Represent data using Pareto charts, time series graphs, and pie graphs.
Example of a Frequency Distribution (general)
[Table not transcribed; its labels marked the lower class limit and the upper class limit of each class.]
- The lower class limit represents the smallest data value that can be included in the class.
- The upper class limit represents the largest value that can be included in the class.
- The class boundaries are used to separate the classes so that there are no gaps in the frequency distribution.
- Rule of thumb: class limits should have the same decimal place value as the data, but the class boundaries have one additional place value and end in a 5.
- The class width for a class in a frequency distribution is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.
- The class midpoint is found by adding the lower and upper boundaries (or limits) and dividing by 2.
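The definitions above can be checked with a short calculation. The following is a minimal sketch, assuming whole-number data and three hypothetical classes (100-104, 105-109, 110-114) chosen only for illustration; the helper names are not from the original slides.

```python
# Hypothetical class limits for whole-number data (illustration only).
classes = [(100, 104), (105, 109), (110, 114)]


def class_boundaries(lower, upper, unit=1):
    """Extend each limit by half a unit so adjacent classes meet with no gap."""
    half = unit / 2
    return lower - half, upper + half


def class_width(this_class, next_class):
    """Lower limit of the next class minus the lower limit of this class."""
    return next_class[0] - this_class[0]


def class_midpoint(lower, upper):
    """Average of the lower and upper limits (boundaries give the same result)."""
    return (lower + upper) / 2


boundaries = [class_boundaries(lo, hi) for lo, hi in classes]
width = class_width(classes[0], classes[1])
midpoints = [class_midpoint(lo, hi) for lo, hi in classes]

print(boundaries)  # boundaries carry one extra decimal place and end in 5
print(width, midpoints)
```

Because the data are whole numbers, the boundaries sit half a unit outside the limits, which matches the rule of thumb that boundaries end in a 5.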
- Frequency distribution (frequency table): how to construct one (in general)
- 1. Determine the number of classes that will be used to group the data.
- a. Minimum 5, maximum 20
- b. The actual number depends on factors such as
- i. The number of observations being grouped
- ii. The purpose of the distribution
- iii. The arbitrary preferences of the analyst
- c. Use classes that give a good view of the data pattern and enable you to gain insights into the information that is there.
- d. All data items, from the smallest to the largest, must be included.
- e. Each item must be assigned to one and only one class.
- 2. Determine the width (class interval) of these classes.
- a. The widths should be equal.
- b. Width = range / number of classes
- c. Whenever possible an open-ended class interval
(one with an unspecified upper or lower
Types of Frequency Distribution
- Ungrouped Frequency Distribution
- Used for numerical data
- The range of data is small
- Categorical Frequency Distribution
- Used for data that can be placed in specific
categories such as nominal or ordinal level data
- Grouped Frequency Distribution
- Used for numerical data too
- The range of the data is large
Example: Categorical Frequency Distribution
Twenty-five army inductees were given a blood test to determine their blood type. The data set is:
A B B AB O O O B AB B B B O
A O A O O O AB AB A O B A
Construct a frequency distribution for the data.
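One way to tally the blood-type example is sketched below. It is an illustrative sketch using Python's standard `collections.Counter`; the variable names and the extra percent column are assumptions, not part of the original example.

```python
from collections import Counter

# Blood types of the 25 inductees, as listed in the example.
blood_types = ("A B B AB O O O B AB B B B O "
               "A O A O O O AB AB A O B A").split()

counts = Counter(blood_types)   # tally each category
total = sum(counts.values())    # number of observations

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for blood_type in ["A", "B", "O", "AB"]:
    freq = counts[blood_type]
    print(f"{blood_type:<6}{freq:>10}{100 * freq / total:>8.0f}%")
```

The frequencies should add up to the number of observations, which is a quick check that every item was assigned to exactly one class.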
Constructing a Grouped Frequency Distribution
- STEP 1 Determine the classes.
- Find the highest and lowest values.
- Find the range.
- Select the number of classes desired.
- Find the width by dividing the range by the number of classes and rounding up.
- Select a starting point (usually the lowest value or any convenient number less than the lowest value); add the width to get the lower limits.
- Find the upper class limits.
- Find the boundaries.
- STEP 2 Tally the data.
- STEP 3 Find the numerical frequencies from the tallies.
- STEP 4 Find the cumulative frequencies.
Example: Ungrouped Frequency Distribution
The data shown here represent the number of miles per gallon that 30 selected four-wheel-drive sports utility vehicles obtained in city driving. Construct a frequency distribution.
12 17 12 14 16 18 16 18 12 16
17 15 15 16 12 15 16 16 12 14
15 12 15 15 19 13 16 18 16 14
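Because the range of these data is small (12 to 19), each distinct value can serve as its own class. The sketch below is one possible way to build the table, with cumulative frequencies added; the variable names are illustrative assumptions.

```python
from collections import Counter
from itertools import accumulate

# Miles per gallon for the 30 vehicles in the example.
mpg = [12, 17, 12, 14, 16, 18, 16, 18, 12, 16,
       17, 15, 15, 16, 12, 15, 16, 16, 12, 14,
       15, 12, 15, 15, 19, 13, 16, 18, 16, 14]

counts = Counter(mpg)
classes = list(range(min(mpg), max(mpg) + 1))   # one class per value, 12 to 19
freqs = [counts[c] for c in classes]            # frequency of each value
cumulative = list(accumulate(freqs))            # running total of frequencies

print(f"{'Class':<6}{'Freq':>5}{'Cum. freq':>11}")
for c, f, cf in zip(classes, freqs, cumulative):
    print(f"{c:<6}{f:>5}{cf:>11}")
```

The last cumulative frequency equals the number of observations (30).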
Example: Grouped Frequency Distribution
- These data represent the record high temperatures for each of the 50 states.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
- Construct a grouped frequency distribution for the data using 7 classes.
- Organize the data into a frequency distribution table with 5 classes. Use 100 to <107 for the first class.
- Construct a frequency distribution for these data.
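The 7-class version of this exercise can be worked through as follows. This is a sketch under two assumptions: the lowest value is used as the starting point, and upper limits are one unit below the next lower limit because the data are whole numbers.

```python
import math
from itertools import accumulate

# Record high temperatures for the 50 states, as listed in the example.
temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

num_classes = 7
low, high = min(temps), max(temps)
data_range = high - low                        # highest value minus lowest value
width = math.ceil(data_range / num_classes)    # divide, then round up

# Lower limits start at the lowest value and step by the width.
classes = [(low + i * width, low + (i + 1) * width - 1)
           for i in range(num_classes)]
freqs = [sum(lo <= t <= hi for t in temps) for lo, hi in classes]
cumulative = list(accumulate(freqs))

for (lo, hi), f, cf in zip(classes, freqs, cumulative):
    print(f"{lo}-{hi}  boundaries {lo - 0.5}-{hi + 0.5}  freq {f:>2}  cum {cf:>2}")
```

Note that `math.ceil` only rounds up when the quotient is not a whole number; if the range divided evenly, the width would need to be checked by hand so that the highest value still falls in the last class.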
Why Construct Frequency Distributions?
To organize the data in a meaningful,
intelligible way.
To enable the reader to make comparisons among
different data sets.
To facilitate computational procedures for
measures of average and spread.
To enable the reader to determine the nature or
shape of the distribution.
To enable the researcher to draw charts and
graphs for the presentation of data.
Types of Graphs and Charts
- The purpose of graphs in statistics is to convey the data to the viewer in pictorial form.
- Graphs are useful in getting the audience's attention in a publication or a presentation.
A. Histogram, Frequency Polygon, Ogive
- Histogram
- A graph that displays the data by using vertical bars of various heights to represent the frequencies.
- Frequency Polygon
- A graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points.
- Ogive (Cumulative Frequency Graph)
- A graph that represents the cumulative frequencies for the classes in a frequency distribution.
Procedure to Construct a Histogram, Frequency Polygon, and Ogive
- STEP 1 Draw and label the x and y axes.
- STEP 2 Choose a suitable scale for the frequencies or cumulative frequencies, and label it on the y axis.
- STEP 3 Represent the class boundaries for the histogram or ogive, or the midpoints for the frequency polygon, on the x axis.
- STEP 4 Plot the points and then draw the bars or lines.
Example
These data represent the record high temperatures for each of the 50 states. Construct a grouped frequency distribution for the data using 7 classes. Then, construct a histogram, frequency polygon, and ogive for these data.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
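Before drawing, it helps to compute the coordinates each graph needs. The sketch below assumes a class width of 5 starting at 100 (the range of 34 divided by 7 classes, rounded up) and produces only the plotting points, not the drawings; starting the ogive at a cumulative frequency of 0 at the first lower boundary is a common convention and an assumption here.

```python
from itertools import accumulate

# Record high temperatures for the 50 states, as listed in the example.
temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

width = 5
classes = [(lo, lo + width - 1) for lo in range(100, 135, width)]
freqs = [sum(lo <= t <= hi for t in temps) for lo, hi in classes]
cum_freqs = list(accumulate(freqs))

# Histogram: each bar spans the class boundaries; its height is the frequency.
histogram_bars = [((lo - 0.5, hi + 0.5), f) for (lo, hi), f in zip(classes, freqs)]

# Frequency polygon: one point per class at (midpoint, frequency).
polygon_points = [((lo + hi) / 2, f) for (lo, hi), f in zip(classes, freqs)]

# Ogive: cumulative frequency plotted at each upper class boundary.
ogive_points = [(classes[0][0] - 0.5, 0)] + [
    (hi + 0.5, cf) for (_, hi), cf in zip(classes, cum_freqs)
]

print(histogram_bars)
print(polygon_points)
print(ogive_points)
```

These lists can be passed to any plotting tool; the x values follow STEP 3 of the procedure (boundaries for the histogram and ogive, midpoints for the frequency polygon).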
Distribution Shapes
- Bell-Shaped
- Has a single peak and tapers off at either end
- Approximately symmetric
- It is roughly the same on both sides of a line running through the center
- J-Shaped
- Has a few data values on the left side, and the number increases as one moves to the right
- Uniform
- Basically flat or rectangular
- Reverse J-Shaped
- Opposite of J-shaped
- Has a few data values on the right side, and the number increases as one moves to the left
Distribution Shapes (continued)
- Right-Skewed
- The peak is to the left
- The data values taper off to the right
- Bimodal
- Has two peaks of the same height
- Left-Skewed
- The peak is to the right
- The data values taper off to the left
- U-Shaped
- Shaped like the letter U
B. Pareto Chart
- Used to represent a frequency distribution for a categorical variable. The frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest.
Example
Twenty-five army inductees were given a blood test to determine their blood type. The data set is:
A B B AB O O O B AB B B B O
A O A O O O AB AB A O B A
Construct a Pareto chart for the data.
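The bar order for a Pareto chart can be found by sorting the categories by frequency, highest first. The following sketch prints a rough text version of the chart; the cumulative percentage column is an optional extra and an assumption, not something the example asks for.

```python
from collections import Counter

# Blood types of the 25 inductees, as listed in the example.
blood_types = ("A B B AB O O O B AB B B B O "
               "A O A O O O AB AB A O B A").split()

# most_common() returns (category, frequency) pairs, highest frequency first.
pareto_order = Counter(blood_types).most_common()
total = len(blood_types)

running = 0
cumulative_percent = []
for category, freq in pareto_order:
    running += freq
    cumulative_percent.append(100 * running / total)
    print(f"{category:<3}{'#' * freq:<10}{freq:>3}{cumulative_percent[-1]:>7.0f}%")
```

Each `#` stands for one observation, so the bar lengths fall from left to right in the sorted order.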
C. Time Series Graph
- Represents data that occur over a specified period of time.
- STEP 1 Draw and label the x and y axes.
- STEP 2 Label the x axis for years and the y axis for the number of theaters.
- STEP 3 Plot each point according to the table.
- STEP 4 Draw line segments connecting adjacent points. Do not try to fit a smooth curve through the data points.
- When reading the graph, look for a trend or pattern that occurs over the time period (ascending or descending) and at the slope or steepness of the line (how fast the values increase or decrease).
- Two time series can be drawn on the same graph for comparison (compound time series graph).
Example
In 1958, there were more than 4000 outdoor drive-in theaters. The number of these theaters has changed over the years. Draw a time series graph for the data and summarize the findings.
Year   Number
1988   1497
1990   910
1992   870
1994   859
1996   826
1998   750
2000   637
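To summarize the findings, the change between adjacent points (the slope of each line segment) can be computed directly from the table. This is a minimal sketch; the variable names are illustrative.

```python
# Number of drive-in theaters by year, from the table in the example.
theaters = {1988: 1497, 1990: 910, 1992: 870, 1994: 859,
            1996: 826, 1998: 750, 2000: 637}

years = sorted(theaters)

# Change in the number of theaters between each pair of adjacent years.
changes = [(start, end, theaters[end] - theaters[start])
           for start, end in zip(years, years[1:])]

for start, end, change in changes:
    print(f"{start}-{end}: {change:+d}")

# A trend is descending if every segment goes down.
is_descending = all(change < 0 for _, _, change in changes)
steepest = min(changes, key=lambda item: item[2])
print("Descending trend:", is_descending)
print("Steepest decline:", steepest)
```

Every segment is negative, so the trend is descending, and the sharpest drop occurs in the first interval.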
D. Pie Chart
A pie graph is a circle that is divided into
sections or wedges according to the percentage of
frequencies in each category of the
distribution. The purpose of the pie graph is to
show the relationship of the parts to the whole
by visually comparing the sizes of the sectors.
Percentages or proportions can be used. The
variable is nominal or categorical.
Example
Twenty-five army inductees were given a blood test to determine their blood type. The data set is:
A B B AB O O O B AB B B B O
A O A O O O AB AB A O B A
Construct a pie chart for the data.
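Each wedge of the pie is proportional to its category's share of the total, so the percentage and the central angle (out of 360 degrees) follow from the frequencies. A short sketch, with illustrative variable names:

```python
from collections import Counter

# Blood types of the 25 inductees, as listed in the example.
blood_types = ("A B B AB O O O B AB B B B O "
               "A O A O O O AB AB A O B A").split()

counts = Counter(blood_types)
total = len(blood_types)

# Percentage of the whole and central angle of the wedge for each category.
percents = {k: 100 * f / total for k, f in counts.items()}
degrees = {k: 360 * f / total for k, f in counts.items()}

for k in counts:
    print(f"{k:<3}{counts[k]:>3}{percents[k]:>7.0f}%{degrees[k]:>8.1f} degrees")
```

The percentages should sum to 100 and the angles to 360, which is a useful check before drawing.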
Stem-and-Leaf Plots
Layout: Stem | Leaf
- A stem-and-leaf plot is a data plot that uses part of a data value as the stem (the leading digit) and part of the data value as the leaf (the trailing digit) to form groups or classes.
- It has the advantage over a grouped frequency distribution of retaining the actual data while showing them in graphic form.
- Sometimes a back-to-back (mixture) stem-and-leaf plot can be constructed to compare two data sets.
Layout: Leaf | Stem | Leaf
Example
An insurance company researcher conducted a survey on the number of car thefts in a large city for a period of 30 days last summer. The raw data are shown below. Construct a stem-and-leaf plot.
52 62 51 50 69 58 77 66 53 57
75 56 55 67 73 79 59 68 65 72
57 51 63 69 75 65 53 78 66 55
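For two-digit data such as these, the tens digit is the stem and the units digit is the leaf. The sketch below, with a hypothetical helper named `stem_and_leaf`, groups the sorted values by stem and prints one row per stem.

```python
from collections import defaultdict

# Daily car thefts over 30 days, as listed in the example.
thefts = [52, 62, 51, 50, 69, 58, 77, 66, 53, 57,
          75, 56, 55, 67, 73, 79, 59, 68, 65, 72,
          57, 51, 63, 69, 75, 65, 53, 78, 66, 55]


def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) with sorted units digits (leaves)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(thefts)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because every original value can be read back from its stem and leaf, the plot keeps the actual data while still showing the shape of the distribution.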
Example
- The data shown represent the percentage of unemployed males and females in 1995 for a sample of countries of the world. Using the whole numbers as stems and the decimals as leaves, construct a back-to-back (mixture) stem-and-leaf plot and compare the distributions of the two groups.
Females                 Males
8.0  3.7  8.6  5.0      8.8  1.9  5.6  4.6
7.0  3.3  8.6  3.2      1.5  6.6  5.6  0.3
8.8  6.8  9.2  5.9      2.2  5.6  3.1  5.9
7.2  4.6  5.6  5.3      9.8  8.7  6.0  5.2
7.7  8.0  8.7  0.5      4.4  9.6  6.6  6.0
6.5  3.4  3.0  9.4      4.6  3.1  4.1  7.7
Conclusions (2.1)
- Data can be organized in some meaningful way
using frequency distributions. Once the frequency
distribution is constructed, the representation
of the data by graphs is a simple task.
2.2 Summary Statistics (Data Description)
- Statistical methods can be used to summarize data.
- Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange.
- Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.
- Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values.
- The most common measures of position are percentiles, deciles, and quartiles.
- The measures of central tendency, variation, and position are part of what is called traditional statistics. These measures are typically used to confirm conjectures about the data.
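- Measures of Central Tendency
Mean: the sum of the values divided by the total number of values.
Population mean: $\mu = \frac{\sum X}{N}$, where N is the number of values in the population.
Sample mean: $\bar{X} = \frac{\sum X}{n}$, where n is the number of values in the sample.
Example: 9 2 1 4 3 3 7 5 8 6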
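The mean of the example data can be computed directly from the definition (sum of the values divided by the number of values). A minimal sketch:

```python
# Example data from the slide.
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

total = sum(data)      # sum of the values
n = len(data)          # number of values
mean = total / n       # sum divided by the number of values

print(total, n, mean)
```

Note that the result is not one of the data values, which illustrates that the mean need not be.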
Mean
- One computes the mean by using all the values of the data.
- The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.
- The mean is used in computing other statistics, such as the variance.
- The mean for the data set is unique, and not necessarily one of the data values.
- The mean cannot be computed for an open-ended frequency distribution.
- The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations.
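- Measures of Central Tendency
Median: the middle number of n ordered data values (smallest to largest).
If n is odd: the median is the value in position $\frac{n+1}{2}$ of the ordered data.
If n is even: the median is the average of the values in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$ of the ordered data.
Example (n even): 9 2 1 4 3 3 7 5 8 6
Example (n odd): 9 2 1 3 3 7 5 8 6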
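Both examples can be worked by sorting the data and picking the middle position(s). The sketch below is one straightforward implementation of that rule; the function name `median` is chosen here for illustration and is written out by hand rather than taken from a library.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Position (n + 1) / 2, converted to a zero-based index.
        return ordered[(n + 1) // 2 - 1]
    # Positions n / 2 and n / 2 + 1, converted to zero-based indexes.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


even_example = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]   # n = 10
odd_example = [9, 2, 1, 3, 3, 7, 5, 8, 6]       # n = 9

print(sorted(even_example), median(even_example))
print(sorted(odd_example), median(odd_example))
```

In the even case the median falls between two data values, so it is not itself in the data set.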
Median
- The median is used when one must find the center or middle value of a data set.
- The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.
- The median is used to find the average of an open-ended distribution.
- The median is affected less than the mean by extremely high or extremely low values.
- Measures of Central Tendency
Mode: the most commonly occurring value in a data series.
- The mode is used when the most typical case is desired.
- The mode is the easiest average to compute.
- The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.
- The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.
Example: 9 2 1 4 3 3 7 5 8 6
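- Measures of Central Tendency
Midrange: a rough estimate of the middle, found by adding the lowest and highest values and dividing by 2: $MR = \frac{\text{lowest value} + \text{highest value}}{2}$. It is also a very rough estimate of the average and can be affected by one extremely high or low value.
Example: 9 2 1 4 3 3 7 5 8 6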
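The mode and midrange of the example data can be found as sketched below. The helper `modes` is a hypothetical name; treating a data set in which every value occurs once as having no mode is one common convention and is an assumption here.

```python
from collections import Counter

# Example data from the slide.
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]


def modes(values):
    """Return every value tied for the highest frequency (empty if no value repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []          # convention assumed here: no repeated value, no mode
    return sorted(v for v, c in counts.items() if c == highest)


# Midrange: lowest value plus highest value, divided by 2.
midrange = (min(data) + max(data)) / 2

print("Mode(s):", modes(data))
print("Midrange:", midrange)
```

Returning a list makes it easy to represent data sets with one mode, several modes, or none.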
Types of Distribution
Symmetric
Positively skewed or right-skewed
Negatively skewed or left-skewed
- Measures of Variation / Dispersion
- Used when a measure of central tendency alone is not informative or not needed (e.g., when the means of two data sets are the same).
- Gauge the variability that exists in a data set.
- Help to form a judgment about how well the average value depicts the data.
- Show the extent of the scatter so that steps may be taken to control the existing variation.
- Measures of Variation / Dispersion
Range: the difference between the highest value and the lowest value in a data set. The symbol R is used for the range.
R = highest value - lowest value
Example: 9 2 1 4 3 3 7 5 8 6
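- Measures of Variation / Dispersion
Variance: the average of the squares of the distance each value is from the mean (the sample version divides by n - 1 rather than n).
Population variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Example: 9 2 1 4 3 3 7 5 8 6
Standard deviation: the square root of the variance.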
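The range, variance, and standard deviation of the example data can be checked with Python's standard `statistics` module, where `pvariance` and `pstdev` divide by N (population) and `variance` and `stdev` divide by n - 1 (sample). A brief sketch:

```python
import statistics

# Example data from the slide.
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

data_range = max(data) - min(data)            # R = highest value - lowest value

mean = statistics.mean(data)
squared_distances = [(x - mean) ** 2 for x in data]

pop_variance = statistics.pvariance(data)     # sum of squared distances / N
sample_variance = statistics.variance(data)   # sum of squared distances / (n - 1)
pop_std = statistics.pstdev(data)             # square root of population variance
sample_std = statistics.stdev(data)           # square root of sample variance

print("Range:", data_range)
print("Sum of squared distances:", round(sum(squared_distances), 4))
print("Population variance:", round(pop_variance, 4), "std dev:", round(pop_std, 4))
print("Sample variance:", round(sample_variance, 4), "std dev:", round(sample_std, 4))
```

The sample values are slightly larger than the population values because the same sum of squared distances is divided by n - 1 instead of N.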
Variance and Standard Deviation
- Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable.
- The measures of variance and standard deviation are used to determine the consistency of a variable.
- The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution.
- The variance and standard deviation are used quite often in inferential statistics.
Chebyshev's Theorem
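Chebyshev's theorem is commonly stated as follows: for any data set, the proportion of values that fall within k standard deviations of the mean is at least 1 - 1/k^2, where k is any number greater than 1. The sketch below evaluates that lower bound; the function name `chebyshev_lower_bound` is a hypothetical helper introduced for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


for k in (1.5, 2, 3):
    bound = chebyshev_lower_bound(k)
    print(f"k = {k}: at least {100 * bound:.1f}% of the values")
```

The bound holds for any distribution shape, so it is usually conservative for bell-shaped data.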
Describing the Position of a Data Value
Percentile
Quartile
Decile
Example 1: 9 2 1 4 3 3 7 5 8 6
Example 2: 19 2 1 4 3 3 7 5 8 6
Example
- Given: 9 2 1 4 3 7 5 4 6
- What percentile is the value of 8?
- Given: 9 22 11 14 13 3 7 15 18 16
- What percentile is the value of 20?
Outliers
- An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
- Outliers can be the result of measurement or observational error.
- When a distribution is normal or bell-shaped, data values that are beyond three standard deviations of the mean can be considered suspected outliers.
Example: 9 22 11 14 13 3 7 15 18 16
The Measures of Central Tendency, Variation, and Position for Grouped Data
Measures of central tendency
Mean
Median class
Mode class
Measures of variation
Population variance
Sample variance
Measures of position
Quartile
Decile
Percentile
Example
Find the mean, median class, mode class, population and sample variances, quartiles, deciles, and percentiles.
2.3 Exploratory Data Analysis
- The purpose of exploratory data analysis is to examine data in order to find out what information can be discovered. For example:
- Are there any gaps in the data?
- Can any patterns be discerned?
Boxplots
- Boxplots are graphical representations of a five-number summary of a data set and its outliers. The five specific values that make up a five-number summary are:
- The lowest value of the data set (minimum)
- Q1 (or 25th percentile)
- The median (or 50th percentile)
- Q3 (or 75th percentile)
- The highest value of the data set (maximum)
Steps to Construct a Boxplot
- STEP 1 Arrange the data in order.
- STEP 2 Find the median.
- STEP 3 Find Q1 and Q3.
- STEP 4 Find the outliers: points lying more than 1.5 times the interquartile range above Q3 or below Q1.
- STEP 5 Draw a scale for the data on the x axis.
- STEP 6 Locate the lowest value, Q1, the median, Q3, the highest value, and the outliers on the scale.
- STEP 7 Draw a box around Q1 and Q3, draw a vertical line through the median, and connect the highest and lowest values to the box.
Example 1: 9 22 11 14 13 3 7 15 18 16
Example 2: 19 2 1 7 5 8 6
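The numerical part of the boxplot steps (ordering the data, the five-number summary, and the outlier check) can be sketched as follows. The slides do not say how Q1 and Q3 are computed, so this sketch assumes one common convention: Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the overall median left out of both halves when n is odd. Other conventions give slightly different quartiles. The function names are illustrative.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)                       # STEP 1: arrange the data
    n = len(data)
    half = n // 2
    lower = data[:half]                         # values below the median position
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


def find_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3 (STEP 4)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in sorted(values) if v < low_fence or v > high_fence]


example_1 = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]
example_2 = [19, 2, 1, 7, 5, 8, 6]

for name, data in (("Example 1", example_1), ("Example 2", example_2)):
    print(name, five_number_summary(data), "outliers:", find_outliers(data))
```

Under this convention, Example 1 has no outliers, while the value 19 in Example 2 lies above the upper fence and would be plotted as a separate point.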
CONCLUSIONS
- By combining the techniques discussed in this chapter, the student is now able to collect, organize, summarize, and present data.