Title: Chapter 2 Descriptive Analysis
1Chapter 2 Descriptive Analysis Presentation
of Single-Variable Data
2Chapter Goals
- Learn how to present and describe sets of data
- Learn measures of central tendency, measures of
dispersion (spread), measures of position, and
types of distributions
- Learn how to interpret findings so that we know
what the data is telling us about the sampled
population
32.2 Graphic Presentation of Data
- Use initial exploratory data-analysis techniques
to produce a pictorial representation of the data
- Resulting displays reveal patterns of behavior of
the variable being studied
- The method used is determined by the type of data
and the idea to be presented
- No single correct answer when constructing a
graphic display
4Pie Graphs Bar Graphs
- Pie Graphs and Bar Graphs Graphs that are used
to summarize categorical (some time numerical)
data
- Pie graphs show the amount of data that belongs
to each category as a proportional part of a
circle
- Bar graphs show the amount of data that belongs
to each category as proportionally sized
rectangular areas
5Example
- Example The table below lists the number of
automobiles sold last week by day for a local
dealership. Describe the data using a circle
graph and a bar graph
6Bar Graph Solution
7Key Definitions
- Numerical Data One reason for constructing a
graph of numerical data is to examine the
distribution - is the data compact, spread out,
skewed, symmetric, etc.
Distribution The pattern of variability
displayed by the data of a variable. The
distribution displays the frequency of each value
of the variable.
Dotplot Display Displays the data of a sample by
representing each piece of data with a dot
positioned along a scale. This scale can be
either horizontal or vertical. The frequency of
the values is represented along the other scale.
8- Example A random sample of the lifetime (in
years) of 49 home washing machines is given
below
The figure below is a dotplot for the 49
lifetimes
- .
- . . .. .
- .. ... .. ... . . .
. . . - -------------------------------------
--------------- - 0.0 4.0
8.0 12.0 16.0
20.0
Note Notice how the data is bunched near the
lower extreme and morespread out near the
higher extreme
9Stem-and-Leaf Display
Stem-and-Leaf Display Pictures the data of a
sample using the actual digits that make up the
data values. Each numerical data is divided into
two parts The leading digit(s) becomes the stem,
and the trailing digit(s) becomes the leaf. The
stems are located along the main axis, and a leaf
for each piece of data is located so as to
display the distribution of the data.
10Example
- Example A city police officer, using radar,
checked the speed of cars as they were traveling
down the main street in town. Construct a
stem-and-leaf plot for this data
41 31 33 35 36 37 39 49 33
19 26 27 24 32 40 39 16 55 38
36
Solution All the speeds are in the 10s, 20s,
30s, 40s, and 50s. Use the first digit of each
speed as the stem and the second digit as the
leaf. Draw a vertical line and list the stems,
in order to the left of the line. Place each
leaf on its stem place the trailing digit on the
right side of the vertical line opposite its
corresponding leading digit.
11- 20 Speeds
- ---------------------------------------
- 1 6 9
- 2 4 6 7
- 3 1 2 3 3 5 6 6 7 8 9 9
- 4 0 1 9
- 5 5
- ----------------------------------------
- The speeds are centered around the 30s
Note The display could be constructed so that
only five possible values (instead of ten) could
fall in each stem. What would the stems look
like? Would there be a difference in appearance?
12Remember!
- 1. It is fairly typical of many variables to
display a distribution that is concentrated
(mounded) about a central value and then in some
manner be dispersed in both directions.
2. A display that indicates two mounds may
really be two overlapping distributions
3. A back-to-back stem-and-leaf display makes it
possible to compare two distributions graphically
4. A side-by-side dotplot is also useful for
comparing two distributions
132.3 Frequency Distributions Histograms
- Stem-and-leaf plots often present adequate
summaries, but they can get very big, very fast
- Need other techniques for summarizing data
- Frequency distributions and histograms are used
to summarize large data sets
14Frequency Distributions
- Frequency Distribution A listing, often
expressed in chart form, that pairs each value of
a variable with its frequency
1. A table that summarizes data by classes/class
intervals, is called frequency table 2. In a
typical grouped frequency distribution, there are
usually 5-12 classes of equal width 3. The table
may contain columns for class number, class
interval, tally (if constructing by hand),
frequency, relative frequency, cumulative
relative frequency, and class midpoint 4. In an
ungrouped frequency distribution each class
consists of a single value
15- Guidelines for constructing a frequency
distribution
1. All classes should be of the same
width 2. Classes should be set up so that they do
not overlap and so that each piece of data
belongs to exactly one class 3. For problems in
the text, 5-12 classes are most desirable. The
square root of n is a reasonable guideline for
the number of classes if n is less than
150. 4. Use a system that takes advantage of a
number pattern, to guarantee accuracy 5. If
possible, an even class width is often
advantageous
16Procedure for constructing a frequency
distribution
1. Identify the high (H) and low (L) scores.
Find the range.Range H - L 2. Select a number
of classes and a class width so that the product
is a bit larger than the range 3. Pick a starting
point a little smaller than L. Count from L by
the width to obtain the class boundaries.
Observations that fall on class boundaries are
placed into the class interval to the right.
17Example
- Example The hemoglobin test, a blood test given
to diabetics during their periodic checkups,
indicates the level of control of blood sugar
during the past two to three months. The data in
the table below was obtained for 40 different
diabetics at a university clinic that treats
diabetic patients
- 6.5 5.0 5.6 7.6 4.8 8.0 7.5 7.9
8.0 9.2 - 6.4 6.0 5.6 6.0 5.7 9.2 8.1 8.0
6.5 6.6 - 5.0 8.0 6.5 6.1 6.4 6.6 7.2 5.9
4.0 5.7 - 7.9 6.0 5.6 6.0 6.2 7.7 6.7 7.7
8.2 9.0
18Solutions
1)
- Class Frequency Relative
Cumulative Class - Boundaries f Frequency
Rel. Frequency Midpoint, x - --------------------------------------------------
------------------------------------- - 3.7 - lt4.7 1 0.025 0.025 4.2
- 4.7 - lt5.7 6 0.150 0.175 5.2
- 5.7 - lt6.7 16 0.400 0.575 6.2
- 6.7 - lt7.7
- 7.7 - lt8.7 10 0.250 0.925 8.2
- 8.7 - lt9.7
2) The class 5.7 - lt6.7 has the highest
frequency. The frequency is 16 and the relative
frequency is 0.40
19Histogram
Histogram A bar graph representing a frequency
distribution of a quantitative variable. A
histogram is made up of the following components
1. A title, which identifies the population of
interest 2. A vertical scale, which identifies
the frequencies in the various classes 3. A
horizontal scale, which identifies the variable
x. Values for the class boundaries or class
midpoints may be labeled along the x-axis. Use
whichever method of labeling the axis best
presents the variable.
- Notes
- The relative frequency is sometimes used on the
vertical scale - It is possible to create a histogram based on
class midpoints
20Example
- Example Construct a histogram for the blood test
results given in the previous example
21Example
- Example A recent survey of Roman Catholic nuns
summarized their ages in the table below.
Construct a histogram for this age data
- Age Frequency Class
Midpoint - --------------------------------------------------
---------- - 20 up to 30 34 25
- 30 up to 40 58 35
- 40 up to 50 76 45
- 50 up to 60 187 55
- 60 up to 70 254 65
- 70 up to 80 241 75
- 80 up to 90 147 85
22Solution
23Characteristics of Histograms
Symmetrical Both sides of the distribution are
identical mirror images. There is a line of
symmetry.
Uniform (Rectangular) Every value appears with
equal frequency
Skewed One tail is stretched out longer than the
other. The direction of skewness is on the side
of the longer tail. (Positively skewed vs.
negatively skewed)
J-Shaped There is no tail on the side of the
class with the highest frequency
Bimodal The two largest classes are separated by
one or more classes. Often implies two
populations are sampled.
Normal A symmetrical distribution is mounded
about the mean and becomes sparse at the extremes
24Important Reminders
- Graphical representations of data should include
a descriptive, meaningful title and proper
identification of the vertical and horizontal
scales
25Cumulative Frequency Distribution
Cumulative Frequency Distribution A frequency
distribution that pairs cumulative frequencies
with values of the variable
- The cumulative frequency for any given class is
the sum of the frequency for that class and the
frequencies of all classes of smaller values
- The cumulative relative frequency for any given
class is the sum of the relative frequency for
that class and the relative frequencies of all
classes of smaller values
26Example
- Example A computer science aptitude test was
given to 50 students. The table below summarizes
the data
- Class Relative Cumulative
Cumulative - Boundaries Frequency Frequency
Frequency Rel. Frequency - --------------------------------------------------
---------------------- - 0 up to 4 4 0.08 4 0.08
- 4 up to 8 8 0.16 12 0.24
- 8 up to 12 8 0.16 20 0.40
- 12 up to 16 20 0.40 40 0.80
- 16 up to 20 6 0.12 46 0.92
- 20 up to 24 3 0.06 49 0.98
- 24 up to 28 1 0.02 50 1.00
272.4 Measures of Central Tendency
- Numerical values used to locate the middle of a
set of data, or where the data is clustered
- The term average is often associated with all
measures of central tendency
28Mean
Mean The type of average with which you are
probably most familiar. The mean is the sum of
all the values divided by the total number of
values, n
29Example
- Example The following data represents the number
of accidents in each of the last 6 years at a
dangerous intersection. Find the mean number of
accidents 8, 9, 3, 5, 2, 6, 4, 5
- In the data above, change 6 to 26
Note The mean can be greatly influenced by
outliers
30Median
Median The value of the data that occupies the
middle position when the data are ranked in order
according to size
31Example
- Example Find the median for the set of data
4, 8, 3, 8, 2, 9, 2, 11, 3
32Mode Midrange
Mode The mode is the value of x that occurs most
frequently
Note If two or more values in a sample are tied
for the highest frequency (number of
occurrences), there is no mode
Midrange The number exactly midway between a
lowest value data L and a highest value data H.
It is found by averaging the low and the high
values
33Example
- Example Consider the data set 12.7, 27.1, 35.6,
44.2, 18.0
342.5 Measures of Dispersion
- Measures of central tendency alone cannot
completely characterize a set of data. Two very
different data sets may have similar measures of
central tendency.
- Measures of dispersion are used to describe the
spread, or variability, of a distribution
- Common measures of dispersion range, variance,
interquartile range, and standard deviation
35Range
Range The difference in value between the
highest-valued (H) and the lowest-valued (L)
pieces of data
- Other measures of dispersion are based on the
following quantity
36Example
- Example Consider the sample 12, 23, 17, 15,
18.Find 1) the range and 2) each deviation from
the mean.
37Mean Absolute Deviation
Mean Absolute Deviation The mean of the absolute
values of the deviations from the mean
38Sample Variance Standard Deviation
Sample Variance The sample variance, s2, is the
mean of the squared deviations, calculated using
n - 1 as the divisor
Standard Deviation The standard deviation of a
sample, s, is the positive square root of the
variance
39Example
- Example Find the 1) variance and 2) standard
deviation for the data 5, 7, 1, 3, 8
40Notes
- The shortcut formula for the sample variance
- The unit of measure for the standard deviation is
the same as the unit of measure for the data
412. Percentiles and Quartiles
GRE SCORES
- If your are in the 90th percentile of the
population, what scored higher than you?
- If you are in the 25th percentile, what percent
scored less than or equal to you?
- What percent lie between the 25th and 75th
percentiles?
- What percentile is the median?
- The 25th percentile is also called first
quartile.
- The 75th percentile is also called third quartile.
(We will see these concepts again later.)
422. INTERQUARTILE RANGE (IQR)
- Defined as the difference between the 75th and
25th percentiles - The length of data needed to cover 50 of the
data. - This is a robust measure of variability
- Why do I say it is robust?
432.6 Measures of Position
- Measures of position are used to describe the
relative location of an observation
- Quartiles and percentiles are two of the most
popular measures of position
- An additional measure of central tendency, the
midquartile, is defined using quartiles
- Quartiles are part of the 5-number summary
44Quartiles
Quartiles Values of the variable that divide the
ranked data into quarters each set of data has
three quartiles
- 1. The first quartile, Q1, is a number such that
at most 25 of the data are smaller in value than
Q1 and at most 75 are larger - 2. The second quartile, Q2, is the median
- 3. The third quartile, Q3, is a number such that
at most 75 of the data are smaller in value
than Q3 and at most 25 are larger
45Percentiles
Percentiles Values of the variable that divide a
set of ranked data into 100 equal subsets each
set of data has 99 percentiles. The kth
percentile, Pk, is a value such that at most k
of the data is smaller in value than Pk and at
most (100 - k) of the data is larger.
465-Number Summary
5-Number Summary The 5-number summary is
composed of
- Notes
- The 5-number summary indicates how much the data
is spread out in each quarter - The interquartile range is the difference between
the first and third quartiles. It is the range
of the middle 50 of the data
47Box-plot
Box-plot A graphic representation of the
5-number summary
48Example
- Example A random sample of students in a sixth
grade class was selected. Their weights are
given in the table below. Find the 5-number
summary for this data and construct a boxplot
- 63 64 76 76 81 83 85 86
88 89 90 91 92 93 93 93
94 97 99 99 99 101 108 109 112
49Boxplot for Weight Data
50z-Score
z-Score The position a particular value of x has
relative to the mean, measured in standard
deviations. The z-score is found by the formula
- Notes
- Typically, the calculated value of z is rounded
to the nearest hundredth - The z-score measures the number of standard
deviations above/below, or away from, the mean - z-scores typically range from -3.00 to 3.00
- z-scores may be used to make comparisons of raw
scores
51Example
- Example A certain data set has mean 35.6 and
standard deviation 7.1. Find the z-scores for
46 and 33
522.7 Interpreting Understanding Standard
Deviation
- Standard deviation is a measure of variability,
or spread
- Two rules for describing data rely on the
standard deviation - Empirical rule applies to a variable that is
normally distributed - Chebyshevs theorem applies to any distribution
53Empirical Rule
Empirical Rule If a variable is approximately
normally distributed, then
- 1. Approximately 68 of the observations lie
within 1 standard deviation of the mean - 2. Approximately 95 of the observations lie
within 2 standard deviations of the mean - 3. Approximately 99.7 of the observations lie
within 3 standard deviations of the mean
- Notes
- The empirical rule is more informative than
Chebyshevs theorem since we know more about the
distribution (normally distributed) - Also applies to populations
- Can be used to determine if a distribution is
normally distributed
54Illustration of the Empirical Rule
55Example
- Example A random sample of plum tomatoes was
selected from a local grocery store and their
weights recorded. The mean weight was 6.5 ounces
with a standard deviation of 0.4 ounces. If the
weights are normally distributed
- 1) What percentage of weights fall between 5.7
and 7.3? - 2) What percentage of weights fall above 7.7?
56A Note about the Empirical Rule
Note It is feasible to use empirical rule when a
set of data is approximately normally
distributed. To find out whether the data is
approximately normally distributed, you may use
histogram or box-plot.
57Chebyshevs Theorem
Chebyshevs Theorem The proportion of any
distribution that lies within k standard
deviations of the mean is at least 1 - (1/k2),
where k is any positive number larger than 1.
This theorem applies to all distributions of data.
58Important Reminders!
- Chebyshevs theorem is very conservative and
holds for any distribution of data
- Chebyshevs theorem also applies to any population
- The two most common values used to describe a
distribution of data are k 2, 3
59Example
- Example At the close of trading, a random sample
of 35 technology stocks was selected. The mean
selling price was 67.75 and the standard
deviation was 12.3. Use Chebyshevs theorem (with
k 2, 3) to describe the distribution.
602.8 The Art of Statistical Deception
- Good Arithmetic, Bad Statistics
- Misleading Graphs
- Insufficient Information
61Good Arithmetic, Bad StatisticsThe mean can be
greatly influenced by outliersExample The mean
salary for all NBA players is 15.5 million
Misleading graphs 1. The frequency scale should
start at zero to present a complete picture.
Graphs that do not start at zero are used to save
space. 2. Graphs that start at zero emphasize the
size of the numbers involved 3. Graphs that are
chopped off emphasize variation
62Flight Cancellations
63Flight Cancellations
64Insufficient Information
- Example An admissions officer from a state
school explains that the average tuition at a
nearby private university is 13,000 and only
4500 at his school. This makes the state school
look more attractive.