Title: MATHSTAT 231
1MATH/STAT 231
- Chapter 2-II
- Univariate Data Description
- Numerical Description
2Introduction
- A graph, such as a histogram, gives us a picture
of our data. It can show the shape of the
distribution, the number of modes and their
locations, and other information. However, we
also want quantitative information about the
distribution. The two most commonly desired
quantities are a value describing the center of
the data and a value describing how spread out
the data are.
- 1. Measures of central tendency
- 2. Measures of dispersion (variation)
4Measures of central tendency
- The three common measures of center are the mean,
the median, and the mode.
- Mean
- The mean (arithmetic average) is the most
common measure of central tendency. It is simply
the sum of the values divided by the number of
values.
5- Example: At a ski rental shop, data were collected
on the number of rentals on each of 10
consecutive Saturdays: 44, 50, 38, 96, 42, 47,
40, 39, 46, 50. Find the mean.
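The mean for the ski rental data can be checked with a short Python sketch. The names `mean`, `rentals`, and `rentals_mean` are chosen here for illustration and are not part of the original slides.

```python
# Rental counts for the 10 consecutive Saturdays (from the example).
rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


rentals_mean = mean(rentals)
print(rentals_mean)  # 492 / 10 = 49.2
```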
6- Example: On his first 5 biology tests, Bob
received the following scores: 72, 86, 92, 63,
and 77. What test score must Bob earn on his
sixth test so that his average (mean score) for
all six tests will be 80?
- Let x be the sixth score. Then
(72 + 86 + 92 + 63 + 77 + x) / 6 = 80
- Cross multiply: (80)(6) = 390 + x
- 480 = 390 + x
- x = 90
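The same algebra can be written as a small function: the total needed for the target mean, minus the total already earned. This is only a sketch, and `required_score` is a hypothetical helper name introduced for illustration.

```python
def required_score(scores, target_mean):
    """Score needed on one more test so the overall mean equals target_mean."""
    total_needed = target_mean * (len(scores) + 1)  # (80)(6) = 480 for Bob
    return total_needed - sum(scores)               # 480 - 390


bob_scores = [72, 86, 92, 63, 77]
needed = required_score(bob_scores, 80)
print(needed)  # 90
```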
7- Median
- The median is the value that occupies the
middle position when the data are sorted from
smallest to largest. Half the data values are
above the median and half are below it.
- Find the median
- 1. Order the observations from smallest to
largest.
- 2. Select the middle point.
- If there is an odd number (N) of values, the
median M is the (N+1)/2 th value (the middle
value). If there is an even number of values,
the median is the average of the two middle
values (add the two middle values and divide by 2).
8- Example: Test scores for a class of 17 students
are as follows: 93, 84, 97, 98, 100, 78, 86, 82,
85, 92, 72, 55, 91, 90, 75, 94, 83. What is the
median score?
- Step 1: Sort the data from smallest to
largest.
- 55 72 75 78 82 83 84 85 86 90 91
92 93 94 97 98 100
- Step 2: Locate the middle position.
- N = 17
- Middle position = (17 + 1) / 2 = 9
- The median = 86 (the 9th number, the value at
the middle position).
9- Example: At a ski rental shop, data were collected
on the number of rentals on each of 10
consecutive Saturdays: 44, 50, 38, 96, 42, 47,
40, 39, 46, 50. Find the median.
- Step 1: Sort the data from smallest to
largest.
- 38 39 40 42 44 46 47 50 50 96
- Step 2: Locate the middle position.
- N = 10
- The middle position is between the 5th and
6th values.
- The median is the average of the two middle numbers.
- 38 39 40 42 44 | M | 46 47 50 50 96
- M = (44 + 46) / 2 = 45
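The two-step procedure (sort, then pick the middle position) can be sketched in Python as follows. The function name `median` and the variable names are illustrative assumptions. Python indexes from 0, so the (N+1)/2 th value sits at index N // 2.

```python
def median(values):
    """Median by the sort-and-pick-the-middle procedure."""
    ordered = sorted(values)        # Step 1: sort from smallest to largest
    n = len(ordered)
    mid = n // 2                    # Step 2: locate the middle position
    if n % 2 == 1:
        return ordered[mid]         # odd N: the (N+1)/2 th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: average the two middle values


test_scores = [93, 84, 97, 98, 100, 78, 86, 82, 85, 92, 72, 55, 91, 90, 75, 94, 83]
rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

scores_median = median(test_scores)   # 86 (the 9th of 17 sorted values)
rentals_median = median(rentals)      # (44 + 46) / 2 = 45
print(scores_median, rentals_median)
```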
10- Example: The ages of the 667 people participating
in a large workshop (to the nearest year) are
summarized as follows:
- What is the median age of the 667 people?
(With N = 667, the median is the (667 + 1) / 2 = 334th value in sorted order.)
11- Mode
- The value that occurs most often in a data set
is called the mode. It is sometimes said to be
the most typical case.
- No mode: if each value occurs only once, the data set has no mode.
- Example: For individuals having the following
ages: 18, 18, 19, 20, 20, 20, 21, and 23, the
modal age is 20.
- Example: Find the mode for the following data:
5, 15, 10, 15, 5, 10, 10, 20, 25, 15.
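One way to check both examples is the standard-library function `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest count. The variable names below are illustrative.

```python
from statistics import multimode

ages = [18, 18, 19, 20, 20, 20, 21, 23]
data = [5, 15, 10, 15, 5, 10, 10, 20, 25, 15]

ages_modes = multimode(ages)          # 20 occurs three times
data_modes = sorted(multimode(data))  # 10 and 15 each occur three times

print(ages_modes)  # [20]
print(data_modes)  # [10, 15]
```

The second data set has two modes (it is bimodal). When every value occurs exactly once, `multimode` returns all of the values; under the convention on this slide, that case is described as having no mode.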
12- Try to find the mode based on graphs.
- Data values of blood type:
A B B AB O O O B AB B B B O A O A O O O AB AB A O B A
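Because the original graph is not reproduced here, the following sketch tallies the 25 blood-type values and prints a rough text bar chart, so the mode can be read off as the tallest bar. It uses `collections.Counter` from the standard library. The variable names are illustrative.

```python
from collections import Counter

# The 25 blood-type observations listed on the slide.
blood_types = ("A B B AB O O O B AB B B B O A "
               "O A O O O AB AB A O B A").split()

counts = Counter(blood_types)

# Text bar chart, most frequent category first.
for blood_type, count in counts.most_common():
    print(f"{blood_type:>2} {'#' * count} ({count})")

modal_type = counts.most_common(1)[0][0]
print("Mode:", modal_type)  # O, with 9 of the 25 observations
```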
13Comments
- For symmetric distributions, the mean and median
are approximately equal. For symmetric and unimodal
distributions, the mean, median, and mode are
approximately equal.
- In skewed distributions, the mean is farther out
in the long tail than the median. So, if the
distribution is skewed to the right, then typically
mean > median > mode. If the distribution is skewed
to the left, then typically mean < median < mode.
14- The mean is sensitive to the influence of a few
extreme observations. The median and mode are
more resistant than the mean.
- For nominal data (such as sex or race), the mode
is the only valid measure.
- For ordinal data (such as salary categories),
only the mode and median can be used.
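The sensitivity of the mean to extreme observations can be illustrated with the ski rental data used earlier, which contains one unusually large value (96). This sketch compares the mean and median with and without that value. Dropping the value is done here only for illustration, not as a recommended way to handle outliers.

```python
from statistics import mean, median

rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
without_extreme = [x for x in rentals if x != 96]

mean_all, median_all = mean(rentals), median(rentals)                   # 49.2 and 45
mean_trim, median_trim = mean(without_extreme), median(without_extreme) # 44 and 44

print("all data:   mean =", mean_all, " median =", median_all)
print("without 96: mean =", mean_trim, " median =", median_trim)
```

Removing the single extreme value shifts the mean by 5.2 rentals but the median by only 1, which is what "more resistant" means in practice.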
16Measuring dispersion (variation)
- Range
- Range = Max - Min
- Example: At a ski rental shop, data were collected
on the number of rentals on each of 10
consecutive Saturdays: 44, 50, 38, 96, 42, 47,
40, 39, 46, 50. Find the range.
- Range = 96 (the largest value) - 38 (the
smallest value) = 58
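17- Standard deviation
- Assume we have a data set with n values
$x_1, x_2, \ldots, x_n$, and let $\bar{x}$ denote
the mean of the data set.
- Formula for the standard deviation
- Sample variance
- $s^2 = \dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
- or, in more compact notation
- $s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- Sample standard deviation
- $s = \sqrt{s^2}$
- Variance and standard deviation
- Variance = s × s = s^2
- s = square root of the variance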
18Example: Suppose we wish to find the standard
deviation of the data set consisting of the
values 3, 7, 7, and 19.
- Step 1: Find the mean (average) of 3, 7, 7, and 19:
(3 + 7 + 7 + 19) / 4 = 9.
- Step 2: Find the deviation of each number from the mean:
3 - 9 = -6, 7 - 9 = -2, 7 - 9 = -2, 19 - 9 = 10.
- Step 3: Square each of the deviations, which amplifies
large deviations and makes negative values positive:
(-6)^2 = 36, (-2)^2 = 4, (-2)^2 = 4, 10^2 = 100.
- Step 4: Find the variance (sum of squared deviations / (n - 1)):
s^2 = (36 + 4 + 4 + 100) / (4 - 1) = 48.
- Step 5: Take the non-negative square root of the variance:
s = sqrt(48) ≈ 6.93.
- So, the standard deviation of the set is about 6.93.
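The five steps map directly onto a few lines of Python. This is a minimal sketch that mirrors the hand calculation. The variable names are illustrative.

```python
import math

data = [3, 7, 7, 19]
n = len(data)

data_mean = sum(data) / n                       # Step 1: mean = 9
deviations = [x - data_mean for x in data]      # Step 2: -6, -2, -2, 10
squared = [d ** 2 for d in deviations]          # Step 3: 36, 4, 4, 100
variance = sum(squared) / (n - 1)               # Step 4: 144 / 3 = 48
std_dev = math.sqrt(variance)                   # Step 5: about 6.93

print(variance, round(std_dev, 2))
```

The standard library offers the same sample (n - 1) calculation as `statistics.variance` and `statistics.stdev`.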
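19Another formula to calculate the variance:
s^2 = (V1 - V2) / (n - 1), where V1 is the sum of the
squared data values and V2 = n × (mean)^2.
- Example: Find the standard deviation of the data set
consisting of the values 3, 7, 7, and 19.
- Step 1: Find the mean: (3 + 7 + 7 + 19) / 4 = 9.
Then V2 = 4 × 9^2 = 324.
- Step 2: Square each of the data values and add
them all: V1 = 3^2 + 7^2 + 7^2 + 19^2 = 468.
- Step 3: s^2 = (V1 - V2) / (4 - 1) = (468 - 324) / 3 = 48.
- The standard deviation = sqrt(s^2) = sqrt(48) ≈ 6.93.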
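The shortcut formula can be sketched in Python and compared against the standard library's `statistics.variance`, which uses the same n - 1 divisor. The names `v1` and `v2` follow the slide's V1 and V2; the other names are illustrative.

```python
import math
from statistics import variance

data = [3, 7, 7, 19]
n = len(data)
data_mean = sum(data) / n                 # 9

v1 = sum(x ** 2 for x in data)            # 9 + 49 + 49 + 361 = 468
v2 = n * data_mean ** 2                   # 4 x 81 = 324
shortcut_variance = (v1 - v2) / (n - 1)   # 144 / 3 = 48
shortcut_std = math.sqrt(shortcut_variance)

# Cross-check against the definition-based library routine.
print(shortcut_variance, variance(data), round(shortcut_std, 2))
```

The shortcut is convenient by hand. In floating-point arithmetic, subtracting two large, nearly equal totals can lose precision, so the deviation-based form is usually the safer choice in software.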
20- Comments
- 1. A large standard deviation indicates that
the data points are far from the center (mean),
and a small standard deviation indicates that
they are clustered closely around the center
(mean).
- 2. The standard deviation has the same units
as the data points themselves.
21- Practical use of the standard deviation
- If you repeat a measurement several times on
the same object over the period of measurement,
you may get a series of readings that differ from
each other. The cause may be small differences in
how you use the instrument each time. The
differences could also be due to random changes
in the instrument, and they could be due to small
changes in the object you are measuring. Whatever
the cause, you would be inclined to take the
average of the readings as the best value you can
quote or use. You can get an idea of the
variability from the standard deviation of the
readings. The bigger the standard deviation
(variance) is, the less precise the readings are,
and vice versa.
22Measures of position
- How do we compare two data values from different
groups?
- 1. Five students have taken different forms of
the spelling test. The difficulty level of the
different forms could differ. How can we
compare their performance?
- 2. A training program accepts only the top 25%
of students according to a standard test. What grade
does a student need to make to be in the top 25%?
23- Measures of position are used to determine the
relative position of a specified data value
within a group of data values.
- Percentile
- Quantile
- z-score
24Percentiles and quartiles
- Percentiles divide the data values into 100 equal
groups. The pth percentile of a distribution is the
value such that p percent of the data values fall at
or below it.
- The median is the 50th percentile.
- The first quartile Q1 (lower quartile)
is the 25th percentile, and the third quartile Q3
(upper quartile) is the 75th percentile. The
median is the second quartile, Q2. Quartiles divide
the distribution into four equal groups,
separated by Q1, Q2, and Q3.
25- To calculate the quartiles
- 1. Arrange the data values in increasing
order and locate the median.
- 2. Use the median to divide the ordered data
values into two halves. Do not include the median
in either half.
- 3. The lower quartile (Q1) is the median of
the lower half of the data. The upper quartile
(Q3) is the median of the upper half of the data.
26- Example: Suppose a group of 10 students have the
following heights (in inches): 60, 72, 64, 67, 70,
68, 71, 68, 73, 59. Find Q1, Q3, and the IQR.
- 1. Sort the data from smallest to largest:
59, 60, 64, 67, 68, 68, 70, 71, 72, 73.
- 2. Divide the observations into a lower
half and an upper half at the median:
59, 60, 64, 67, 68 | M | 68, 70, 71, 72, 73.
- 3. The first half of the data is used to
calculate the first quartile, and the second half
is used to calculate the third quartile. Each
quartile is the median of its half.
- First half: 59, 60, 64, 67, 68, so Q1 = 64
- Second half: 68, 70, 71, 72, 73, so Q3 = 71
- IQR = Q3 - Q1 = 71 - 64 = 7
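The median-of-halves procedure can be sketched as follows. The helper names `median_of_sorted` and `quartiles` are hypothetical. Statistical software often uses other quartile conventions (for example, interpolation between values), so library results may differ slightly from this method.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the median is a data value and is left out of both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


heights = [60, 72, 64, 67, 70, 68, 71, 68, 73, 59]
q1, med, q3 = quartiles(heights)
iqr = q3 - q1
print(q1, med, q3, iqr)  # Q1 = 64, median = 68, Q3 = 71, IQR = 7
```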
27- Related Measures
- Midquartile
- midquartile = (Q1 + Q3) / 2
- The Interquartile Range
- IQR = Q3 - Q1
- The IQR is essentially the range of the middle
50% of the data.
28Standardized values or z-scores
- If x is an observation (data value) from a
population (a group of data) that has mean µ and
standard deviation σ, the z-score (standardized
value) of x is
z = (x - µ) / σ
- The z-score for a data value indicates how far,
and in what direction, that value deviates from
the mean, expressed in units of the standard
deviation.
29- Example
- 1. Find the z-score corresponding to a raw
score of 132 from a group of data with mean 100
and standard deviation 15.
- 2. The lengths of an adult South American rain
forest beetle species are distributed with mean
5.6 cm and standard deviation 0.32 cm. What is the
z-score for a beetle of length 5.1 cm?
- 3. A z-score of 1.7 was found for an
observation coming from a population with mean 14
and standard deviation 3. Find the value of the
observation.
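The three exercises can be worked with two small functions: one for the z-score, and one that inverts it to recover the raw value (x = mean + z × standard deviation). The function names `z_score` and `from_z_score` are illustrative assumptions.

```python
def z_score(x, mean, std_dev):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


def from_z_score(z, mean, std_dev):
    """Recover the raw value from its z-score."""
    return mean + z * std_dev


z1 = z_score(132, 100, 15)      # 32 / 15, about 2.13
z2 = z_score(5.1, 5.6, 0.32)    # -0.5 / 0.32, about -1.56
x3 = from_z_score(1.7, 14, 3)   # 14 + 5.1, about 19.1

print(round(z1, 2), round(z2, 2), round(x3, 1))
```

The negative z-score in the second exercise says the beetle is shorter than average, by roughly one and a half standard deviations.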
30- The z-score transformation is especially useful
when seeking to compare the relative standings of
items from distributions with different means
and/or different standard deviations.
- Five students have taken different forms of the
spelling test. The scores on the different forms are
distributed with different means and standard
deviations. How can we compare their performance?
31The five-number summary and Boxplot
- Five-number summary
- Minimum, Q1, Median, Q3, Maximum.
- These five numbers offer a reasonably
complete description of the distribution.
- Boxplot (box-and-whiskers display)
- Visual version of the five-number summary. The
five-number summary is easier to understand when
it is displayed in a graph.
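As an illustration, the five-number summary can be computed for the ski rental data from the earlier examples. This sketch assumes the median-of-halves quartile method from this chapter, and the function names are hypothetical.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the median itself is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper), ordered[-1])


rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
summary = five_number_summary(rentals)
print(summary)  # minimum 38, Q1 40, median 45, Q3 50, maximum 96
```

In a boxplot of these data, the upper whisker (50 to 96) would be far longer than the lower one (38 to 40), a visual sign that the distribution is skewed to the right.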
32- Here are the four steps you follow to draw a
boxplot
- 1. Draw a box from the 25th percentile (Q1) to the
75th percentile (Q3).
- 2. Split the box with a line at the median.
- 3. Draw a thin line from the 75th
percentile up to the maximum value.
- 4. Draw another thin line from the 25th
percentile down to the minimum value.
33Information obtained from a Boxplot
__________
_________
___________
Symmetry versus Skewness
34- Example: The following boxplot is of the birth
weights (in ounces) of a sample of 160 infants
born in a local hospital.
- 1. Describe the shape of the distribution.
- 2. Find the five-number summary.
- 3. About 40 of the infants had birth weights below ____?
- 4. IQR = ?