Numerical Descriptive Techniques - PowerPoint PPT Presentation

About This Presentation

Title:

Numerical Descriptive Techniques

Description:

Title: Numerical Descriptive Measures Author: sbae Last modified by: JONG-MIN KIM Created Date: 3/31/1999 2:22:02 AM Document presentation format – PowerPoint PPT presentation

Slides: 68

Provided by: sbae9

Learn more at: https://facultypages.morris.umn.edu

Transcript and Presenter's Notes

Title: Numerical Descriptive Techniques

1
Numerical Descriptive Techniques

Chapter 4

2
4.2 Measures of Central Location

Usually, we focus our attention on two types of
measures when describing population
characteristics:
Central location (e.g. average)
Variability or spread

The measure of central location reflects the
locations of all the actual data points.
3
4.2 Measures of Central Location

The measure of central location reflects the
locations of all the actual data points.
How?

With two data points, the central location
should fall in the middle between them (in order
to reflect the location of both of them).
But if the third data point appears on the left
hand-side of the midrange, it should pull the
central location to the left.
4
The Arithmetic Mean

This is the most popular and useful measure of
central location

5
The Arithmetic Mean
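Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size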
6
The Arithmetic Mean
The arithmetic mean

Example 4.1

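The reported times on the Internet of 10 adults
are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find
the mean time on the Internet.
$\bar{x} = \frac{0 + 7 + \cdots + 22}{10} = \frac{110}{10} = 11.0$
[A second calculation on this slide: values 42.19, 38.45, ..., 45.77; mean 43.59]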
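
The calculation in Example 4.1 can be checked with a minimal Python sketch. The function name `arithmetic_mean` is chosen here for illustration and does not come from the slides.

```python
# Hours on the Internet reported by the 10 adults in Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]


def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    return sum(values) / len(values)


mean_hours = arithmetic_mean(hours)
print(mean_hours)  # 11.0
```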
7
The Median

The Median of a set of observations is the value
that falls in the middle when the observations
are arranged in order of magnitude.

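Odd number of observations:
0, 0, 5, 7, 8, 9, 12, 14, 22
The median is the middle (fifth) observation, 8.
Even number of observations:
0, 0, 5, 7, 8, 9, 12, 14, 22, 33
The median is the midpoint of the two middle observations, (8 + 9)/2 = 8.5.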
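
The two cases can be sketched in Python as follows. The helper `median` is written out by hand for illustration; it is not taken from the slides.

```python
def median(values):
    """Middle value of the sorted data; midpoint of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_set = [0, 0, 5, 7, 8, 9, 12, 14, 22]        # 9 observations
even_set = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]   # 10 observations

median_odd = median(odd_set)    # 8
median_even = median(even_set)  # 8.5
```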
8
The Mode

The Mode of a set of observations is the value
that occurs most frequently.
A set of data may have one mode (or modal class),
or two or more modes.

For large data sets the modal class is much more
relevant than a single-value mode.
The modal class
9
The Mode

Example 4.5: Find the mode for the data in Example
4.1. Here are the data again: 0, 7, 12, 5, 33,
14, 8, 0, 9, 22
Solution
All observations except 0 occur once. There are
two 0s. Thus, the mode is zero.
Is this a good measure of central location?
The value 0 does not reside at the center of
this set (compare with the mean, 11.0, and the
median, 8.5).
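
A minimal Python sketch of the mode calculation, returning every value tied for the highest frequency so that data sets with two or more modes are handled. The function name `modes` is an illustrative choice.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most frequently, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]  # data of Example 4.1
hours_modes = modes(hours)                  # [0]
two_modes = modes([1, 1, 2, 2, 3])          # [1, 2], a made-up bimodal set
```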

10
Relationship among Mean, Median, and Mode

If a distribution is symmetrical, the mean,
median and mode coincide

If a distribution is asymmetrical, and skewed to
the left or to the right, the three measures
differ.

A positively skewed distribution (skewed to the
right): Mode < Median < Mean
11
Relationship among Mean, Median, and Mode

If a distribution is symmetrical, the mean,
median and mode coincide

If a distribution is asymmetrical, and skewed
to the left or to the right, the three measures
differ.

A positively skewed distribution (skewed to the
right): Mode < Median < Mean
A negatively skewed distribution (skewed to the
left): Mean < Median < Mode
12
The Geometric Mean

This is a measure of the average growth rate.
Let $R_i$ denote the rate of return in period $i$
($i = 1, 2, \ldots, n$). The geometric mean of the returns
$R_1, R_2, \ldots, R_n$ is the constant $R_g$ that produces the
same terminal wealth at the end of period $n$ as do
the actual returns for the $n$ periods.

13
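The Geometric Mean
For the given series of rates of return, the value at the
end of the nth period (per unit invested) is calculated by
$(1 + R_1)(1 + R_2) \cdots (1 + R_n)$
If the rate of return was $R_g$ in every period, the
value at the end of the nth period would be calculated by
$(1 + R_g)^n$

$R_g$ is selected such that
$(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n)$, that is, $R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$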
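
The definition can be illustrated with a short Python sketch. The two returns used below (+100% followed by -50%) are hypothetical values picked for illustration; they do not appear on the slides.

```python
def geometric_mean_return(returns):
    """Constant per-period rate that yields the same terminal wealth as the actual returns."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r          # compound the actual returns
    return growth ** (1.0 / len(returns)) - 1.0


# Hypothetical returns: the investment doubles, then loses half its value.
returns = [1.0, -0.5]

arithmetic = sum(returns) / len(returns)    # 0.25, suggests a 25% average gain
geometric = geometric_mean_return(returns)  # 0.0, the investment ends where it started
```

The arithmetic mean overstates the growth here, which is why the geometric mean is the appropriate average for growth rates.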
14
4.3 Measures of variability

Measures of central location fail to tell the
whole story about the distribution.
A question of interest still remains unanswered:

How much are the observations spread out around
the mean value?
15
4.3 Measures of variability
Observe two hypothetical data sets
Small variability
The average value provides a good representation
of the observations in the data set.
This data set is now changing to...
16
4.3 Measures of variability
Observe two hypothetical data sets
Small variability
The average value provides a good representation
of the observations in the data set.
Larger variability
The same average value does not provide as good
a representation of the observations in the data
set as before.
17
The range

The range of a set of observations is the
difference between the largest and smallest
observations.
Its major advantage is the ease with which it can
be computed.
Its major shortcoming is its failure to provide
information on the dispersion of the observations
between the two end points.

But how do all the observations spread out?
The range cannot assist in answering this question.
Range = Largest observation - Smallest observation
18
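The Variance
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$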
19
Why not use the sum of deviations?
Consider two small populations:
Population A: 8, 9, 10, 11, 12
Deviations from the mean: 8 - 10 = -2, 9 - 10 = -1, 10 - 10 = 0, 11 - 10 = +1, 12 - 10 = +2
Population B: 4, 7, 10, 13, 16
Deviations from the mean: 4 - 10 = -6, 7 - 10 = -3, 10 - 10 = 0, 13 - 10 = +3, 16 - 10 = +6
The mean of both populations is 10,
but measurements in B are more dispersed than
those in A.
A measure of dispersion should agree with this
observation.
Can the sum of deviations be a good measure of
dispersion?
The sum of deviations is zero for both
populations; therefore, it is not a good measure of
dispersion.
20
The Variance
Let us calculate the variance of the two
populations
Why is the variance defined as the average
squared deviation? Why not use the sum of squared
deviations as a measure of variation instead?
After all, the sum of squared deviations
increases in magnitude when the variation of a
data set increases!!
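
For the two small populations introduced earlier (A: 8, 9, 10, 11, 12 and B: 4, 7, 10, 13, 16), a minimal Python sketch shows that the sum of deviations is zero for both while the population variance separates them. The function names are illustrative.

```python
population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]


def sum_of_deviations(values):
    """Sum of (x - mean); always zero, so it cannot measure dispersion."""
    mu = sum(values) / len(values)
    return sum(x - mu for x in values)


def population_variance(values):
    """Average squared deviation from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


dev_a = sum_of_deviations(population_a)    # 0.0
dev_b = sum_of_deviations(population_b)    # 0.0
var_a = population_variance(population_a)  # 10 / 5 = 2.0
var_b = population_variance(population_b)  # 90 / 5 = 18.0
```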
21
The Variance
Let us calculate the sum of squared deviations
for both data sets
Which data set has a larger dispersion?
Data set B is more dispersed around the mean
[Figure: data sets A and B plotted on number lines (axis labels 1, 2, 3, 5)]
22
The Variance
SumA > SumB. This is inconsistent with the
observation that set B is more dispersed.
[Figure: data sets A and B plotted on number lines]
23
The Variance
However, when calculated on a per-observation
basis (variance), the data set dispersions are
properly ranked.
$\sigma_A^2 = \text{Sum}_A / N = 10/5 = 2$
$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$
[Figure: data sets A and B plotted on number lines]
24
The Variance

Example 4.7
The following sample consists of the number of
jobs six students applied for: 17, 15, 23, 7, 9,
13. Find its mean and variance.
Solution
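
The solution can be reproduced with a minimal Python sketch. It computes the sample variance (divisor n - 1) twice: once from the definition and once with the shortcut form $s^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right]$, which is assumed here to be the shortcut method named on the next slide.

```python
jobs = [17, 15, 23, 7, 9, 13]  # jobs applied for by the six students
n = len(jobs)

mean_jobs = sum(jobs) / n  # 84 / 6 = 14.0

# Definition: sum of squared deviations from the sample mean, divided by n - 1.
s2_definition = sum((x - mean_jobs) ** 2 for x in jobs) / (n - 1)

# Shortcut: uses only the sum of the values and the sum of their squares.
s2_shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)

print(mean_jobs, s2_definition, s2_shortcut)
```

Both routes give 166/5 = 33.2 for the variance.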

25
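The Variance: Shortcut method
$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$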
26
Standard Deviation

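The standard deviation of a set of observations
is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$; population standard deviation: $\sigma = \sqrt{\sigma^2}$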

27
Standard Deviation

Example 4.8
To examine the consistency of shots for a new
innovative golf club, a golfer was asked to hit
150 shots, 75 with a currently used (7-iron)
club, and 75 with the new club.
The distances were recorded.
Which 7-iron is more consistent?

28
Standard Deviation
The Standard Deviation

Example 4.8 solution

Excel printout, from the Descriptive
Statistics sub-menu.
The new (innovative) club is more consistent and,
because the means are close, is considered the
better club.
29
Interpreting Standard Deviation

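The standard deviation can be used to
compare the variability of several distributions
make a statement about the general shape of a
distribution.
The empirical rule: If a sample of observations
has a mound-shaped distribution, the interval
$(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
$(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
$(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements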

30
Interpreting Standard Deviation

Example 4.9: A statistics practitioner wants to
describe the way returns on investment are
distributed.
The mean return = 10%
The standard deviation of the return = 8%
The histogram is bell shaped.

31
Interpreting Standard Deviation

Example 4.9 solution
The empirical rule can be applied (bell-shaped
histogram).
Describing the return distribution:
Approximately 68% of the returns lie between 2%
and 18% [10 - 1(8), 10 + 1(8)]
Approximately 95% of the returns lie between -6%
and 26% [10 - 2(8), 10 + 2(8)]
Approximately 99.7% of the returns lie between
-14% and 34% [10 - 3(8), 10 + 3(8)]
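
The three intervals of Example 4.9 can be generated with a minimal Python sketch. The function name and the dictionary layout are illustrative choices; the percentages are the approximate figures quoted by the empirical rule and apply only to mound-shaped data.

```python
def empirical_rule_intervals(mean, sd):
    """Map k = 1, 2, 3 to (lower, upper, approximate % of observations inside)."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mean - k * sd, mean + k * sd, pct) for k, pct in coverage.items()}


# Example 4.9: mean return 10%, standard deviation 8%.
intervals = empirical_rule_intervals(10, 8)
for k, (low, high, pct) in intervals.items():
    print(f"about {pct}% of returns between {low}% and {high}%")
```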

32
Chebysheff's Theorem

The proportion of observations in any sample that
lie within $k$ standard deviations of the mean is
at least $1 - 1/k^2$ for $k > 1$.
This theorem is valid for any set of measurements
(sample, population) of any shape!

k   Interval                            Chebysheff                      Empirical Rule
1   $(\bar{x} - s, \bar{x} + s)$        at least 0%   ($1 - 1/1^2$)     approximately 68%
2   $(\bar{x} - 2s, \bar{x} + 2s)$      at least 75%  ($1 - 1/2^2$)     approximately 95%
3   $(\bar{x} - 3s, \bar{x} + 3s)$      at least 89%  ($1 - 1/3^2$)     approximately 99.7%
33
Chebysheff's Theorem

Example 4.10
The annual salaries of the employees of a chain
of computer stores produced a positively skewed
histogram. The mean and standard deviation are
28,000 and 3,000, respectively. What can you say
about the salaries at this chain?
Solution: At least 75% of the salaries lie
between 22,000 and 34,000
[28,000 - 2(3,000), 28,000 + 2(3,000)]
At least 88.9% of the salaries lie between
19,000 and 37,000
[28,000 - 3(3,000), 28,000 + 3(3,000)]
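
A minimal Python sketch of the bound used in Example 4.10. The function names are illustrative; values of k below 1 are rejected because the bound would be negative and says nothing.

```python
def chebysheff_bound(k):
    """Lower bound on the proportion of observations within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound 1 - 1/k^2 is only informative for k > 1")
    return 1 - 1 / k ** 2


def k_sigma_interval(mean, sd, k):
    """Interval reaching k standard deviations on either side of the mean."""
    return mean - k * sd, mean + k * sd


# Example 4.10: mean salary 28,000, standard deviation 3,000.
bound_2 = chebysheff_bound(2)                   # 0.75
bound_3 = chebysheff_bound(3)                   # 0.888...
interval_2 = k_sigma_interval(28000, 3000, 2)   # (22000, 34000)
interval_3 = k_sigma_interval(28000, 3000, 3)   # (19000, 37000)
```

Because the histogram is positively skewed, the empirical rule does not apply and only these weaker guarantees can be stated.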

34
The Coefficient of Variation

The coefficient of variation of a set of
measurements is the standard deviation divided by
the mean value.
This coefficient provides a proportionate measure
of variation.

A standard deviation of 10 may be perceived as large
when the mean value is 100, but only as moderately
large when the mean value is 500.
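
The comparison can be made concrete with a short Python sketch; the function name is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a proportion of the mean."""
    return sd / mean


cv_mean_100 = coefficient_of_variation(10, 100)  # 0.1, i.e. 10% of the mean
cv_mean_500 = coefficient_of_variation(10, 500)  # 0.02, i.e. 2% of the mean
```

The same standard deviation of 10 is five times larger, relative to the mean, in the first case.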
35
4.4 Measures of Relative Standing and Box Plots

Percentile
The pth percentile of a set of measurements is
the value for which
p percent of the observations are less than that
value
(100 - p) percent of all the observations are
greater than that value.
Example
Suppose your score is the 60th percentile of an SAT
test. Then 60% of all the scores lie below your
score and 40% lie above it.
36
Quartiles

Commonly used percentiles:
First (lower) decile = 10th percentile
First (lower) quartile, Q1 = 25th percentile
Second (middle) quartile, Q2 = 50th percentile
Third quartile, Q3 = 75th percentile
Ninth (upper) decile = 90th percentile

37
Quartiles

Example
Find the quartiles of the following set of
measurements: 7, 8, 12, 17, 29, 18, 4, 27, 30, 2,
4, 10, 21, 5, 8

38
Quartiles

Solution
Sort the observations:
2, 4, 4, 5, 7, 8, 8, 10, 12, 17, 18, 21, 27, 29,
30

The first quartile
At most (.25)(15) = 3.75 observations should
appear below the first quartile. Check the first
3 observations on the left-hand side.
At most (.75)(15) = 11.25 observations should
appear above the first quartile. Check 11
observations on the right-hand side.
The one observation left unchecked, 5, is the first quartile.
Comment: If the number of observations is even,
two observations remain unchecked. In this case
choose the midpoint between these two
observations.
39
Location of Percentiles

Find the location of any percentile using the
formula
$L_P = (n + 1)\frac{P}{100}$, where $L_P$ is the location of the Pth percentile
Example 4.11
Calculate the 25th, 50th, and 75th percentiles of
the data in Example 4.1

40
Location of Percentiles

Example 4.11 solution
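After sorting the data we have 0, 0, 5, 7, 8, 9,
12, 14, 22, 33.
The location of the 25th percentile is $L_{25} = (10 + 1)\frac{25}{100} = 2.75$,
so the 25th percentile is three quarters of the distance between the
second and third observations, that is, 0 + .75(5 - 0) = 3.75.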

41
Location of Percentiles

Example 4.11 solution continued
The 50th percentile is halfway between the fifth
and sixth observations (in the middle between 8
and 9), that is 8.5.

42
Location of Percentiles

Example 4.11 solution continued
The 75th percentile is one quarter of the
distance between the eighth observation (14) and
the ninth observation (22), that is, 14 + .25(22 - 14) = 16.
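
The three percentiles of Example 4.11 can be reproduced with a minimal Python sketch. It assumes the location formula $L_P = (n + 1)\frac{P}{100}$ with linear interpolation between neighbouring observations, which matches the results quoted in the example; other software may use a different convention and return slightly different values.

```python
def percentile(values, p):
    """Pth percentile via location (n + 1) * P / 100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100
    whole = int(location)          # position of the lower neighbour (1-based)
    fraction = location - whole    # how far to move toward the upper neighbour
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + fraction * (upper - lower)


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]  # data of Example 4.1

p25 = percentile(hours, 25)  # location 2.75 -> 3.75
p50 = percentile(hours, 50)  # location 5.5  -> 8.5
p75 = percentile(hours, 75)  # location 8.25 -> 16.0
```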
43
Quartiles and Variability

Quartiles can provide an idea about the shape of
a histogram

[Figure: positions of Q1, Q2, and Q3 under a positively skewed histogram and under a negatively skewed histogram]
44
Interquartile Range

This is a measure of the spread of the middle 50%
of the observations.
A large value indicates a large spread of the
observations.

Interquartile range = Q3 - Q1
45
Box Plot

This is a pictorial display that provides the
main descriptive measures of the data set
L - the largest observation
Q3 - The upper quartile
Q2 - The median
Q1 - The lower quartile
S - The smallest observation

[Figure: box plot marking, from left to right, S, Q1, Q2, Q3, and L]
46
Box Plot

Example 4.14 (Xm02-01)

Left-hand boundary = 9.275 - 1.5(IQR) = -104.226
Right-hand boundary = 84.9425 + 1.5(IQR) = 198.4438
[Box plot labels: 0, 9.275, 26.905, 84.9425, 119.63; boundaries at -104.226 and 198.4438]
No outliers are found
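
The boundary calculation can be sketched in Python as follows. The function names are illustrative. The values 0 and 119.63 appear among the plot labels and are taken here to be the smallest and largest observations, which is an assumption.

```python
def box_plot_boundaries(q1, q3, multiplier=1.5):
    """Boundaries at q1 - 1.5(IQR) and q3 + 1.5(IQR); points outside are outliers."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


def find_outliers(values, q1, q3, multiplier=1.5):
    """Return the observations that fall outside the box plot boundaries."""
    low, high = box_plot_boundaries(q1, q3, multiplier)
    return [v for v in values if v < low or v > high]


# Example 4.14: Q1 = 9.275 and Q3 = 84.9425.
low, high = box_plot_boundaries(9.275, 84.9425)

# Assumed extremes of the data set (taken from the plot labels).
flagged = find_outliers([0, 119.63], 9.275, 84.9425)
```

Both assumed extremes fall inside the boundaries, consistent with the statement that no outliers are found.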
47
Box Plot

Additional Example - GMAT scores
Create a box plot for the data regarding the
GMAT scores of 200 applicants (see GMAT.XLS)

[Box plot labels: 449, 512, 537, 575, 788]
Lower boundary: 512 - 1.5(IQR) = 417.5
Upper boundary: 575 + 1.5(IQR) = 669.5
48
Box Plot
GMAT - continued
Q1 = 512, Q2 = 537, Q3 = 575
[Box plot labels: 449 and 669.5; 25%, 50%, and 25% of the scores in the three segments]

Interpreting the box plot results
The scores range from 449 to 788.
About half the scores are smaller than 537, and
about half are larger than 537.
About half the scores lie between 512 and 575.
About a quarter lies below 512 and a quarter
above 575.

49
Box Plot
GMAT - continued
The histogram is positively skewed
Q1 = 512, Q2 = 537, Q3 = 575
[Box plot labels: 449 and 669.5; 25%, 50%, and 25% of the scores in the three segments]
50
Box Plot

Example 4.15 (Xm04-15)
A study was organized to compare the quality of
service in five drive-through restaurants.
Interpret the results
Example 4.15 solution
Minitab box plot

51
Box Plot
[Box plots, one per restaurant: Jack in the Box, Hardee's, McDonald's, Wendy's, Popeyes]
Jack in the Box is the slowest in service.
Hardee's service time variability is the largest.
Wendy's service time appears to be the shortest
and most consistent.
53
4.5 Measures of Linear Relationship

The covariance and the coefficient of correlation
are used to measure the direction and strength of
the linear relationship between two variables.
Covariance - is there any pattern to the way two
variables move together?
Coefficient of correlation - how strong is the
linear relationship between two variables?

54
Covariance
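Population covariance: $\text{COV}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
$\mu_x$ ($\mu_y$) is the population mean of the variable X
(Y). N is the population size.
Sample covariance: $\text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$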
55
Covariance

Compare the following three sets

Set 1
x_i   y_i   (x_i - x̄)   (y_i - ȳ)   (x_i - x̄)(y_i - ȳ)
2     13    -3          -7          21
6     20     1           0           0
7     27     2           7          14
x̄ = 5, ȳ = 20, Cov(x,y) = 17.5

Set 2
x_i   y_i
2     20
6     27
7     13
x̄ = 5, ȳ = 20, Cov(x,y) = -3.5

Set 3
x_i   y_i   (x_i - x̄)   (y_i - ȳ)   (x_i - x̄)(y_i - ȳ)
2     27    -3           7          -21
6     20     1           0            0
7     13     2          -7          -14
x̄ = 5, ȳ = 20, Cov(x,y) = -17.5
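
The three covariances can be checked with a minimal Python sketch. It uses the sample form (divisor n - 1), which is the form that reproduces the values shown for these sets; the function name is illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [2, 6, 7]

cov_set_1 = sample_covariance(x, [13, 20, 27])  # y rises with x:  17.5
cov_set_2 = sample_covariance(x, [20, 27, 13])  # no clear pattern: -3.5
cov_set_3 = sample_covariance(x, [27, 20, 13])  # y falls as x rises: -17.5
```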
56
Covariance

If the two variables move in the same direction,
(both increase or both decrease), the covariance
is a large positive number.

If the two variables move in opposite directions,
(one increases when the other one decreases), the
covariance is a large negative number.
If the two variables are unrelated, the
covariance will be close to zero.

57
The coefficient of correlation

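This coefficient answers the question: How strong
is the association between X and Y?
Population coefficient of correlation: $\rho = \frac{\text{COV}(X, Y)}{\sigma_x \sigma_y}$
Sample coefficient of correlation: $r = \frac{\text{cov}(x, y)}{s_x s_y}$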

58
The coefficient of correlation
$\rho$ or $r$ = +1: strong positive linear relationship, COV(X,Y) > 0
$\rho$ or $r$ = 0: no linear relationship, COV(X,Y) = 0
$\rho$ or $r$ = -1: strong negative linear relationship, COV(X,Y) < 0
59
The coefficient of correlation

If the two variables are very strongly positively
related, the coefficient value is close to 1
(strong positive linear relationship).
If the two variables are very strongly negatively
related, the coefficient value is close to -1
(strong negative linear relationship).
No straight line relationship is indicated by a
coefficient close to zero.

60
The coefficient of correlation and the
covariance Example 4.16

Compute the covariance and the coefficient of
correlation to measure how GMAT scores and GPA in
an MBA program are related to one another.
Solution
We believe GMAT affects GPA. Thus
GMAT is labeled X
GPA is labeled Y

61
The coefficient of correlation and the
covariance Example 4.16
Student   x       y       x^2         y^2      xy
1         599     9.6     358801      92.16    5750.4
2         689     8.8     474721      77.44    6063.2
3         584     7.4     341056      54.76    4321.6
4         631     10      398161      100      6310
...
11        593     8.8     351649      77.44    5218.4
12        683     8       466489      64       5464
Total     7,587   106.4   4,817,755   957.2    67,559.2

$\text{cov}(x, y) = \frac{1}{12 - 1}\left[67{,}559.2 - \frac{(7587)(106.4)}{12}\right] = 26.16$
$s_x = \sqrt{\frac{1}{12 - 1}\left[4{,}817{,}755 - \frac{(7587)^2}{12}\right]} = 43.56$
$s_y = 1.12$ (calculated in the same way as $s_x$)
$r = \frac{\text{cov}(x, y)}{s_x s_y} = \frac{26.16}{(43.56)(1.12)} = .5362$
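
The calculation from the column totals can be reproduced with a minimal Python sketch. Only the totals of the table are used, through the shortcut formulas; the variable names are illustrative.

```python
from math import sqrt

# Column totals for the 12 students of Example 4.16.
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2, sum_xy = 4817755, 957.2, 67559.2

# Shortcut formulas for the sample covariance and standard deviations.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
s_x = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_y = sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))

# Sample coefficient of correlation.
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 4))
```

Carrying full precision through the calculation gives r of about .5365; rounding the intermediate values to 43.56 and 1.12 first gives the slightly different .5362.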
62
The coefficient of correlation and the
covariance Example 4.16 Excel

Use the Covariance option in Data Analysis
If your version of Excel returns the population
covariance and variances, multiply each one by
n/(n - 1) to obtain the corresponding sample values.
Use the Correlation option to produce the
correlation matrix.

Variance-Covariance Matrix (population values)
        GPA      GMAT
GPA     1.15
GMAT    23.98    1739.52

Variance-Covariance Matrix (sample values)
        GPA      GMAT
GPA     1.25
GMAT    26.16    1897.66
63
The coefficient of correlation and the
covariance Example 4.16 Excel

Interpretation
The covariance (26.16) indicates that GMAT score
and performance in the MBA program are positively
related.
The coefficient of correlation (.5365) indicates
that there is a moderately strong positive linear
relationship between GMAT and MBA GPA.

64
The Least Squares Method

We are seeking a line that best fits the data
when two variables are (presumably) related to
one another.
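We define the best-fit line as the line for which the
sum of squared differences between it and the
data points is minimized:
Minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
where $y_i$ is the actual y value of point i and
$\hat{y}_i = b_0 + b_1 x_i$ is the y value of point i calculated from the
equation.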
65
The least Squares Method
[Figure: scatter diagram of Y against X with several candidate lines]
Different lines generate different errors, and thus
different sums of squared errors.
There is a line that minimizes the sum of squared
errors.
66
The least Squares Method
The coefficients b0 and b1 of the line that
minimizes the sum of squares of errors are
calculated from the data.
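
A minimal Python sketch of that calculation, using the standard least squares results $b_1 = \frac{\text{cov}(x, y)}{s_x^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$. The data are the first (x, y) set from the covariance slide, reused here purely for illustration; the function names are not from the slides.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) of the line that minimizes the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx             # same as cov(x, y) / s_x^2; the (n - 1) factors cancel
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared differences between actual y and the line's y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


x = [2, 6, 7]
y = [13, 20, 27]

b0, b1 = least_squares_line(x, y)                     # 7.5 and 2.5
best_sse = sum_squared_errors(x, y, b0, b1)           # 10.5
other_sse = sum_squared_errors(x, y, b0 + 1.0, b1)    # a shifted line fits worse
```

Any other line, such as the shifted one above, produces a larger sum of squared errors.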
67
The Least Squares Method