Measure of Variability (Dispersion, Spread) - PowerPoint PPT Presentation

Slides: 149
Provided by: lave9
Transcript and Presenter's Notes

Title: Measure of Variability (Dispersion, Spread)


1
Measure of Variability (Dispersion, Spread)
  1. Range
  2. Inter-Quartile Range
  3. Variance, standard deviation
  4. Pseudo-standard deviation

2
Measure of Central Location
  1. Mean
  2. Median

3
Range
  1. Range: R = max - min
  2. Inter-Quartile Range: IQR = Q3 - Q1

4
Example
  • The data, Verbal IQ for n = 23 students, arranged in
    increasing order are
  • 80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102
    104 105 105 109 111 118 119

Q3 = 105
Q2 = 96
Q1 = 89
min = 80
max = 119
5
Range and IQR
  • Range = max - min = 119 - 80 = 39
  • Inter-Quartile Range
  • IQR = Q3 - Q1 = 105 - 89 = 16
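
The range and IQR above can be reproduced with a short Python sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, with the middle value left out of both halves when n is odd; other quartile conventions can give slightly different values. The function name `quartiles` is illustrative only.

```python
from statistics import median

# Verbal IQ scores for the 23 students, already sorted.
verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
             99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]


def quartiles(sorted_data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: for odd n the middle observation is excluded
    from both the lower and the upper half.
    """
    n = len(sorted_data)
    half = n // 2
    lower = sorted_data[:half]
    upper = sorted_data[half + 1:] if n % 2 else sorted_data[half:]
    return median(lower), median(sorted_data), median(upper)


q1, q2, q3 = quartiles(verbal_iq)
data_range = max(verbal_iq) - min(verbal_iq)
iqr = q3 - q1
print(q1, q2, q3, data_range, iqr)  # prints: 89 96 105 39 16
```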

6
Sample Variance
  • Let x1, x2, x3, ..., xn denote a set of n numbers.
  • Recall the mean of the n numbers is defined as
    $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

7
  • The numbers
    $d_i = x_i - \bar{x}, \quad i = 1, \dots, n$
  • are called deviations from the mean

8
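  • The sum
    $\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • is called the sum of squares of deviations from
    the mean.
  • Writing it out in full
    $d_1^2 + d_2^2 + \dots + d_n^2$
  • or
    $(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2$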

9
The Sample Variance
  • Is defined as the quantity
    $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
  • and is denoted by the symbol $s^2$

10
The Sample Standard Deviation s
  • Definition: The Sample Standard Deviation is
    defined by
    $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
  • Hence the Sample Standard Deviation, s, is the
    square root of the sample variance.

11
  • Example
  • Let x1, x2, x3, x4, x5 denote the set of 5
    numbers in the following table.

    i     1    2    3    4    5
    xi   10   15   21    7   13

12
  • Then
    $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 10 + 15 + 21 + 7 + 13 = 66$
  • and
    $\bar{x} = \frac{66}{5} = 13.2$

13
  • The deviations from the mean d1, d2, d3, d4, d5
    (with $\bar{x} = 66/5 = 13.2$) are given in the
    following table.

    i      1      2      3      4      5
    xi    10     15     21      7     13
    di   -3.2    1.8    7.8   -6.2   -0.2

14
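  • The sum
    $\sum_{i=1}^{5} d_i^2 = (-3.2)^2 + (1.8)^2 + (7.8)^2 + (-6.2)^2 + (-0.2)^2 = 112.8$
  • and
    $s^2 = \frac{112.8}{4} = 28.2$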

15
  • Also the standard deviation is
    $s = \sqrt{28.2} \approx 5.31$
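
As a check on this worked example, here is a minimal Python sketch that follows the definitions step by step (mean, deviations, sum of squares, division by n - 1, square root). The variable names are illustrative; the final line cross-checks against the standard library's `statistics.stdev`, which also uses the n - 1 divisor.

```python
import math
import statistics

x = [10, 15, 21, 7, 13]
n = len(x)

mean = sum(x) / n                          # 66 / 5
deviations = [xi - mean for xi in x]       # d_i = x_i - mean
ss = sum(d ** 2 for d in deviations)       # sum of squared deviations
variance = ss / (n - 1)                    # sample variance s^2
s = math.sqrt(variance)                    # sample standard deviation

print(round(mean, 1), round(ss, 2), round(variance, 2), round(s, 2))

# Cross-check against the standard library (same n - 1 divisor).
assert math.isclose(s, statistics.stdev(x))
```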

16
Interpretations of s
  • In Normal distributions
  • Approximately 2/3 of the observations will lie
    within one standard deviation of the mean
  • Approximately 95% of the observations lie within
    two standard deviations of the mean
  • In a histogram of the Normal distribution, the
    standard deviation is approximately the distance
    from the mode to the inflection point

17
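[Figure: Normal curve; s is the distance from the mode to the inflection point]
18
[Figure: Normal curve; about 2/3 of the area lies within s on either side of the mean]
19
[Figure: Normal curve; the interval extending 2s on either side of the mean]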
20
Example
  • A researcher collected data on 1500 males aged
    60-65.
  • The variables measured were cholesterol and blood
    pressure.
  • The mean blood pressure was 155 with a standard
    deviation of 12.
  • The mean cholesterol level was 230 with a
    standard deviation of 15.
  • In both cases the data were normally distributed.

21
Interpretation of these numbers
  • Blood pressure levels vary about the value 155 in
    males aged 60-65.
  • Cholesterol levels vary about the value 230 in
    males aged 60-65.

22
  • 2/3 of males aged 60-65 have blood pressure
    within 12 of 155, i.e. between 155 - 12 = 143 and
    155 + 12 = 167.
  • 2/3 of males aged 60-65 have cholesterol within
    15 of 230, i.e. between 230 - 15 = 215 and
    230 + 15 = 245.

23
  • 95% of males aged 60-65 have blood pressure
    within 2(12) = 24 of 155, i.e. between 155 - 24 =
    131 and 155 + 24 = 179.
  • 95% of males aged 60-65 have cholesterol within
    2(15) = 30 of 230, i.e. between 230 - 30 = 200 and
    230 + 30 = 260.
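
The intervals above, and the "2/3" and "95%" rules of thumb behind them, can be checked with a small Python sketch. It assumes an exactly Normal distribution and uses the error function from the standard library; the helper names `normal_within` and `interval` are illustrative.

```python
import math


def normal_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def interval(mean, s, k):
    """Interval extending k standard deviations on either side of the mean."""
    return mean - k * s, mean + k * s


for name, mean, s in [("blood pressure", 155, 12), ("cholesterol", 230, 15)]:
    for k in (1, 2):
        low, high = interval(mean, s, k)
        print(f"{name}: about {normal_within(k):.1%} between {low} and {high}")
```

The exact Normal proportions are a little over 68% and a little over 95%, which is why "2/3" and "95%" are described as approximate.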

24
A Computing formula for s
  • Sum of squares of deviations from the mean
    $\sum_{i=1}^{n} (x_i - \bar{x})^2$
  • The difficulty with this formula is that $\bar{x}$
    will have many decimals.
  • The result will be that each term in the above
    sum will also have many decimals.

25
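  • The sum of squares of deviations from the
    mean can also be computed using the following
    identity
    $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$

26
  • To use this identity we need to compute
    $\sum_{i=1}^{n} x_i$ and $\sum_{i=1}^{n} x_i^2$

27
  • Then
    $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$ and $s = \sqrt{s^2}$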

28
(No Transcript)
29
Example
  • The data, Verbal IQ for n = 23 students, arranged in
    increasing order are
  • 80 82 84 86 86 89 90 94
  • 94 95 95 96 99 99 102 102
  • 104 105 105 109 111 118 119

30
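  • $\sum x_i$ = 80 + 82 + 84 + 86 + 86 + 89
    + 90 + 94 + 94 + 95 + 95 + 96 + 99 + 99
    + 102 + 102 + 104
    + 105 + 105 + 109 + 111 + 118 + 119 = 2244
  • $\sum x_i^2$ = 80^2 + 82^2 + 84^2 + 86^2 + 86^2 + 89^2
    + 90^2 + 94^2 + 94^2 + 95^2 + 95^2 + 96^2 + 99^2
    + 99^2 + 102^2 + 102^2 + 104^2
    + 105^2 + 105^2 + 109^2 + 111^2
    + 118^2 + 119^2 = 221494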

31
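  • Then
    $\sum (x_i - \bar{x})^2 = 221494 - \frac{(2244)^2}{23} = 2557.652$
    $s^2 = \frac{2557.652}{22} = 116.257$ and $s = \sqrt{116.257} = 10.782$

You will obtain exactly the same answer if you
use the left hand side of the equation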
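
A short Python sketch can confirm that the two sides of the computing identity agree for the Verbal IQ data. It computes the sum of squared deviations both ways (from the sums of x and x squared, and directly from the deviations) and then the standard deviation with the n - 1 divisor. Variable names are illustrative.

```python
import math

verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
             99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]
n = len(verbal_iq)

sum_x = sum(verbal_iq)                      # sum of the observations
sum_x2 = sum(x * x for x in verbal_iq)      # sum of the squared observations

# Right-hand side of the identity: needs only the two sums above.
ss_shortcut = sum_x2 - sum_x ** 2 / n

# Left-hand side: square each deviation from the mean and add.
mean = sum_x / n
ss_definition = sum((x - mean) ** 2 for x in verbal_iq)

s = math.sqrt(ss_shortcut / (n - 1))
print(sum_x, sum_x2, round(ss_shortcut, 3), round(ss_definition, 3), round(s, 3))
```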
32
(No Transcript)
33
(No Transcript)
34
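A quick (rough) calculation of s
  • $s \approx \frac{\text{Range}}{4}$
  • The reason for this is that approximately all
    (95%) of the observations are between
    $\bar{x} - 2s$
  • and
    $\bar{x} + 2s$
  • Thus
    $\text{Range} = \max - \min \approx (\bar{x} + 2s) - (\bar{x} - 2s) = 4s$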

35
Example
  • Verbal IQ on n = 23 students
  • min = 80 and max = 119
  • Rough estimate: $s \approx \frac{\text{Range}}{4} = \frac{119 - 80}{4} = 9.75$
  • This compares with the exact value of s which is
    10.782.
  • The rough method is useful for checking your
    calculation of s.
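
The rough check can be written as a one-line helper. This is only a sketch of the Range/4 rule of thumb; the function name `rough_sd` is illustrative, and the exact value 10.782 is the one quoted above.

```python
def rough_sd(minimum, maximum):
    """Rough estimate of s: the range divided by 4."""
    return (maximum - minimum) / 4


rough = rough_sd(80, 119)
exact = 10.782  # exact sample standard deviation quoted for the Verbal IQ data
print(rough, round(rough / exact, 2))
```

A ratio reasonably close to 1 suggests that the detailed calculation of s has no gross arithmetic error; it does not confirm the exact value.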

36
The Pseudo Standard Deviation (PSD)
  • Definition: The Pseudo Standard Deviation (PSD)
    is defined by
    $\text{PSD} = \frac{\text{IQR}}{1.35}$

37
Properties
  • For Normal distributions the magnitude of the
    pseudo standard deviation (PSD) and the standard
    deviation (s) will be approximately the same
    value
  • For leptokurtic distributions the standard
    deviation (s) will be larger than the pseudo
    standard deviation (PSD)
  • For platykurtic distributions the standard
    deviation (s) will be smaller than the pseudo
    standard deviation (PSD)

38
Example
  • Verbal IQ on n = 23 students
  • Inter-Quartile Range
  • IQR = Q3 - Q1 = 105 - 89 = 16
  • Pseudo standard deviation
    $\text{PSD} = \frac{\text{IQR}}{1.35} = \frac{16}{1.35} = 11.85$
  • This compares with the standard deviation s = 10.782
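
The sketch below computes a pseudo standard deviation for the Verbal IQ quartiles. It assumes the common definition PSD = IQR / 1.35; the divisor is an assumption here, motivated by the fact that the IQR of a Normal distribution is about 1.35 standard deviations, which the code checks with `statistics.NormalDist` (Python 3.8 or later).

```python
from statistics import NormalDist

PSD_DIVISOR = 1.35  # assumed constant: IQR of a Normal distribution in units of sigma


def pseudo_sd(q1, q3):
    """Pseudo standard deviation: the inter-quartile range rescaled to the sigma scale."""
    return (q3 - q1) / PSD_DIVISOR


# Width of the middle 50% of a standard Normal distribution.
normal_iqr_in_sd_units = 2 * NormalDist().inv_cdf(0.75)

psd = pseudo_sd(89, 105)
print(round(normal_iqr_in_sd_units, 3), round(psd, 2))
```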

39
  • An outlier is a wild observation in the data
  • Outliers occur because
  • of errors (typographical and computational)
  • Extreme cases in the population
  • We will now consider the drawing of box-plots
    where outliers are identified

40
Box-whisker Plots showing outliers
42
To Draw a Box Plot we need to
  • Compute the Hinge (Median, Q2) and the Mid-hinges
    (first and third quartiles, Q1 and Q3)
  • To identify outliers we will compute the inner
    and outer fences

43
  • The fences are like the fences at a prison. We
    expect the entire population to be within both
    sets of fences.
  • If a member of the population is between the
    inner and outer fences it is a mild outlier.
  • If a member of the population is outside of the
    outer fences it is an extreme outlier.

44
  • Lower outer fence
  • F1 = Q1 - (3)IQR
  • Upper outer fence
  • F2 = Q3 + (3)IQR

45
  • Lower inner fence
  • f1 = Q1 - (1.5)IQR
  • Upper inner fence
  • f2 = Q3 + (1.5)IQR

46
  • Observations that are between the lower and upper
    inner fences are considered to be non-outliers.
  • Observations that are outside the inner fences
    but not outside the outer fences are considered
    to be mild outliers.
  • Observations that are outside outer fences are
    considered to be extreme outliers.

47
  • mild outliers are plotted individually in a
    box-plot using the symbol
  • extreme outliers are plotted individually in a
    box-plot using the symbol
  • non-outliers are represented with the box and
    whiskers with
  • Max = largest observation within the fences
  • Min = smallest observation within the fences

48
[Figure: box-whisker plot representing the data that are not outliers, with the inner fences, an outer fence, mild outliers and an extreme outlier marked]
49
Example
  • Data collected on n = 109 countries in 1995.
  • Data collected on k = 25 variables.

50
The variables
  1. Population Size (in 1000s)
  2. Density = Number of people/sq kilometer
  3. Urban = percentage of population living in cities
  4. Religion
  5. lifeexpf = Average female life expectancy
  6. lifeexpm = Average male life expectancy

51
  7. literacy = % of population who can read
  8. pop_inc = increase in population size (1995)
  9. babymort = Infant mortality (deaths per 1000)
  10. gdp_cap = Gross domestic product/capita
  11. Region = Region or economic group
  12. calories = Daily calorie intake
  13. aids = Number of AIDS cases
  14. birth_rt = Birth rate per 1000 people

52
  15. death_rt = Death rate per 1000 people
  16. aids_rt = Number of AIDS cases/100000 people
  17. log_gdp = log10(gdp_cap)
  18. log_aidsr = log10(aids_rt)
  19. b_to_d = Birth to death ratio
  20. fertility = Average number of children in family
  21. log_pop = log10(population)

53
  22. cropgrow = ??
  23. lit_male = % of males who can read
  24. lit_fema = % of females who can read
  25. Climate = Predominant climate

54
The data file as it appears in SPSS
55
Consider the data on infant mortality
Stem-Leaf diagram: stem = 10s, leaf = unit digit
56
Summary Statistics
median = Q2 = 27
Quartiles:
Lower quartile = Q1 = the median of the lower half = 12
Upper quartile = Q3 = the median of the upper half = 66.5
Interquartile range (IQR): IQR = Q3 - Q1 = 66.5 - 12 = 54.5
57
The Outer Fences
lower = Q1 - 3(IQR) = 12 - 3(54.5) = -151.5
upper = Q3 + 3(IQR) = 66.5 + 3(54.5) = 230.0
No observations are outside of the outer fences
The Inner Fences
lower = Q1 - 1.5(IQR) = 12 - 1.5(54.5) = -69.75
upper = Q3 + 1.5(IQR) = 66.5 + 1.5(54.5) = 148.25
Only one observation (168, Afghanistan) is
outside of the inner fences (mild outlier)
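
The fence arithmetic and the outlier rule can be sketched in Python as follows. The quartiles Q1 = 12 and Q3 = 66.5 are the infant mortality values given above; the function names `fences` and `classify` are illustrative, and the value 240 in the last line is a hypothetical observation used only to show the extreme-outlier branch.

```python
def fences(q1, q3):
    """Inner fences at 1.5 IQR and outer fences at 3 IQR beyond the quartiles."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


def classify(value, q1, q3):
    """Label one observation as a non-outlier, mild outlier or extreme outlier."""
    f = fences(q1, q3)
    inner_low, inner_high = f["inner"]
    outer_low, outer_high = f["outer"]
    if value < outer_low or value > outer_high:
        return "extreme outlier"
    if value < inner_low or value > inner_high:
        return "mild outlier"
    return "non-outlier"


print(fences(12, 66.5))
print(classify(168, 12, 66.5))   # the Afghanistan value
print(classify(27, 12, 66.5))    # the median
print(classify(240, 12, 66.5))   # hypothetical value beyond the outer fence
```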
58
[Figure: Box-Whisker Plot of Infant Mortality]
59
Example 2
  • In this example we are looking at the weight
    gains (grams) for rats under six diets differing
    in level of protein (High or Low) and source of
    protein (Beef, Cereal, or Pork).
  • Ten test animals for each diet

60
Table: Gains in weight (grams) for rats under six
diets differing in level of protein (High or
Low) and source of protein (Beef, Cereal, or
Pork)
 
61
[Figure: box plots of weight gain for the six diets: High Protein (Beef, Cereal, Pork) and Low Protein (Beef, Cereal, Pork)]
62
Conclusions
  • Weight gain is higher for the high protein meat
    diets
  • Increasing the level of protein increases
    weight gain, but only if the source of protein is a
    meat source

63
Measures of Shape
64
Measures of Shape
  • Skewness: negatively skewed, symmetric, positively skewed
  • Kurtosis: leptokurtic, Normal (mesokurtic), platykurtic
65
  • Measure of Skewness based on the sum of cubes
  • Measure of Kurtosis based on the sum of 4th
    powers

66
  • The Measure of Skewness

67
  • The Measure of Kurtosis

The 3 is subtracted so that g2 is zero for the
normal distribution
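
The formulas for g1 and g2 did not survive in this transcript, so the following Python sketch uses one common moment-based form as an assumption: g1 = m3 / m2^(3/2) and g2 = m4 / m2^2 - 3, where mk is the average of the k-th powers of the deviations from the mean (divisor n). The original slides may use a slightly different normalization, for example one based on s with divisor n - 1. The function name and the small data sets are illustrative.

```python
def shape_measures(data):
    """Return (g1, g2): moment-based skewness and excess kurtosis (divisor n)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # average squared deviation
    m3 = sum((x - mean) ** 3 for x in data) / n   # average cubed deviation
    m4 = sum((x - mean) ** 4 for x in data) / n   # average 4th-power deviation
    g1 = m3 / m2 ** 1.5
    g2 = m4 / m2 ** 2 - 3   # 3 is subtracted so the Normal distribution gives 0
    return g1, g2


symmetric = [1, 2, 3, 4, 5]        # illustrative symmetric, flat data
right_tailed = [1, 1, 1, 2, 10]    # illustrative data with a long right tail

print(shape_measures(symmetric))
print(shape_measures(right_tailed))
```

The symmetric data give g1 equal to zero and a negative g2, while the long right tail produces a positive g1.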
68
Interpretations of Measures of Shape
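  • Skewness: g1 > 0 (positively skewed), g1 = 0 (symmetric), g1 < 0 (negatively skewed)
  • Kurtosis: g2 < 0 (platykurtic), g2 = 0 (Normal, mesokurtic), g2 > 0 (leptokurtic)
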
69
Descriptive techniques for Multivariate data
In most research situations data is collected on
more than one variable (usually many variables)
70
Graphical Techniques
  • The scatter plot
  • The two dimensional Histogram

71
The Scatter Plot
  • For two variables X and Y we will have a
    measurement for each variable on each case
  • (xi, yi)
  • xi = the value of X for case i
  • and
  • yi = the value of Y for case i.

72
  • To Construct a scatter plot we plot the points
  • (xi, yi)
  • for each case on the X-Y plane.

[Figure: the point (xi, yi) plotted at horizontal position xi and vertical position yi]
73
  Data Set 3: The following table gives data on
Verbal IQ, Math IQ, Initial Reading Achievement
Score, and Final Reading Achievement Score for 23
students who have recently completed a reading
improvement program

                            Initial       Final
           Verbal   Math    Reading       Reading
Student    IQ       IQ      Achievement   Achievement
   1        86       94     1.1           1.7
   2       104      103     1.5           1.7
   3        86       92     1.5           1.9
   4       105      100     2.0           2.0
   5       118      115     1.9           3.5
   6        96      102     1.4           2.4
   7        90       87     1.5           1.8
   8        95      100     1.4           2.0
   9       105       96     1.7           1.7
  10        84       80     1.6           1.7
  11        94       87     1.6           1.7
  12       119      116     1.7           3.1
  13        82       91     1.2           1.8
  14        80       93     1.0           1.7
  15       109      124     1.8           2.5
  16       111      119     1.4           3.0
  17        89       94     1.6           1.8
  18        99      117     1.6           2.6
  19        94       93     1.4           1.4
  20        99      110     1.4           2.0
  21        95       97     1.5           1.3
  22       102      104     1.7           3.1
  23       102       93     1.6           1.9
74
(No Transcript)
75
[Figure: scatter plot with the point (84, 80) marked]
76
(No Transcript)
77
Some Scatter Patterns
78
(No Transcript)
79
(No Transcript)
80
  • Circular
  • No relationship between X and Y
  • Unable to predict Y from X

81
(No Transcript)
82
(No Transcript)
83
  • Ellipsoidal
  • Positive relationship between X and Y
  • Increases in X correspond to increases in Y (but
    not always)
  • Major axis of the ellipse has positive slope

84
(No Transcript)
85
Example
  • Verbal IQ, Math IQ

86
(No Transcript)
87
Some More Patterns
88
(No Transcript)
89
(No Transcript)
90
  • Ellipsoidal (thinner ellipse)
  • Stronger positive relationship between X and Y
  • Increases in X correspond to increases in Y (more
    frequently)
  • Major axis of the ellipse has positive slope
  • Minor axis of the ellipse much smaller

91
(No Transcript)
92
  • Increased strength in the positive relationship
    between X and Y
  • Increases in X correspond to increases in Y
    (almost always)
  • Minor axis of the ellipse extremely small in
    relationship to the Major axis of the ellipse.

93
(No Transcript)
94
(No Transcript)
95
  • Perfect positive relationship between X and Y
  • Y perfectly predictable from X
  • Data falls exactly along a straight line with
    positive slope

96
(No Transcript)
97
(No Transcript)
98
  • Ellipsoidal
  • Negative relationship between X and Y
  • Increases in X correspond to decreases in Y (but
    not always)
  • Major axis of the ellipse has negative slope

99
(No Transcript)
100
  • The strength of the relationship can increase
    until changes in Y can be perfectly predicted
    from X

101
(No Transcript)
102
(No Transcript)
103
(No Transcript)
104
(No Transcript)
105
(No Transcript)
106
Some Non-Linear Patterns
107
(No Transcript)
108
(No Transcript)
109
  • In a Linear pattern Y increases with respect to X
    at a constant rate
  • In a Non-linear pattern the rate at which Y
    increases with respect to X is variable

110
Growth Patterns
111
(No Transcript)
112
(No Transcript)
113
  • Growth patterns frequently follow a sigmoid curve
  • Growth at the start is slow
  • It then speeds up
  • Slows down again as it reaches its limiting size

114
Measures of strength of a relationship
(Correlation)
  • Pearson's correlation coefficient (r)
  • Spearman's rank correlation coefficient (rho, ρ)

115
  • Assume that we have collected data on two
    variables X and Y. Let
  • (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn)
  • denote the pairs of measurements on the two
    variables X and Y for n cases in a sample (or
    population)

116
  • From this data we can compute summary statistics
    for each variable.
  • The means
    $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • and
    $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

117
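  • The standard deviations
    $s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
  • and
    $s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$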

118
  • These statistics (the means and the standard
    deviations)
  • give information for each variable separately
  • but
  • give no information about the relationship
    between the two variables

119
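  • Consider the statistics
    $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$
    $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
    $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$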

120
  • The first two statistics,
    $\sum_{i=1}^{n} (x_i - \bar{x})^2$ and $\sum_{i=1}^{n} (y_i - \bar{y})^2$,
  • are used to measure variability in each variable
  • they are used to compute the sample standard
    deviations

121
  • The third statistic,
    $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$,
  • is used to measure correlation
  • If two variables are positively related the sign
    of $(x_i - \bar{x})$
  • will agree with the sign of $(y_i - \bar{y})$

122
  • When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will be
    positive.
  • When xi is above its mean, yi will be above its
    mean
  • When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will be
    negative.
  • When xi is below its mean, yi will be below its
    mean
  • The product $(x_i - \bar{x})(y_i - \bar{y})$ will be
    positive for most cases.

123
  • This implies that the statistic
    $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
  • will be positive
  • Most of the terms in this sum will be positive

124
  • On the other hand
  • If two variables are negatively related the sign
    of $(x_i - \bar{x})$
  • will be opposite to the sign of $(y_i - \bar{y})$

125
  • When $(x_i - \bar{x})$ is positive, $(y_i - \bar{y})$ will be
    negative.
  • When xi is above its mean, yi will be below its
    mean
  • When $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will be
    positive.
  • When xi is below its mean, yi will be above its
    mean
  • The product $(x_i - \bar{x})(y_i - \bar{y})$ will be
    negative for most cases.

126
  • Again this implies that the statistic
    $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
  • will be negative
  • Most of the terms in this sum will be negative

127
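  • Pearson's correlation coefficient is defined as
    below
    $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$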

128
  • The denominator
    $\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}$
  • is always positive

129
  • The numerator
    $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
  • is positive if there is a positive relationship
    between X and Y and
  • negative if there is a negative relationship
    between X and Y.
  • This property carries over to Pearson's
    correlation coefficient r

130
Properties of Pearson's correlation coefficient r
  1. The value of r is always between -1 and +1.
  2. If the relationship between X and Y is positive,
    then r will be positive.
  3. If the relationship between X and Y is negative,
    then r will be negative.
  4. If there is no linear relationship between X and Y,
    then r will be close to zero.
  5. The value of r will be +1 if the points (xi, yi)
    lie on a straight line with positive slope.
  6. The value of r will be -1 if the points (xi, yi)
    lie on a straight line with negative slope.
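
These properties can be illustrated with a minimal Python sketch that computes r directly from the deviations from the means. The function name `pearson_r` and the three small data sets are illustrative assumptions, not data from the slides.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from sums of squares and cross-products of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on a line, positive slope
r_down = pearson_r(xs, [10 - 3 * x for x in xs])  # points on a line, negative slope
r_vee = pearson_r(xs, [2, 1, 0, 1, 2])            # symmetric V shape

print(r_up, r_down, r_vee)
```

The V-shaped example is a reminder that r measures linear association only: Y is perfectly predictable from X there, yet r is zero.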

131
r = 1
132
r = 0.95
133
r = 0.7
134
r = 0.4
135
r = 0
136
r = -0.4
137
r = -0.7
138
r = -0.8
139
r = -0.95
140
r = -1
141
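  • Computing formulae for the statistics

142
    $\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$
    $\sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$
    $\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$

143
  • To compute the three statistics
  • first compute
    $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$ and $\sum x_i y_i$
  • Then substitute these sums into the computing
    formulae.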

144
Example
  • Verbal IQ, Math IQ

146
(No Transcript)
147
  • Now
  • Hence

148
  • Thus Pearson's correlation coefficient is
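
The numerical values on these last slides were not captured in the transcript. As a hedged sketch, the calculation can be redone in Python from the Verbal IQ and Math IQ columns of Data Set 3, using the computing formulae (raw sums first, then the corrected sums of squares and cross-products). Variable names are illustrative, and the results are this sketch's own output, not figures quoted from the slides.

```python
import math

# Verbal IQ (x) and Math IQ (y) for students 1 to 23 of Data Set 3.
verbal = [86, 104, 86, 105, 118, 96, 90, 95, 105, 84, 94, 119,
          82, 80, 109, 111, 89, 99, 94, 99, 95, 102, 102]
math_iq = [94, 103, 92, 100, 115, 102, 87, 100, 96, 80, 87, 116,
           91, 93, 124, 119, 94, 117, 93, 110, 97, 104, 93]
n = len(verbal)

# Step 1: the five raw sums.
sum_x = sum(verbal)
sum_y = sum(math_iq)
sum_x2 = sum(x * x for x in verbal)
sum_y2 = sum(y * y for y in math_iq)
sum_xy = sum(x * y for x, y in zip(verbal, math_iq))

# Step 2: corrected sums of squares and cross-products.
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Step 3: Pearson's correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)
print(f"Sxx = {s_xx:.3f}, Syy = {s_yy:.3f}, Sxy = {s_xy:.3f}, r = {r:.3f}")
```

For these data the sketch gives an r of roughly 0.77, a fairly strong positive relationship between Verbal IQ and Math IQ.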