Title: Describing Data
1Describing Data
- Descriptive Statistics
- Central Tendency and Variation
2Lecture Objectives
- You should be able to:
- Compute and interpret appropriate measures of centrality and variation.
- Recognize distributions of data.
- Apply properties of normally distributed data based on the mean and variance.
- Compute and interpret covariance and correlation.
3Summary Measures
- 1. Measures of Central Location
- Mean, Median, Mode
- 2. Measures of Variation
- Range, Percentile, Variance, Standard Deviation
- 3. Measures of Association
- Covariance, Correlation
4Measures of Central Location: The Arithmetic Mean
- It is the arithmetic average of the data values.
- The most common measure of central tendency.
- Affected by extreme values (outliers).
Sample Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
[Figure: two data sets plotted on number lines. On the axis from 0 to 10, Mean = 5; on the axis from 0 to 14, Mean = 6.]
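The effect of an extreme value on the mean can be sketched in a few lines of Python. The two data sets below are hypothetical: the original data points were not recoverable from the slide, so the values were chosen only so that the means come out to 5 and 6.

```python
from statistics import mean

# Hypothetical data, chosen only so the means match the slide (5 and 6).
data = [1, 3, 5, 7, 9]
data_with_outlier = [1, 3, 5, 7, 14]  # the largest value is replaced by 14

mean_plain = mean(data)                  # (1 + 3 + 5 + 7 + 9) / 5 = 5
mean_outlier = mean(data_with_outlier)   # (1 + 3 + 5 + 7 + 14) / 5 = 6

print(mean_plain, mean_outlier)
```

Changing a single observation from 9 to 14 moves the mean by a full unit, which is what "affected by extreme values" means in practice.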
5Median
- An important measure of central tendency.
- In an ordered array, the median is the middle value.
- If n is odd, the median is the middle number.
- If n is even, the median is the average of the 2 middle numbers.
- Not affected by extreme values.
[Figure: two data sets plotted on number lines, one on an axis from 0 to 10 and one on an axis from 0 to 14. Median = 5 in both cases.]
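A minimal sketch of the odd/even rule, again using hypothetical data (the slide's data points were not recoverable; these values are chosen so that the median is 5 with and without the extreme value).

```python
from statistics import median

# Hypothetical data; only the resulting medians are taken from the slide.
odd_n = [1, 3, 5, 7, 9]           # n = 5: the middle (3rd) value is 5
with_outlier = [1, 3, 5, 7, 14]   # extreme value added: the median is still 5
even_n = [1, 3, 5, 7]             # n = 4: average of the 2 middle values, (3 + 5) / 2

median_odd = median(odd_n)
median_outlier = median(with_outlier)
median_even = median(even_n)

print(median_odd, median_outlier, median_even)
```

`statistics.median` sorts the data internally, so the input does not need to be ordered first.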
6Mode
- A measure of central tendency.
- The value that occurs most often.
- Not affected by extreme values.
- There may not be a mode.
- There may be several modes.
- Used for either numerical or categorical data.
[Figure: two data sets plotted on number lines. On the axis from 0 to 14, Mode = 9; on the axis from 0 to 6, there is no mode.]
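The "no mode" and "several modes" cases can be sketched as follows. The helper `modes` is hypothetical, and it assumes one common convention: when every value occurs equally often, the data set is reported as having no mode. The sample data are also made up for illustration.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] when all values tie."""
    counts = Counter(values)
    top = max(counts.values())
    # Assumed convention: if every distinct value occurs equally often,
    # no value stands out, so report "no mode".
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

no_mode = modes([0, 1, 2, 3, 4, 5, 6])            # every value occurs once
one_mode = modes([1, 3, 9, 9, 9, 12])             # 9 occurs most often
two_modes = modes([1, 1, 2, 2, 3])                # 1 and 2 tie for most frequent
categorical = modes(["red", "blue", "red"])       # works for categories too

print(no_mode, one_mode, two_modes, categorical)
```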
7Measures of Variability
- Range
- The simplest measure
- Percentile
- Used with Median
- Variance/Standard Deviation
- Used with the Mean
8Range
- Difference between the largest and smallest observations.
- Range ignores how the data are distributed.
Range = 12 - 7 = 5
9Percentile
2008 Olympic Medal Tally for top 55 nations. What
is the percentile score for a country with 9
medals? What is the 50th percentile?
Obs Medals Obs Medals Obs Medals Obs Medals Obs Medals
1 110 12 24 23 10 34 6 45 3
2 100 13 19 24 9 35 6 46 3
3 72 14 18 25 8 36 6 47 2
4 47 15 18 26 8 37 5 48 2
5 46 16 16 27 7 38 5 49 2
6 41 17 15 28 7 39 5 50 2
7 40 18 14 29 7 40 4 51 2
8 31 19 13 30 6 41 4 52 1
9 28 20 11 31 6 42 4 53 1
10 27 21 10 32 6 43 4 54 1
11 25 22 10 33 6 44 3 55 1
10Percentile - solutions
- Order all data (ascending or descending).
- The country with 9 medals ranks 24th out of 55. There are 31 nations (56.36%) below it and 23 nations (41.82%) above it. Hence it can be considered a 57th or 58th percentile score.
- The medal tally that corresponds to the 50th percentile is the one in the middle of the group, or the 28th country, with 7 medals. Hence the 50th percentile (median) is 7.
- Now compute the first and third quartile values.
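The counting argument above can be checked with a short Python sketch using the medal table. The quartile step uses `statistics.quantiles` with its default method, which is only one of several quartile conventions and is an assumption here; other conventions can give slightly different first and third quartiles.

```python
from statistics import median, quantiles

# Medal counts for observations 1 to 55, transcribed from the table.
medals = [110, 100, 72, 47, 46, 41, 40, 31, 28, 27, 25,
          24, 19, 18, 18, 16, 15, 14, 13, 11, 10, 10,
          10, 9, 8, 8, 7, 7, 7, 6, 6, 6, 6,
          6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3,
          3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1]

n = len(medals)
below = sum(m < 9 for m in medals)   # nations with fewer than 9 medals
above = sum(m > 9 for m in medals)   # nations with more than 9 medals
pct_below = 100 * below / n
pct_above = 100 * above / n

fiftieth = median(medals)            # middle (28th) value of the ordered data

# Quartile cut points (Q1, Q2, Q3) under the library's default convention.
q1, q2, q3 = quantiles(medals, n=4)

print(n, below, above, round(pct_below, 2), round(pct_above, 2), fiftieth)
print(q1, q2, q3)
```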
11Box Plot
- The box plot shows 5 points, as follows
12Outliers
Interquartile Range (IQR) = Q3 - Q1 = 60 - 40 = 20
1 Step = 1.5 × IQR = 1.5 × 20 = 30
Q1 - 30 = 40 - 30 = 10
Q3 + 30 = 60 + 30 = 90
Any point outside the limits (10, 90) is considered an outlier.
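The same arithmetic as a small Python sketch. The function names `outlier_limits` and `is_outlier` are hypothetical; the quartiles 40 and 60 are the ones used on this slide.

```python
def outlier_limits(q1, q3, k=1.5):
    """Return (lower, upper) limits located k * IQR beyond the quartiles."""
    iqr = q3 - q1          # interquartile range
    step = k * iqr         # "1 step" on the slide
    return q1 - step, q3 + step

lower, upper = outlier_limits(40, 60)   # IQR = 20, step = 30

def is_outlier(x):
    """True when x falls outside the (lower, upper) limits."""
    return x < lower or x > upper

print(lower, upper, is_outlier(95), is_outlier(50))
```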
13Variance
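For the Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
For the Sample: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$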
Variance is in squared units, and can be
difficult to interpret. For instance, if data are
in dollars, variance is in squared dollars.
14Standard Deviation
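For the Population: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
For the Sample: $s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$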
Standard deviation is the square root of the
variance.
15Computing Standard Deviation
Computing Sample Variance and Standard Deviation
Mean of X = 6
X    Deviation From Mean    Squared Deviation
3    -3                     9
4    -2                     4
6     0                     0
8     2                     4
9     3                     9
Sum of Squares (SS) = 26
Variance = SS/(n-1) = 26/4 = 6.50
Stdev = Sqrt(Variance) = 2.55
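The table can be reproduced step by step in Python. This is a minimal sketch using the five X values above. The variable names are illustrative, and the last line cross-checks the hand computation against the standard library's sample variance and standard deviation.

```python
from statistics import mean, stdev, variance

x = [3, 4, 6, 8, 9]

x_bar = mean(x)                              # 30 / 5 = 6
deviations = [v - x_bar for v in x]          # -3, -2, 0, 2, 3
ss = sum(d ** 2 for d in deviations)         # sum of squares = 26
sample_var = ss / (len(x) - 1)               # 26 / 4 = 6.5
sample_sd = sample_var ** 0.5                # square root of the variance

print(x_bar, ss, sample_var, round(sample_sd, 2))
print(variance(x), round(stdev(x), 2))       # library results for comparison
```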
16The Normal Distribution
- A property of normally distributed data is as follows:
Distance from Mean         Percent of observations included in that range
1 standard deviation       Approximately 68%
2 standard deviations      Approximately 95%
3 standard deviations      Approximately 99.7%
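These percentages can be recomputed from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

def coverage(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    # Prints roughly 68.27, 95.45 and 99.73 percent.
    print(k, round(100 * coverage(k), 2))
```

Because the property depends only on distance in standard deviations, the same fractions apply to any normal distribution, whatever its mean and variance.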
17Comparing Standard Deviations
[Figure: three data sets plotted on a common axis from 11 to 21. All three have Mean = 15.5; their standard deviations are s = 3.338, s = 0.9258, and s = 4.57.]
18Outliers
- Typically, a number beyond a certain number of standard deviations from the mean is considered an outlier.
- In many cases, a number beyond 3 standard deviations (about a 0.3% chance of occurring) is considered an outlier.
- If identifying an outlier is more critical, one can make the rule more stringent, and consider 2 standard deviations as the limit.
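A minimal sketch of the standard-deviation rule follows. The function `flag_outliers` and the data set are hypothetical, and the sample mean and sample standard deviation are used as the reference values. Note that in a very small sample no point can lie 3 standard deviations from the mean, so the rule is only meaningful with enough observations.

```python
from statistics import mean, stdev

def flag_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical data: 19 ordinary values and one extreme value (45).
data = [12, 14, 15, 15, 16, 16, 16, 17, 17, 18,
        15, 14, 16, 17, 15, 16, 13, 18, 16, 45]

beyond_3_sd = flag_outliers(data, k=3.0)
beyond_2_sd = flag_outliers(data, k=2.0)   # the more stringent rule

print(beyond_3_sd, beyond_2_sd)
```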
19Coefficient of Variation
Standard deviation relative to the mean. Helps compare deviations for samples with different means.
$CV = \frac{s}{\bar{X}} \times 100\%$
20Computing CV
- Stock A: Average price last year = 50
- Standard deviation = 5
- Stock B: Average price last year = 100
- Standard deviation = 5
Coefficient of Variation: Stock A CV = 5/50 = 10%; Stock B CV = 5/100 = 5%
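The two coefficients of variation can be computed as below. The function name `cv_percent` is hypothetical; the inputs are the stock figures from this slide.

```python
def cv_percent(std_dev, mean_value):
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean_value

cv_a = cv_percent(5, 50)    # Stock A
cv_b = cv_percent(5, 100)   # Stock B

print(cv_a, cv_b)
```

Both stocks have the same standard deviation, but relative to its average price Stock A varies twice as much as Stock B.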
21Standardizing Data
Obs Age Income Z-Age Z-Income
1 25 25000 -1.05 -1.13
2 28 52000 -0.86 -0.63
3 35 63000 -0.41 -0.43
4 36 74000 -0.34 -0.22
5 39 69000 -0.15 -0.31
6 45 80000 0.23 -0.11
7 48 125000 0.42 0.72
8 75 200000 2.15 2.11
Mean 41.38 86000.00
Std Dev 15.63 53973.54
Which of the two numbers for person 8 is farther
from the mean? The age of 75 or the income of
200,000?
Z scores tell us the distance from the mean,
measured in standard deviations
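The Z columns in the table can be reproduced with a short sketch. The helper `z_scores` is hypothetical. It uses the sample standard deviation (dividing by n - 1), which matches the Std Dev row of the table.

```python
from statistics import mean, stdev

age = [25, 28, 35, 36, 39, 45, 48, 75]
income = [25000, 52000, 63000, 74000, 69000, 80000, 125000, 200000]

def z_scores(values):
    """Distance of each value from the mean, in sample standard deviations."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

z_age = z_scores(age)
z_income = z_scores(income)

# Person 8 is the last observation in each list.
print(round(z_age[-1], 2), round(z_income[-1], 2))
```

For person 8 the age Z score (about 2.15) is slightly larger than the income Z score (about 2.11), so the age of 75 is marginally farther from its mean.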
22Measures of Association
- Covariance and Correlation
Covariance measures the average product of the deviations of two variables from their means.
Correlation is the standardized form of covariance (covariance divided by the product of the two standard deviations).
Correlation is always between -1 and 1.
         X    Dev X    Product    Dev Y    Y
         1    -1       3          -3       6
         2     0       0          -1       8
         3     1       4           4       13
Mean     2                                 9
Stdev    1                                 3.6
Sum of products = 7
Covariance = 7/(n-1) = 7/2 = 3.5
Correlation = 3.5/(1 × 3.6) = 0.97
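The covariance and correlation in this example can be recomputed as follows. This is a minimal sketch with illustrative variable names. It uses the sample form (dividing by n - 1), which is what yields the 3.5 shown above.

```python
from statistics import mean, stdev

x = [1, 2, 3]
y = [6, 8, 13]

x_bar, y_bar = mean(x), mean(y)                               # 2 and 9
products = [(a - x_bar) * (b - y_bar) for a, b in zip(x, y)]  # 3, 0, 4
covariance = sum(products) / (len(x) - 1)                     # 7 / 2 = 3.5
correlation = covariance / (stdev(x) * stdev(y))              # standardized covariance

print(products, covariance, round(correlation, 2))
```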