Title: Descriptive Statistics
1Lecture 4
- Descriptive Statistics
- Other descriptive measures
- Displaying data in tables and graphs
2Measures of Variability
- Consider the following two data sets on the ages of all patients suffering from bladder cancer (BC) and prostatic cancer (PC).
- The mean age in each of the two groups is 40 years.
- If we do not know the ages of individual patients and are told only that the mean age of the patients in the two groups is the same, we may be tempted to deduce that the patients in the two groups have a similar age distribution.
- However, the variation in the patients' ages in these two groups is very different.
- The ages of the prostatic cancer patients have a much larger variation than the ages of the bladder cancer patients.
BC: 39 45 36 40 35 38 47
PC: 27 52 18 33 70
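A minimal sketch of the comparison above, using the ages from the slide. The helper name `summarize` is an assumption introduced for illustration, and the spread is shown simply as the largest age minus the smallest.

```python
from statistics import mean

# Ages from the slide: bladder cancer (BC) and prostatic cancer (PC) patients
bc_ages = [39, 45, 36, 40, 35, 38, 47]
pc_ages = [27, 52, 18, 33, 70]

def summarize(ages):
    """Return (mean, spread), where spread = largest age - smallest age."""
    return mean(ages), max(ages) - min(ages)

bc_mean, bc_spread = summarize(bc_ages)
pc_mean, pc_spread = summarize(pc_ages)

# Same mean in both groups, but a very different spread
print(f"BC: mean = {bc_mean}, spread = {bc_spread}")
print(f"PC: mean = {pc_mean}, spread = {pc_spread}")
```

The two groups share a mean, yet the PC ages span a much wider interval, which is why a measure of variability is needed alongside the mean.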
3Measures of Variability
- Measure the spread in the data
- Some important measures
- Range
- Mean deviation
- Variance
- Standard Deviation
- Coefficient of variation
- Interquartile Range
4Variability
- The purpose of the majority of medical, behavioural and social science research is to explain or account for variance or differences among individuals or groups.
- Examples
  - What factors account for the variance (or difference) in IQ among individuals?
  - What factors account for the variance in treatment compliance among different groups of patients?
5Range
- The range tells us the span over which the data are distributed, and is only a very rough measure of variability.
- Range: the difference between the maximum and minimum scores.
- Example: the most made in tips in a night is 270 and the least is 150. Therefore, the range of tips made that night is 270 - 150 = 120.
- Range is the simplest measure of dispersion.
- It is not the best measure of dispersion, as it depends entirely on the extreme scores and tells us nothing about the middle values.
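A small sketch of why the range says nothing about the middle values. Only the extremes 150 and 270 come from the slide; the values in between are made up for illustration, and `data_range` is a hypothetical helper name.

```python
def data_range(values):
    """Range = maximum score - minimum score."""
    return max(values) - min(values)

# Hypothetical nights of tips: same extremes (150 and 270), different middles
clustered_low = [150, 160, 165, 170, 270]
clustered_high = [150, 250, 260, 265, 270]

range_low = data_range(clustered_low)
range_high = data_range(clustered_high)

# Both ranges are identical even though the middle values differ a lot
print(range_low, range_high)
```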
6Variation
  X     X - mean
  5       0.00      This is an example of data
  5       0.00      with NO variability
  5       0.00
  5       0.00
  5       0.00
ΣX = 25, n = 5, mean = 5
7Variation
  X     X - mean
  6       1.00      This is an example of data
  4      -1.00      with low variability
  6       1.00
  5       0.00
  4      -1.00
ΣX = 25, n = 5, mean = 5
8Variation
  X     X - mean
  8       3.00      This is an example of data
  1      -4.00      with higher variability
  9       4.00
  5       0.00
  2      -3.00
ΣX = 25, n = 5, mean = 5
9Mean deviation
- The best measures of dispersion should
  - take into account all the scores in the distribution
  - and describe the average deviation of the scores around the mean.
- Normally, to find the average we would want to sum all deviations from the mean and then divide by n, i.e., $\sum (X - \mu) / n$.
- BUT we have a problem: $\sum (X - \mu)$ will always add up to zero.
10Deviations from the mean
- In any group of scores, the sum of the deviations from the mean equals zero.

  X     X - µ
  3     3 - 5.50 = -2.50
  5     5 - 5.50 = -0.50
  9     9 - 5.50 =  3.50
  2     2 - 5.50 = -3.50
  8     8 - 5.50 =  2.50
  6     6 - 5.50 =  0.50
ΣX = 33     Σ(X - µ) = 0.00

n = 6, µ = ΣX/n = 33/6 = 5.50
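The table above can be checked with a short sketch. The six scores are the ones on the slide; the variable names are chosen here for illustration.

```python
scores = [3, 5, 9, 2, 8, 6]

n = len(scores)
mu = sum(scores) / n                    # mean = 33 / 6
deviations = [x - mu for x in scores]   # X - mu for each score
total_deviation = sum(deviations)       # positives and negatives cancel out

print(f"n = {n}, mean = {mu:.2f}")
for x, d in zip(scores, deviations):
    print(f"{x:>2}  {d:+.2f}")
print(f"sum of deviations = {total_deviation:.2f}")
```

Because the deviations cancel, their plain average is useless as a measure of spread, which motivates squaring them first.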
11Variance & Standard Deviation
- However, if we square each of the deviations from the mean, we obtain a sum that is not equal to zero.
- This is the basis for the measures of variance and standard deviation, the two most common measures of variability (or dispersion) of data.
12Variance & Standard Deviation (cont)
  X     X - mean   (X - mean)^2
  8       3.00         9.00
  1      -4.00        16.00
  9       4.00        16.00
  5       0.00         0.00
  2      -3.00         9.00
 25       0.00        50.00    (column totals)
- Note: the total of the squared deviations (here 50.00) is called the Sum of Squares.
13Steps to calculate standard deviation
- Compute the mean.
- Subtract the mean from each observation.
- Square each of the deviations.
- Sum them.
- Divide by one less than the number of observations (the result is almost the mean of the squared deviations).
- Take the square root.
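The steps above can be sketched as follows. The function name `sample_sd` is an assumption for illustration, and the scores are taken from the earlier higher-variability example (8, 1, 9, 5, 2).

```python
import math

def sample_sd(observations):
    """Sample standard deviation, following the listed steps one by one."""
    n = len(observations)
    mean = sum(observations) / n                    # 1. compute the mean
    deviations = [x - mean for x in observations]   # 2. subtract the mean
    squared = [d ** 2 for d in deviations]          # 3. square each deviation
    sum_of_squares = sum(squared)                   # 4. sum them
    variance = sum_of_squares / (n - 1)             # 5. divide by n - 1
    return math.sqrt(variance)                      # 6. take the square root

scores = [8, 1, 9, 5, 2]
sd = sample_sd(scores)
print(f"sample standard deviation = {sd:.2f}")
```

For these scores the sum of squares is 50, so the sample variance is 50 / 4 = 12.5 and the standard deviation is its square root.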
14Variance of a Population
- The sum of squared deviations from the mean divided by the number of scores (sigma squared):
  $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$, where N is the number of scores in the population.
15Sample Variance
- The sum of squared deviations from the mean divided by the number of degrees of freedom, n - 1 (an estimate of the population variance):
  $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$, where n is the number of scores in the sample.
16Standard Deviation Formulas
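- Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
- Sample Standard Deviation: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
- A standard deviation computed from a sample with n in the denominator usually underestimates the population standard deviation. Using n - 1 in the denominator corrects for this and gives us a better estimate of the population standard deviation.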
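As a hedged illustration of the two denominators, Python's standard `statistics` module provides both versions: `pvariance`/`pstdev` divide by the number of scores, while `variance`/`stdev` divide by n - 1. The scores are reused from the earlier example (8, 1, 9, 5, 2).

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [8, 1, 9, 5, 2]          # sum of squared deviations = 50

pop_var = pvariance(scores)       # 50 / 5, treats the scores as a population
samp_var = variance(scores)       # 50 / 4, uses n - 1 degrees of freedom
pop_sd = pstdev(scores)
samp_sd = stdev(scores)

print(f"population: variance = {pop_var}, sd = {pop_sd:.3f}")
print(f"sample:     variance = {samp_var}, sd = {samp_sd:.3f}")
```

Dividing by n - 1 always gives the larger value, which is the direction of the correction described above.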
17Why use Standard Deviation and not Variance!??!
- Normally, you will only calculate variance in order to calculate standard deviation, as standard deviation is what we typically want.
- Why? Because standard deviation expresses variability in the same units as the data.
- Example: the standard deviation of ages in a class is 3.7 years (and the variance would be (3.7)² = 13.69 years²).
18Coefficient of variation
- It is a dimensionless measure of the relative variation.
- Constructed by dividing the standard deviation by the mean and multiplying by 100.
- $CV = (s / \bar{x}) \times 100$
- Used to compare the variability in one data set with that in another when a direct comparison of standard deviation is not appropriate.
19Coefficient of variation
- The formula is
- $CV = (s / \bar{x}) \times 100$
- Suppose two samples of human males yield the following results:
              Children    Adults
Mean age      11 yrs      25 yrs
Mean wt       80 lbs      145 lbs
SD            10 lbs      10 lbs
CV            12.5%       6.9%
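A minimal sketch reproducing the CV row of the table from the mean weights and standard deviations given above. The function name is an assumption for illustration.

```python
def coefficient_of_variation(sd, mean):
    """CV = (standard deviation / mean) * 100, expressed as a percentage."""
    return sd / mean * 100

cv_children = coefficient_of_variation(sd=10, mean=80)
cv_adults = coefficient_of_variation(sd=10, mean=145)

print(f"children: CV = {cv_children:.1f}%")
print(f"adults:   CV = {cv_adults:.1f}%")
```

Both groups have the same standard deviation (10 lbs), but relative to their mean weight the children vary almost twice as much as the adults.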
20Interquartile Range
- Quartiles refer to the division of the distribution into 4 equal parts.
- Q1 marks the end of the first 25% of the scores: the 25th percentile.
- Q2 marks the end of the next 25% of the scores (from Q1 to Q2): the median (50th percentile).
- Q3 marks the end of the scores between Q2 and Q3: the 75th percentile.
- Q4 marks the end of the final 25% of the scores: the 100th percentile.
- The IQR contains the middle 50% of the scores. It is obtained as Q3 - Q1 (i.e. the 75th percentile minus the 25th percentile).
21Calculating IQR
- Step 1. Divide the scores into 4 equal parts (12/4 = 3 scores per part)
- Step 2. Find Q1 and Q3
  - Q1 lies midway between the 3rd and 4th scores
  - Q3 lies midway between the 9th and 10th scores
- Step 3. Calculate Q3 - Q1
22Example
- Back to our example:
- 150, 165, 170, 175, 180, 190, 210, 210, 235, 240, 260, 270
- Step 1: Divide the scores into 4 equal parts
  150, 165, 170 | 175, 180, 190 | 210, 210, 235 | 240, 260, 270
               Q1              Q2              Q3
- Step 2: Find Q1 and Q3
  Q1 = (170 + 175)/2 = 172.5      Q3 = (235 + 240)/2 = 237.5
- Step 3: Calculate Q3 - Q1
  Q3 - Q1 = 237.5 - 172.5 = 65
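The three steps can be sketched as below. This follows the slide's hand method and assumes the number of scores is divisible by 4; statistical libraries use other interpolation rules and may return slightly different quartiles. The function name `quartiles_by_parts` is hypothetical.

```python
def quartiles_by_parts(scores):
    """Q1 and Q3 by splitting the sorted scores into 4 equal parts.

    Assumes len(scores) is a multiple of 4, as in the slide's example.
    """
    ordered = sorted(scores)
    part = len(ordered) // 4                          # Step 1: size of each part
    # Step 2: each quartile lies midway between neighbouring parts
    q1 = (ordered[part - 1] + ordered[part]) / 2
    q3 = (ordered[3 * part - 1] + ordered[3 * part]) / 2
    return q1, q3

tips = [150, 165, 170, 175, 180, 190, 210, 210, 235, 240, 260, 270]
q1, q3 = quartiles_by_parts(tips)
iqr = q3 - q1                                         # Step 3
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```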
23Weighted Mean
Problem: You have two classes, with 5 and 25 students, respectively. In the smaller class (n = 5), the average grade is 60. In the larger class (n = 25), the average grade is 45. What is the average overall?
Not this! (60 + 45)/2
Instead, weight each class average by its class size: (5 × 60 + 25 × 45)/(5 + 25) = 47.5
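A short sketch of the weighted mean for this problem, with each class average weighted by its class size. The function name `weighted_mean` is an assumption for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

class_averages = [60, 45]
class_sizes = [5, 25]

overall = weighted_mean(class_averages, class_sizes)
naive = sum(class_averages) / len(class_averages)   # ignores class sizes

print(f"weighted mean = {overall}, unweighted mean = {naive}")
```

The overall average sits much closer to 45 because the larger class contributes five times as many students.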
24Measures to use with nominal or ordinal data
- When observations are measured on a nominal or ordinal scale, the methods just discussed for describing the middle and the spread do not work.
- Characteristics measured on nominal or ordinal scales do not have numerical values but are counts or frequencies of occurrence.
25Example
- Proportions and percentages
  - A proportion is the number (a) of observations with a given characteristic (such as dying) divided by the total number of observations, i.e. those that died plus those that survived (a + b).
  - Proportion: p = a/(a + b), e.g. 98/945 = 0.104.
  - A percentage is a proportion multiplied by 100.
- Ratios
  - A ratio is the number (a) of observations in a given group with a given characteristic (such as dying) divided by the number (b) of observations without the given characteristic: ratio = a/b.
  - A ratio is always defined as a part divided by another part, e.g. 98/847 = 0.116 or 152/787 = 0.193.

            Treatment groups
Survival    Timolol     Placebo
Died        98 (a)      152 (c)
Survived    847 (b)     787 (d)
Total       945         939
26Rates
- Rates are similar to proportions except that a multiplier (e.g., 1000, 10,000, or 100,000) is used and they are computed over a specified period of time. The multiplier is called the base and the formula is
- Rate = [a/(a + b)] × base
- For example, if the timolol study lasted exactly one year, the rate of death per 10,000 patients taking timolol per year is (98/945) × 10,000 = 1037 per 10,000 patients per year.
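A minimal sketch computing the proportion, ratio and rate for the timolol group from the counts in the survival table (a = 98 died, b = 847 survived). The variable names are chosen here for illustration, and the one-year study length is the slide's own assumption.

```python
died = 98        # a: timolol patients who died
survived = 847   # b: timolol patients who survived
base = 10_000    # multiplier used for the rate

proportion = died / (died + survived)   # part divided by the whole
percentage = proportion * 100
ratio = died / survived                 # part divided by another part
rate = proportion * base                # deaths per 10,000 patients per year

print(f"proportion = {proportion:.3f} ({percentage:.1f}%)")
print(f"ratio      = {ratio:.3f}")
print(f"rate       = {rate:.0f} per {base:,} patients per year")
```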
27Categorical Graphs (Nominal or Ordinal)
28Pie Charts and Nominal Data
- Pie charts are commonly used to represent the frequency of scores for nominal data.
- Example: patients distributed according to grade
- 20% have grade I, 70% of the patients have grade II and 10% have grade III.
29Pie Charts (Counts and Percents)
30Barcharts and Nominal Data
- Barcharts are sometimes used to represent the frequency of scores for nominal data.
- Here, frequency is expressed as a percentage of the total number of males and females (78 and 68).
31Vertical Bar Graphs
32Horizontal Bar Graphs
33Numerical Graphs
- Histograms
- Frequency polygons
- Boxplots
34Example
- What is the age of this group of children?
  4 7 7 7 8 8 7 8 9 4 7 3 6 9 10 5 7 10 6 8
  7 8 7 8 7 4 5 10 10 0 9 8 3 7 9 7 9 5 8 5
  0 4 6 6 7 5 3 2 8 5 10 9 10 6 4 8 8 8 4 8
  7 3 7 8 8 8 7 9 7 5 6 3 4 8 7 5 7 3 3 6
  5 7 5 7 8 8 7 10 5 4 3 7 6 3 9 7 8 5 7 9
  9 3 1 8 6 6 4 8 5 10 4 8 10 5 5 4 9 4 7 7
  7 6 6 4 4 4 9 7 10 4 7 5 10 7 9 2 7 5 9 10
  3 7 2 5 9 8 10 10 6 8 3
35Frequency Tables
- A frequency table shows how often each value of the variable occurs.
- Also called a frequency distribution table.
Age (years)   Frequency
10 14
9 15
8 26
7 31
6 13
5 18
4 16
3 12
2 3
1 1
0 2
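A small sketch of how such a table can be tallied. To keep it short it uses only the first 20 ages from the children example, so its counts are not those of the full table above; `collections.Counter` is Python's standard tallying class.

```python
from collections import Counter

# First 20 ages from the children example (a subset, for illustration only)
ages = [4, 7, 7, 7, 8, 8, 7, 8, 9, 4, 7, 3, 6, 9, 10, 5, 7, 10, 6, 8]

frequency = Counter(ages)   # maps each age to how often it occurs

print("Age (years)  Frequency")
for age in sorted(frequency, reverse=True):
    print(f"{age:>11}  {frequency[age]:>9}")
```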
36Histograms
- A way of visually representing information contained in a frequency table.
- Histograms are somewhat like bar charts: bars are used instead of connected points.
- The bars typically cover intervals of values. The first bar here covers scores > 0 and < 1.
37Histogram
Note that these are analogous to counts and
percents with bar charts
38Frequency Polygon
- Another way of visually representing information contained in a frequency table.
- Align all possible values on the bottom of the graph (the x-axis).
- On the vertical line (the y-axis), place a point denoting the frequency of scores for each value.
- Connect the points.
- (Typically add an extra value above and below the actual range of values.)
39Boxplots
- Boxplots graphically represent the scores in a distribution.
- Made using the 5-number summary.
- Within the box are all scores that fall between the 25th and 75th percentile.
- The whiskers capture all scores within 1.5 IQRs of the box boundary.
- Outliers are between 1.5 and 3 IQRs from the box boundary.
- Extreme outliers are beyond 3 IQRs.
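The 1.5 and 3 IQR rules above can be sketched as a small classifier. The quartiles are taken from the earlier tips example (Q1 = 172.5, Q3 = 237.5); the function name `classify` and the test values 340 and 450 are made up for illustration.

```python
def classify(value, q1, q3):
    """Label a score by its distance from the box, in units of the IQR."""
    iqr = q3 - q1
    # Whisker limits: 1.5 IQRs beyond the box boundaries
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Limits for extreme outliers: 3 IQRs beyond the box boundaries
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    if inner_low <= value <= inner_high:
        return "within whiskers"
    if outer_low <= value <= outer_high:
        return "outlier"
    return "extreme outlier"

q1, q3 = 172.5, 237.5   # quartiles from the tips example
for score in (150, 340, 450):
    print(score, classify(score, q1, q3))
```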
40Shapes of Distributions
- These representational aids all describe frequency distributions: the way score frequencies are distributed with respect to the values of the variable.
- Distributions can take on a number of shapes or forms.
41Unimodal Distributions
- The mode of a distribution refers to the most frequently occurring score.
- In a unimodal distribution, one score occurs much more frequently than others.
42Multimodal Distributions
- In multimodal distributions, more than one mode exists (or approximately so).
- In a bimodal distribution, two modes exist.
43Rectangular or Uniform Distributions
- In a uniform distribution, all values are
observed equally often
44Symmetrical and Skewed Distributions
- A symmetrical distribution is balanced: if we cut it in half, the two sides would be mirror images of one another.
- Normal distribution: a particular kind of symmetrical distribution that resembles a bell (bell-shaped distribution).
45Skewed Distributions
- A skewed distribution is unbalanced: there may be a cluster of scores piling up at one end of the scale.
46Skewed
positively skewed distribution (skewed right)
negatively skewed distribution (skewed left)
47Mean, median and mode
(Figure: the mode, median and mean marked on each of the two skewed distributions.)
48Using different measures of central tendency
- Two factors are important in making the decision of which measure of central tendency should be used:
  - Scale of measurement (ordinal or numerical).
  - Shape of the distribution of observations.
- A distribution can be symmetric, skewed to the right (positively skewed), or skewed to the left (negatively skewed).
49Using different measures of central tendency
The following guidelines help the researcher decide which measure is best with a given set of data:
- The mean is used for numerical data and for symmetric distributions.
50Using different measures of central tendency (cont)
- The median is used for ordinal data or for numerical data whose distribution is skewed.
51Using different measures of central tendency (cont)
- The mode is used primarily for nominal or ordinal data or for numerical data with a bimodal distribution.
52Using different measures of dispersion
- The following guidelines help investigators decide which measure of dispersion is most appropriate for a given set of data:
- The standard deviation is used when the mean is used, i.e., with symmetric distributions of numerical data.
- Percentiles and the interquartile range are used in two cases:
  - When the median is used, i.e., with ordinal data or with skewed numerical data.
  - When the mean is used but the objective is to compare individual observations with a set of norms.
- The interquartile range is used to describe the central 50% of the distribution, regardless of the shape.
- The range is used with numerical data when the purpose is to emphasize extreme values.
- The coefficient of variation is used when the intent is to compare two numerical distributions measured on different scales.
53General principles concerning the construction of
tables
- Tables should be fully self-explanatory.
- Units should be stated for each numerical variable.
- Do not try to include too much information in a single table. Simplicity, with reduction of contents to the minimum, is essential.
54General principles concerning the construction of
tables (cont)
- The function of ruling is to provide clarity of interpretation.
- Unnecessary ruling should be avoided.
- Spacing can provide the same effect as ruling.
- As a general rule, ruling should be included to set off the title of the table, to divide major row and column headings, and to close the table bottom.
Id   age   sex   bp
1    23    m     124
2
3
4
5
55General principles concerning the construction of
tables (cont)
- Numerical entries of zero should be explicitly written rather than indicated by a dash or a dotted line (--- or __).
- A dash or a dotted line should be reserved for data that are missing or unobserved.
- Zero is a number, and numerical observations of zero should be explicitly presented as such.
- E.g. if a survey shows no cases of poliomyelitis in a particular county in a particular year, the entry should indicate this fact with a 0. If the information from that particular county was incomplete or otherwise unavailable, a dash or a dotted line should be used.
56General principles concerning the construction of
tables (cont)
- A numerical entry should not begin with a decimal point.
  - The reader runs some risk of interpreting a leading decimal point as a foreign object.
  - This misinterpretation can be avoided quite simply by showing a leading zero immediately to the left of the decimal point.
  - E.g. write 0.5 instead of .5.
- Numbers indicating values of the same characteristic should be reported to the same number of decimal places.
  - E.g. don't write age = 21, 23.4, 27.65
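One way to apply both rules when producing table entries programmatically is fixed-point formatting, sketched here with Python's format specification. The helper name `format_entry` and the choice of two decimal places are assumptions for illustration.

```python
def format_entry(value, decimals=2):
    """Format a number with a leading zero and a fixed number of decimals."""
    return f"{value:.{decimals}f}"

ages = [21, 23.4, 27.65]
formatted_ages = [format_entry(age) for age in ages]   # same precision throughout
half = format_entry(.5, decimals=1)                     # leading zero is added

print(formatted_ages)
print(half)
```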
57General principles concerning the construction of
graphs
- Graphs should be fully self-explanatory.
  - Many readers don't read the detailed text; they just look at the graph.
- The contents of the graph should be as complete as possible.
  - The title should include information concerning who or what the subjects or experimental material are, what observations are abstracted from those subjects or material, and what restrictions of time and place apply to the graph.
  - E.g. a presentation of birth rates in the state of Michigan should never be headed merely "Birth Rates," but might well be modified to say "Birth Rates per 1,000 Population, White Race, Michigan, 1920-1960."
  - If the length of the title becomes a problem, additional essential material can frequently be included in a footnote.
  - In fact, the graph should be as self-contained as possible, requiring as little outside information for clear interpretation as is feasible.
58General principles concerning the construction of
graphs (cont)
- Vertical and horizontal scales should be clearly labeled and units should be identified.
  - Most graphs present numerical information in scaled form.
  - Scales must be labeled in order to describe fully the variable presented on the scale, and for measurement variables the units of measurement should be identified.
  - E.g. weight (g), age (years), etc.
59General principles concerning the construction of
graphs (cont)
- Do not try to include too much information in a single graph.
  - It is better to include several graphs than to compress information too much.
  - A device frequently used for the presentation of many curves or trends is the presentation of a series of small graphs.
  - A safe rule of thumb is to avoid graphs containing more than 3 curves.
60General principles concerning the construction of
graphs (cont)
- Graphs are intended to give an overview rather than a highly detailed picture of a set of data.
- Do not include too much detail in a graph.
  - Detailed presentations should be reserved for tables.
  - Graphs condense detail to let the reader see the forest rather than the trees.
  - If your main interest is in the trees, use a table.
  - The inclusion of too much detail in a graph will tend to obscure the essential points.
- Avoid inclusion of numbers within the body of a graph.