Descriptive Statistics - PowerPoint PPT Presentation

About This Presentation

Title:

Descriptive Statistics

Slides: 61

Provided by: Dr2081

Transcript and Presenter's Notes

Title: Descriptive Statistics

1
Lecture 4

Descriptive Statistics
Other descriptive measures
Displaying data in tables and graphs

2
Measures of Variability

Consider the following two data sets on the ages
of all patients suffering from bladder cancer and
prostatic cancer.
The mean age of each group is 40 years.
If we do not know the ages of individual patients
and are told only that the mean age of the
patients in the two groups is the same, we may
wrongly deduce that the patients in the two groups
have a similar age distribution.
In fact, the variation in the patients' ages in
these two groups is very different.
The ages of the prostatic cancer patients have a
much larger variation than the ages of the
bladder cancer patients.

Bladder cancer (BC):   39 45 36 40 35 38 47
Prostatic cancer (PC): 27 52 18 33 70
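As a rough illustration of this point, the Python sketch below computes the mean, range and sample standard deviation of the two age lists from the slide. The variable names are made up for this example, and the use of the standard library `statistics` module is an assumption of the sketch, not something the lecture prescribes.

```python
from statistics import mean, stdev

# Ages from the slide: bladder cancer (BC) and prostatic cancer (PC) patients.
bc_ages = [39, 45, 36, 40, 35, 38, 47]
pc_ages = [27, 52, 18, 33, 70]

for label, ages in (("BC", bc_ages), ("PC", pc_ages)):
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    print(f"{label}: mean = {mean(ages):.1f}, "
          f"range = {max(ages) - min(ages)}, sd = {stdev(ages):.2f}")
```

Both groups have a mean of 40, but the PC group has the much larger range and standard deviation.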
3
Measures of Variability

Measure the spread in the data
Some important measures
Range
Mean deviation
Variance
Standard Deviation
Coefficient of variation
Interquartile Range

4
Variability

The purpose of the majority of medical,
behavioural and social science research is to
explain or account for variance or differences
among individuals or groups.
Examples
What factors account for the variance (or
difference) in IQ among individuals?
What factors account for the variance in
treatment compliance among different groups of
patients?

5
Range

The range tells us the span over which the data
are distributed, and is only a very rough measure
of variability
Range: the difference between the maximum and
minimum scores
Example: The largest amount of tips made in a night
is 270 and the smallest is 150. Therefore, the range
of tips made that night is 270 - 150 = 120
Range is the simplest measure of dispersion.
It is not the best measure of dispersion as it
depends entirely on the extreme scores and tells
us nothing about the middle values.
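A minimal Python sketch of the range, using the twelve nightly tip amounts that appear in the interquartile range example later in this lecture. The function name `value_range` is hypothetical. The second call shows how a single extreme score changes the range while the middle values stay the same.

```python
# Nightly tips, taken from the interquartile range example in this lecture.
tips = [150, 165, 170, 175, 180, 190, 210, 210, 235, 240, 260, 270]

def value_range(scores):
    """Return the range: maximum score minus minimum score."""
    return max(scores) - min(scores)

print(value_range(tips))  # 120

# Replace only the largest score with a made-up extreme value of 500.
print(value_range(tips[:-1] + [500]))  # 350
```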

6
Variation

X    X - µ
5    0.00
5    0.00
5    0.00
5    0.00
5    0.00
$\sum X = 25$, $n = 5$, $\mu = 5$
This is an example of data with NO variability

7
Variation

X    X - µ
6     1.00
4    -1.00
6     1.00
5     0.00
4    -1.00
$\sum X = 25$, $n = 5$, $\mu = 5$
This is an example of data with low variability

8
Variation

X    X - µ
8     3.00
1    -4.00
9     4.00
5     0.00
2    -3.00
$\sum X = 25$, $n = 5$, $\mu = 5$
This is an example of data with higher variability

9
Mean deviation

The best measures of dispersion should
take into account all the scores in the
distribution
and should describe the average deviation of the
scores around the mean.
Normally, to find the average we would want to
sum all deviations from the mean and then divide
by n, i.e., $\sum (X - \mu) / n$
BUT we have a problem:
$\sum (X - \mu)$ will always add up to zero

10
Deviations from the mean

In any group of scores, the sum of the deviations
from the mean equals zero
X    X - µ
3    3 - 5.50 = -2.50
5    5 - 5.50 = -0.50
9    9 - 5.50 =  3.50
2    2 - 5.50 = -3.50
8    8 - 5.50 =  2.50
6    6 - 5.50 =  0.50
$\sum X = 33$, $\sum (X - \mu) = 0.00$
$n = 6$, $\mu = \sum X / n = 33/6 = 5.50$
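The table can be checked with a few lines of Python. This is only a sketch with made-up variable names. With other data, floating-point rounding may leave a tiny non-zero remainder instead of an exact zero.

```python
scores = [3, 5, 9, 2, 8, 6]  # the six scores from the slide

mu = sum(scores) / len(scores)          # mean of the scores
deviations = [x - mu for x in scores]   # X - mu for each score

print(mu)               # 5.5
print(deviations)       # [-2.5, -0.5, 3.5, -3.5, 2.5, 0.5]
print(sum(deviations))  # 0.0
```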

11
Variance and Standard Deviation

However, if we square each of the deviations from
the mean, we obtain a sum that is not equal to
zero
This is the basis for the measures of variance
and standard deviation, the two most common
measures of variability (or dispersion) of data

12
Variance and Standard Deviation (cont)

X    X - µ    (X - µ)^2
8     3.00     9.00
1    -4.00    16.00
9     4.00    16.00
5     0.00     0.00
2    -3.00     9.00
25    0.00    50.00   (column sums)
Note: $\sum (X - \mu)^2$ is called the Sum
of Squares

13
Steps to calculate standard deviation

Compute the mean.
Subtract the mean from each observation.
Square each of the deviations.
Sum them.
Divide by one less than the number of
observations (almost the mean).
Take the square root.
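The six steps can be followed one by one in Python. The sketch below uses the five scores from the "higher variability" slide (8, 1, 9, 5, 2). All variable names are chosen for this illustration only.

```python
import math

scores = [8, 1, 9, 5, 2]

# Step 1: compute the mean.
mean = sum(scores) / len(scores)
# Step 2: subtract the mean from each observation.
deviations = [x - mean for x in scores]
# Step 3: square each of the deviations.
squared = [d ** 2 for d in deviations]
# Step 4: sum them (this is the sum of squares).
sum_of_squares = sum(squared)
# Step 5: divide by one less than the number of observations.
variance = sum_of_squares / (len(scores) - 1)
# Step 6: take the square root.
sd = math.sqrt(variance)

print(sum_of_squares, variance, round(sd, 2))  # 50.0 12.5 3.54
```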

14
Variance of a Population

The sum of squared deviations from the mean
divided by the number of scores, N (sigma squared):
$\sigma^2 = \sum (X - \mu)^2 / N$

15
Sample Variance

The sum of squared deviations from the sample mean
divided by the number of degrees of freedom, n - 1
(an estimate of the population variance):
$s^2 = \sum (X - \bar{X})^2 / (n - 1)$

16
Standard Deviation Formulas

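Population Standard Deviation
$\sigma = \sqrt{\sum (X - \mu)^2 / N}$

Sample Standard Deviation
$s = \sqrt{\sum (X - \bar{X})^2 / (n - 1)}$
A standard deviation computed from a sample with n
in the denominator usually underestimates the
population standard deviation. Using n - 1 in the
denominator corrects for this and gives us a
better estimate of the population standard
deviation.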
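To see the effect of the denominator, the sketch below compares the population and sample versions on the same five scores (8, 1, 9, 5, 2). It assumes Python's standard `statistics` module, where `pvariance` and `pstdev` divide by n and `variance` and `stdev` divide by n - 1.

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [8, 1, 9, 5, 2]  # sum of squared deviations is 50

# Dividing by n = 5 gives 10; dividing by n - 1 = 4 gives 12.5.
print(pvariance(scores), variance(scores))

# The n - 1 version is always the larger of the two standard deviations.
print(round(pstdev(scores), 2), round(stdev(scores), 2))
```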
17
Why use Standard Deviation and not Variance?

Normally, you will only calculate variance in
order to calculate standard deviation, as
standard deviation is what we typically want.
Why? Because standard deviation expresses
variability in the same units as the data.
Example: the standard deviation of ages in a class is
3.7 years (and the variance would be 13.69 years²,
i.e. (3.7)²).

18
Coefficient of variation

It is a dimensionless measure of the relative
variation.
Constructed by dividing the standard deviation by
the mean and multiplying by 100.
$CV = (s / \bar{x}) \times 100$
Used to compare the variability in one data set
with that in another when a direct comparison of
standard deviation is not appropriate.

19
Coefficient of variation

The formula is
$CV = (s / \bar{x}) \times 100$
Suppose two samples of human males yield the
following results

               Children   Adults
Mean age       11 yrs     25 yrs
Mean weight    80 lbs     145 lbs
SD of weight   10 lbs     10 lbs
CV (%)         12.5       6.9
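The two CV values in the table can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is hypothetical, and the zero-mean check is an added safeguard rather than part of the lecture.

```python
def coefficient_of_variation(sd, mean):
    """Return the CV as a percentage: (sd / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100

# Same SD of weight (10 lbs) in both samples, but different mean weights.
print(round(coefficient_of_variation(10, 80), 1))   # children: 12.5
print(round(coefficient_of_variation(10, 145), 1))  # adults: 6.9
```

Although the standard deviation is the same in both groups, the relative variation is almost twice as large in the children.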
20
Interquartile Range

Quartiles refer to the division of the
distribution into 4 equal parts
Q1 refers to the first 25% of the scores; its upper
boundary is the 25th percentile
Q2 refers to the next 25% of the scores (from Q1
to Q2); its upper boundary is the median (50th percentile)
Q3 refers to the scores between Q2 and Q3; its upper
boundary is the 75th percentile
Q4 refers to the final 25% of the scores, up to the
100th percentile
The IQR contains the middle 50% of the scores.
It is obtained as Q3 - Q1 (i.e. the 75th
percentile minus the 25th percentile)

21
Calculating IQR

Step 1. Divide the scores into 4 equal parts
(12/4 = 3 scores per part)
Step 2. Find Q1 and Q3
- Q1 lies midway between the 3rd and 4th scores
- Q3 lies midway between the 9th and 10th
scores
Step 3. Calculate Q3 - Q1

22
Example

Back to our example
150, 165, 170, 175, 180, 190, 210, 210, 235, 240,
260, 270
Step 1: Divide the scores into 4 equal parts
150, 165, 170 | 175, 180, 190 | 210, 210, 235 | 240, 260, 270
              Q1              Q2              Q3
Step 2: Find Q1 and Q3
Q1 = (170 + 175)/2 = 172.5
Q3 = (235 + 240)/2 = 237.5
Step 3: Calculate Q3 - Q1
Q3 - Q1 = 237.5 - 172.5 = 65
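The three steps can be sketched in Python as follows. The function name `iqr_equal_parts` is hypothetical, and the sketch assumes the number of scores is divisible by 4, as in this example. Library routines for quantiles often use other interpolation rules and may give slightly different quartiles for the same data.

```python
def iqr_equal_parts(scores):
    """Return (Q1, Q3, IQR) by splitting the sorted scores into 4 equal parts.

    Each quartile is taken midway between the last score of one part
    and the first score of the next part.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n % 4 != 0:
        raise ValueError("this simple method needs n divisible by 4")
    k = n // 4                                       # scores per part
    q1 = (ordered[k - 1] + ordered[k]) / 2           # between parts 1 and 2
    q3 = (ordered[3 * k - 1] + ordered[3 * k]) / 2   # between parts 3 and 4
    return q1, q3, q3 - q1

tips = [150, 165, 170, 175, 180, 190, 210, 210, 235, 240, 260, 270]
print(iqr_equal_parts(tips))  # (172.5, 237.5, 65.0)
```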

23
Weighted Mean
Problem: You have two classes, with 5 and 25
students, respectively. In the smaller class
(n = 5), the average grade is 60. In the larger
class (n = 25), the average grade is 45. What
is the average overall?
Not this!  (60 + 45)/2
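The overall average has to weight each class mean by its class size, so that every student counts once. A minimal Python sketch, with a hypothetical helper named `weighted_mean`:

```python
def weighted_mean(means, sizes):
    """Combine group means, weighting each by its group size."""
    total = sum(sizes)
    return sum(m * n for m, n in zip(means, sizes)) / total

# (5 * 60 + 25 * 45) / (5 + 25) = 1425 / 30
print(weighted_mean([60, 45], [5, 25]))  # 47.5

# The unweighted average of the two class means, for comparison.
print((60 + 45) / 2)                     # 52.5
```

The weighted result is pulled towards 45 because the larger class contributes five times as many students.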
24
Measures to use with nominal or ordinal data

When observations are measured on a nominal, or
ordinal scale, the methods just discussed for
describing the middle and the spread do not work.
Characteristics measured on nominal or ordinal
scales do not have numerical values but are
counts or frequencies of occurrence.

25
Example

Proportions and percentages
A proportion is the number (a) of observations
with a given characteristic (such as dying)
divided by the total number of observations, i.e.
those who died plus those who survived (a + b)
Proportion: $p = a/(a + b)$, e.g. 98/945 = 0.104.
A percentage is a proportion multiplied by 100.
Ratios
A ratio is the number (a) of observations in a
given group with a given characteristic (such as
dying) divided by the number (b) of observations
without the given characteristic
Ratio = a/b
A ratio is always defined as a part divided by
another part.
E.g. 98/847 = 0.116 or 152/787 = 0.193.

             Treatment groups
Survival     Placebo    Timolol
Died         152 (c)    98 (a)
Survived     787 (d)    847 (b)
Total        939        945
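The figures on this slide can be recomputed from the four cell counts. The sketch below is illustrative only. The function names `proportion` and `ratio` are made up, and the letters a, b, c, d follow the table.

```python
# Cell counts from the table.
a, b = 98, 847    # timolol: died, survived
c, d = 152, 787   # placebo: died, survived

def proportion(part, rest):
    """Part divided by the whole (part + rest)."""
    return part / (part + rest)

def ratio(part, other_part):
    """One part divided by another part."""
    return part / other_part

p = proportion(a, b)
print(round(p, 3))            # 0.104
print(round(100 * p, 1))      # 10.4 (percentage)
print(round(ratio(a, b), 3))  # 0.116
print(round(ratio(c, d), 3))  # 0.193
```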
26
Rates

Rates are similar to proportions except that a
multiplier (e.g., 1000, 10,000, or 100,000) is
used and they are computed over a specified
period of time. The multiplier is called the base
and the formula is
$\text{Rate} = [a/(a + b)] \times \text{base}$
For example, if the timolol study lasted exactly
one year, the rate of death per 10,000 patients
taking timolol per year is (98/945) × 10,000 =
1037 per 10,000 patients per year.
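A small Python sketch of the rate calculation. The function name `rate` and its default base of 10,000 are assumptions made for this example. The time period (one year here) is part of the interpretation and is not computed by the code.

```python
def rate(events, total, base=10_000):
    """Return events per `base` observations over the study period."""
    return events / total * base

# Deaths among timolol patients, per 10,000 patients per year.
print(round(rate(98, 945)))  # 1037
```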

27
Categorical Graphs (Nominal or Ordinal)

Pie Charts
Bar Graphs

28
Pie Charts and Nominal Data

Pie charts are commonly used to represent the
frequency of scores for nominal data
Example: patients distributed according to grade:
20% have grade I, 70% of the patients have grade
II and 10% have grade III.

29
Pie Charts (Counts and Percents)
30
Bar Charts and Nominal Data

Bar charts are sometimes used to represent the
frequency of scores for nominal data
Here, frequency is expressed as a percentage of
the total number of males and females
(78 and 68)

31
Vertical Bar Graphs
32
Horizontal Bar Graphs
33
Numerical Graphs

Histograms
Frequency polygons
Boxplots

34
Example

What is the age of this group of children?
4 7 7 7 8 8 7 8 9 4 7 3 6 9 10 5 7
10 6 8
7 8 7 8 7 4 5 10 10 0 9 8 3 7 9 7 9
5 8 5
0 4 6 6 7 5 3 2 8 5 10 9 10 6 4 8 8
8 4 8
7 3 7 8 8 8 7 9 7 5 6 3 4 8 7 5 7
3 3 6
5 7 5 7 8 8 7 10 5 4 3 7 6 3 9 7 8
5 7 9
9 3 1 8 6 6 4 8 5 10 4 8 10 5 5 4 9
4 7 7
7 6 6 4 4 4 9 7 10 4 7 5 10 7 9 2 7
5 9 10
3 7 2 5 9 8 10 10 6 8 3

35
Frequency Tables

A frequency table shows how often each value of
the variable occurs.
Also called frequency distribution table

Age (years) Frequency
10 14
9 15
8 26
7 31
6 13
5 18
4 16
3 12
2 3
1 1
0 2
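A frequency table can be built by counting how often each value occurs. The Python sketch below does this for the first 20 ages of the example data only, to keep it short, so its counts are smaller than those in the full table above. It assumes the standard library `collections.Counter`.

```python
from collections import Counter

# First 20 ages from the example data (a subset of the 151 children).
ages = [4, 7, 7, 7, 8, 8, 7, 8, 9, 4, 7, 3, 6, 9, 10, 5, 7, 10, 6, 8]

frequency = Counter(ages)  # maps each age to the number of times it occurs

print("Age (years)  Frequency")
for age in sorted(frequency, reverse=True):
    print(f"{age:>11}  {frequency[age]:>9}")
```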
36
Histograms

A way of visually representing information
contained in a frequency table
Histograms are kind of like bar charts: bars are
used instead of connected points
The bars typically cover intervals of values.
The first bar here covers scores > 0 and < 1.
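As a rough text-only stand-in for the plotted histogram, the sketch below draws one bar per age from the frequency table of the 151 children and reports both the count and the percentage. The bar character and layout are arbitrary choices for this illustration.

```python
# Frequency table from the lecture: age in years -> number of children.
age_counts = {0: 2, 1: 1, 2: 3, 3: 12, 4: 16, 5: 18,
              6: 13, 7: 31, 8: 26, 9: 15, 10: 14}

total = sum(age_counts.values())

for age in sorted(age_counts):
    count = age_counts[age]
    percent = 100 * count / total
    # One '#' per child, so bar length is proportional to frequency.
    print(f"{age:>2} | {'#' * count:<31} {count:>2} ({percent:4.1f}%)")
```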

37
Histogram
Note that these are analogous to counts and
percents with bar charts
38
Frequency Polygon

Another way of visually representing the
information contained in a frequency table
Align all possible values on the bottom of the
graph (the x-axis)
On the vertical line (the y-axis), place a point
denoting the frequency of scores for each value
Connect the points
(Typically add an extra value above and below the
actual range of values)

39
Boxplots

Boxplots graphically represent the scores in a
distribution
Made using the 5-number summary (minimum, Q1, median,
Q3, maximum)
Within the box are all scores that fall between
the 25th and 75th percentile
The whiskers capture all scores within 1.5 IQRs
of the box boundary
Outliers are between 1.5 and 3 IQRs from the box
boundary
Extreme outliers are beyond 3 IQRs from the box
boundary
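The outlier rules can be written as a small classification function. This is only a sketch. The function name `classify` and the labels it returns are made up, and the quartiles used in the example calls (Q1 = 172.5, Q3 = 237.5) are those of the tips data from the interquartile range example.

```python
def classify(score, q1, q3):
    """Label a score by its distance from the box, measured in IQRs."""
    iqr = q3 - q1
    # Distance from the nearest box boundary (0 if the score is inside the box).
    if score < q1:
        distance = q1 - score
    elif score > q3:
        distance = score - q3
    else:
        distance = 0.0
    if distance <= 1.5 * iqr:
        return "within whiskers"
    if distance <= 3 * iqr:
        return "outlier"
    return "extreme outlier"

q1, q3 = 172.5, 237.5          # IQR = 65
print(classify(200, q1, q3))   # within whiskers
print(classify(350, q1, q3))   # outlier
print(classify(500, q1, q3))   # extreme outlier
```

The scores 350 and 500 are invented test values. None of the twelve observed tips falls outside the whiskers.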

40
Shapes of Distributions

These representational aids all describe
frequency distributions: the way score
frequencies are distributed with respect to the
values of the variable
Distributions can take on a number of shapes or
forms

41
Unimodal Distributions

The mode of a distribution refers to the most
frequently occurring score
In a unimodal distribution, one score occurs much
more frequently than others

42
Multimodal Distributions

In multimodal distributions, more than one mode
exists (or approximately so)
In a bimodal distribution, two modes exist

43
Rectangular or Uniform Distributions

In a uniform distribution, all values are
observed equally often

44
Symmetrical and Skewed Distributions

A symmetrical distribution is balanced: if we cut
it in half, the two sides would be mirror images
of one another
Normal distribution: a particular kind of
distribution that resembles a bell (bell-shaped
distribution)

45
Skewed Distributions

A skewed distribution is unbalanced: there may be
a cluster of scores piling on one end of the scale

46
Skewed
positively skewed distribution (skewed right)
negatively skewed distribution (skewed left)
47
Mean, median and mode
[Figure: mode, median and mean marked on two distribution curves]
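The relative positions of the three measures can be illustrated with a small invented data set. The numbers below are made up for this sketch and do not come from the lecture. In the positively skewed list the long right tail pulls the mean above the median, and mirroring the data reverses the order.

```python
from statistics import mean, median, mode

# Made-up, positively skewed scores: most values are small, a few are large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the same scores, giving a negatively skewed list.
left_skewed = [16 - x for x in right_skewed]

for label, data in (("skewed right", right_skewed),
                    ("skewed left", left_skewed)):
    print(f"{label}: mode = {mode(data)}, "
          f"median = {median(data)}, mean = {mean(data)}")
```

For the right-skewed list the order is mode < median < mean (2, 3, 4.6). For the left-skewed list it is mean < median < mode (11.4, 13, 14).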
48
Using different measures of central tendency

Two factors are important in making the decision
of which measure of central tendency should be
used
Scale of measurement (ordinal or numerical)
Shape of the distribution of observations.
A distribution can be symmetric, skewed to the
right (positively skewed), or skewed to the left
(negatively skewed).

49
Using different measures of central tendency

The following guidelines help the researcher decide
which measure is best with a given set of data

The mean is used for numerical data with a
symmetric distribution.

50
Using different measures of central tendency (cont)

The median is used for ordinal data or for
numerical data whose distribution is skewed.

51
Using different measures of central tendency (cont)

The mode is used primarily for nominal or ordinal
data or for numerical data with a bimodal
distribution.

52
Using different measures of dispersion

The following guidelines help investigators
decide which measure of dispersion is most
appropriate for a given set of data
The standard deviation is used when the mean is
used, i.e., with symmetric distributions of
numerical data.
Percentiles and the interquartile range are used
in two cases:
- When the median is used, i.e., with ordinal data
or with skewed numerical data.
- When the mean is used but the objective is to
compare individual observations with a set of
norms.
The interquartile range is used to describe the
central 50% of the distribution, regardless of the shape.
The range is used with numerical data when the
purpose is to emphasize extreme values.
The coefficient of variation is used when the
intent is to compare two numerical distributions
measured on different scales.

53
General principles concerning the construction of
tables

Tables should be fully self-explanatory.
Units should be stated for each numerical
variable
Do not try to include too much information in a
single table. Simplicity, with reduction of
contents to the minimum is essential.

54
General principles concerning the construction of
tables (cont)

The function of ruling is to provide clarity of
interpretation
Unnecessary ruling should be avoided.
Spacing can provide the same effect as ruling
As a general rule, ruling should be included to
set off the title of the table, to divide major
row and column headings, and to close the table
bottom.

Id   age   sex   bp
1    23    m     124
2
3
4
5
55
General principles concerning the construction of
tables (cont)

Numerical entries of zero should be explicitly
written rather than indicated by a dash or a
dotted line.
--- or __
A dash or a dotted line should be reserved for
data that are missing or unobserved.
Zero is a number, and numerical observations of
zero should be explicitly presented as such.
E.g. If a survey shows no cases of poliomyelitis
in a particular county in a particular year, the
entry should indicate this fact. If the
information from that particular county was
incomplete or otherwise unavailable, a dash or a
dotted line should be used

56
General principles concerning the construction of
tables (cont)

A numerical entry should not begin with a decimal
point.
The reader runs some risk of interpreting a
leading decimal point as a foreign object.
This misinterpretation can be avoided quite
simply by showing a leading zero immediately to
the left of the decimal point.
E.g. write 0.5 instead of .5.
Numbers indicating values of the same
characteristic should be reported to the same
number of decimal places.
E.g. don't write age = 21, 23.4, 27.65
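These rules for numerical entries (explicit zeros, a dash for missing data, a leading zero, and a common number of decimal places) can be applied mechanically. The sketch below is one possible way to do it in Python. The helper name `format_entry` and the use of `None` to stand for a missing observation are assumptions of this example.

```python
def format_entry(value, decimals=2):
    """Format one table entry with a fixed number of decimal places.

    None stands for a missing or unobserved value and is shown as a dash.
    A true zero is printed as a number, never as a dash.
    """
    if value is None:
        return "-"
    return f"{value:.{decimals}f}"

ages = [21, 23.4, 27.65]
print([format_entry(a) for a in ages])  # ['21.00', '23.40', '27.65']
print(format_entry(.5, decimals=1))     # 0.5 (leading zero is added)
print(format_entry(0, decimals=0))      # 0
print(format_entry(None))               # -
```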

57
General principles concerning the construction of
graphs

Graphs should be fully self-explanatory
Many readers don't read the detailed text, they
just look at the graph.
The contents of the graph should be as complete
as possible.
Title should include information concerning who
or what the subjects or experimental material
are,
what observations are abstracted from those
subjects or material,
and what restrictions of time and place apply to
the graph.
E.g. a presentation of birth rates in the state
of Michigan
should never be headed merely "Birth Rates,"
but might well be modified to say "Birth Rates
per 1,000 Population, White Race, Michigan,
1920-1960."
If the length of title becomes a problem,
additional essential material can frequently be
included in a footnote.
In fact the graph should be as self-contained as
possible, requiring as little outside information
for clear interpretation as is feasible.

58
General principles concerning the construction of
graphs (cont)

Vertical and horizontal scales should be clearly
labeled and units should be identified.
Most graphs present numerical information in
scaled form.
Scales must be labeled in order to describe fully
the variable presented on the scale, and for
measurement variables the units of measurement
should be identified.
e.g. weight (gms), age (years) etc...

59
General principles concerning the construction of
graphs (cont)

Do not try to include too much information in a
single graph.
It is better to include several graphs than to
compress information too much.
A device frequently used for the presentation of
many curves or trends is the presentation of a
series of small graphs.
A safe rule of thumb is to avoid graphs
containing more than 3 curves.

60
General principles concerning the construction of
graphs (cont)

Graphs are intended to give an overview rather
than a highly detailed picture of a set of data.
Do not include too much detail in a graph.
Detailed presentations should be reserved for
tables.
Graphs condense detail to permit the reader to see the
forest rather than the trees.
If your main interest is in the trees, use a
table.
The inclusion of too much detail in a graph will
tend to obscure the essential points.
Avoid inclusion of numbers within the body of a
graph.