Descriptive Statistics - PowerPoint PPT Presentation


Descriptive Statistics


Provided by: Dr2081



1
Lecture 4
  • Descriptive Statistics
  • Other descriptive measures
  • Displaying data in tables and graphs

2
Measures of Variability
  • Consider the following two data sets on the ages
    of all patients suffering from bladder cancer and
    prostatic cancer.
  • The mean age of each of the two groups is 40
    years.
  • If we do not know the ages of individual patients
    and are told only that the mean age of the
    patients in the two groups is the same, we might
    wrongly conclude that the patients in the two
    groups have a similar age distribution.
  • However, the variation in the patients' ages in
    each of these two groups is very different.
  • The ages of the prostatic cancer patients have a
    much larger variation than the ages of the
    bladder cancer patients.

Bladder cancer (BC): 39 45 36 40 35 38 47
Prostatic cancer (PC): 27 52 18 33 70
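A minimal sketch of the point made on this slide, using the two age lists given above: both groups have the same mean, but a very different spread. The helper name summarize is an assumption made for illustration, and the range is used here only as a first, rough measure of spread.

```python
from statistics import mean

# Ages from the slide: bladder cancer (BC) and prostatic cancer (PC) patients.
bc_ages = [39, 45, 36, 40, 35, 38, 47]
pc_ages = [27, 52, 18, 33, 70]


def summarize(ages):
    """Return (mean, range) for a list of ages (hypothetical helper)."""
    return mean(ages), max(ages) - min(ages)


bc_mean, bc_range = summarize(bc_ages)
pc_mean, pc_range = summarize(pc_ages)

# Same mean in both groups, but the PC ages are spread much more widely.
print(f"BC: mean = {bc_mean}, range = {bc_range}")
print(f"PC: mean = {pc_mean}, range = {pc_range}")
```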
3
Measures of Variability
  • Measure the spread in the data
  • Some important measures
  • Range
  • Mean deviation
  • Variance
  • Standard Deviation
  • Coefficient of variation
  • Interquartile Range

4
Variability
  • The purpose of the majority of medical,
    behavioural and social science research is to
    explain or account for variance or differences
    among individuals or groups.
  • Examples
  • What factors account for the variance (or
    difference) in IQ among individuals?
  • What factors account for the variance in
    treatment compliance among different groups of
    patients?

5
Range
  • The range tells us the span over which the data
    are distributed, and is only a very rough measure
    of variability
  • Range: the difference between the maximum and
    minimum scores
  • Example: the largest amount of tips made in a
    night is 270 and the smallest is 150. Therefore,
    the range of tips made that night is
    270 - 150 = 120
  • Range is the simplest measure of dispersion.
  • It is not the best measure of dispersion as it
    depends entirely on the extreme scores and tells
    us nothing about the middle values.

6
Variation
  • This is an example of data with NO variability

X    X - mean
5    0.00
5    0.00
5    0.00
5    0.00
5    0.00
ΣX = 25, n = 5, mean = 5

7
Variation
  • This is an example of data with low variability

X    X - mean
6    1.00
4    -1.00
6    1.00
5    0.00
4    -1.00
ΣX = 25, n = 5, mean = 5

8
Variation
  • X
  • This is an example of data with higher
    variability

X    X - mean
8    3.00
1    -4.00
9    4.00
5    0.00
2    -3.00
ΣX = 25, n = 5, mean = 5

9
Mean deviation
  • The best measures of dispersion should
  • take into account all the scores in the
    distribution
  • and should describe the average deviation of the
    scores around the mean.
  • Normally, to find the average we would want to
    sum all deviations from the mean and then divide
    by n, i.e., Σ(X - µ)/n
  • BUT we have a problem: Σ(X - µ) will always add
    up to zero

10
Deviations from the mean
  • In any group of scores, the sum of the deviations
    from the mean equals zero
  • n = 6, µ = ΣX/n = 33/6 = 5.50

X    X - µ
3    3 - 5.50 = -2.50
5    5 - 5.50 = -0.50
9    9 - 5.50 = 3.50
2    2 - 5.50 = -3.50
8    8 - 5.50 = 2.50
6    6 - 5.50 = 0.50
ΣX = 33    Σ(X - µ) = 0.00
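The same check can be sketched in a few lines of Python, using the six scores from this slide. The variable names are illustrative assumptions; the arithmetic follows the table.

```python
# Scores from the slide.
scores = [3, 5, 9, 2, 8, 6]

# Mean: sum of the scores divided by the number of scores (33 / 6).
mu = sum(scores) / len(scores)

# Deviation of each score from the mean.
deviations = [x - mu for x in scores]

# The deviations cancel out, so their sum is zero.
total = sum(deviations)
print(mu, deviations, total)
```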

11
Variance and Standard Deviation
  • However, if we square each of the deviations from
    the mean, we obtain a sum that is not equal to
    zero
  • This is the basis for the measures of variance
    and standard deviation, the two most common
    measures of variability (or dispersion) of data

12
Variance and Standard Deviation (cont)
  • Note: Σ(X - mean)² is called the Sum of Squares

X    X - mean    (X - mean)²
8    3.00        9.00
1    -4.00       16.00
9    4.00        16.00
5    0.00        0.00
2    -3.00       9.00
ΣX = 25    Σ(X - mean) = 0.00    Σ(X - mean)² = 50.00

13
Steps to calculate standard deviation
  • Compute the mean.
  • Subtract the mean from each observation.
  • Square each of the deviations.
  • Sum them.
  • Divide by one less than the number of
    observations (almost the mean).
  • Take the square root.
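The six steps above can be sketched as a short function. This is an illustrative sketch, not the lecturer's own code; the function name sample_std is an assumption, and the data are the five scores (8, 1, 9, 5, 2) used elsewhere in the lecture.

```python
from math import sqrt


def sample_std(observations):
    """Sample standard deviation, following the six steps one by one."""
    n = len(observations)
    mean = sum(observations) / n                    # 1. compute the mean
    deviations = [x - mean for x in observations]   # 2. subtract the mean
    squared = [d ** 2 for d in deviations]          # 3. square each deviation
    sum_of_squares = sum(squared)                   # 4. sum them
    variance = sum_of_squares / (n - 1)             # 5. divide by n - 1
    return sqrt(variance)                           # 6. take the square root


scores = [8, 1, 9, 5, 2]
print(round(sample_std(scores), 2))
```

For these scores the sum of squares is 50, so the sample variance is 50/4 = 12.5 and the standard deviation is about 3.54.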

14
Variance of a Population
  • The sum of squared deviations from the mean
    divided by the number of scores (sigma squared):
    σ² = Σ(X - µ)² / N

15
Sample Variance
  • The sum of squared deviations from the mean
    divided by the number of degrees of freedom,
    n - 1 (an estimate of the population variance):
    s² = Σ(X - x̄)² / (n - 1)

16
Standard Deviation Formulas
  • Population Standard Deviation:
    σ = √[Σ(X - µ)² / N]
  • Sample Standard Deviation:
    s = √[Σ(X - x̄)² / (n - 1)]

Dividing a sample's sum of squares by n usually
underestimates the population standard deviation.
Using n - 1 in the denominator corrects for this
and gives us a better estimate of the population
standard deviation.
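As a hedged illustration of the two denominators, the sketch below uses Python's standard statistics module on the five scores 8, 1, 9, 5, 2 from earlier in the lecture, treating them first as a whole population (divide by N) and then as a sample (divide by n - 1). Treating the same five numbers both ways is an assumption made purely for comparison.

```python
import statistics

scores = [8, 1, 9, 5, 2]

# Population formulas: sum of squares divided by N.
pop_var = statistics.pvariance(scores)   # 50 / 5
pop_sd = statistics.pstdev(scores)

# Sample formulas: sum of squares divided by n - 1.
samp_var = statistics.variance(scores)   # 50 / 4
samp_sd = statistics.stdev(scores)

print(pop_var, round(pop_sd, 2))
print(samp_var, round(samp_sd, 2))
```

The sample versions are always the larger of the two, because the same sum of squares is divided by a smaller number.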
17
Why use Standard Deviation and not Variance?
  • Normally, you will only calculate variance in
    order to calculate standard deviation, as
    standard deviation is what we typically want.
  • Why? Because standard deviation expresses
    variability in the same units as the data.
  • Example: the standard deviation of ages in a
    class is 3.7 years (and the variance would be
    (3.7)² = 13.69 years²).

18
Coefficient of variation
  • It is a dimensionless measure of the relative
    variation.
  • Constructed by dividing the standard deviation by
    the mean and multiplying by 100.
  • CV = (s / x̄) × 100
  • Used to compare the variability in one data set
    with that in another when a direct comparison of
    standard deviation is not appropriate.

19
Coefficient of variation
  • The formula is
  • CV = (s / x̄) × 100
  • Suppose two samples of human males yield the
    following results

              Children    Adults
Mean age      11 yrs      25 yrs
Mean weight   80 lbs      145 lbs
SD            10 lbs      10 lbs
CV            12.5%       6.9%
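The CV values in this table can be reproduced with a small sketch. The function name coefficient_of_variation is an assumption; the inputs are the standard deviations and mean weights from the slide.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: standard deviation divided by mean, times 100."""
    return sd / mean * 100


# Both groups have SD = 10 lbs, but different mean weights.
children_cv = coefficient_of_variation(10, 80)
adults_cv = coefficient_of_variation(10, 145)

print(round(children_cv, 1), round(adults_cv, 1))
```

Although the standard deviations are equal, the children's weights vary more relative to their mean.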
20
Interquartile Range
  • Quartiles refer to the division of the
    distribution into 4 equal parts
  • Q1 marks the end of the first 25% of the scores
    (the 25th percentile)
  • Q2 marks the end of the next 25% of the scores
    (from Q1 to Q2); it is the median (50th
    percentile)
  • Q3 marks the end of the scores between Q2 and Q3
    (the 75th percentile)
  • Q4 marks the end of the final 25% of the scores
    (the 100th percentile)
  • The IQR contains the middle 50% of the scores.
    It is obtained as Q3 - Q1 (i.e. the 75th
    percentile minus the 25th percentile)

21
Calculating IQR
  • Step 1. Divide the scores into 4 equal parts
    (12/4 = 3 scores in each part)
  • Step 2. Find Q1 and Q3
  • - Q1 lies midway between the 3rd and 4th scores
  • - Q3 lies midway between the 9th and 10th scores
  • Step 3. Calculate Q3 - Q1

22
Example
  • Back to our example
  • 150, 165, 170, 175, 180, 190, 210, 210, 235, 240,
    260, 270
  • Step 1: Divide the scores into 4 equal parts
  • 150, 165, 170 | 175, 180, 190 | 210, 210, 235 |
    240, 260, 270 (Q1, Q2 and Q3 fall at the three
    dividing points)
  • Step 2: Find Q1 and Q3
  • Q1 = (170 + 175)/2 = 172.5
  • Q3 = (235 + 240)/2 = 237.5
  • Step 3: Calculate Q3 - Q1
  • Q3 - Q1 = 237.5 - 172.5 = 65
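The three steps can be sketched as follows. This follows the slide's method (four equal parts, quartile midway between neighbouring scores) and assumes the number of scores is divisible by 4; library routines such as statistics.quantiles use different interpolation rules and may give slightly different quartiles. The function name is hypothetical.

```python
def iqr_equal_parts(scores):
    """Return (Q1, Q3, IQR) using the equal-parts method from the slide."""
    data = sorted(scores)
    n = len(data)
    if n % 4 != 0:
        raise ValueError("this sketch assumes n is divisible by 4")
    k = n // 4                                # scores in each quarter
    q1 = (data[k - 1] + data[k]) / 2          # midway between k-th and (k+1)-th
    q3 = (data[3 * k - 1] + data[3 * k]) / 2  # midway between 3k-th and (3k+1)-th
    return q1, q3, q3 - q1


tips = [150, 165, 170, 175, 180, 190, 210, 210, 235, 240, 260, 270]
print(iqr_equal_parts(tips))
```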

23
Weighted Mean
Problem: You have two classes, with 5 and 25
students, respectively. In the smaller class
(n = 5), the average grade is 60. In the larger
class (n = 25), the average grade is 45. What
is the average overall?
Not this: (60 + 45)/2
Instead, weight each class average by its class
size: (5 × 60 + 25 × 45)/(5 + 25) = 47.5
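A short sketch of the weighted mean for this problem, weighting each class average by its class size. The function name weighted_mean is an assumption made for illustration.

```python
def weighted_mean(means, sizes):
    """Overall mean of several groups, weighting each mean by group size."""
    total = sum(m * n for m, n in zip(means, sizes))
    return total / sum(sizes)


# Class averages 60 and 45, class sizes 5 and 25.
overall = weighted_mean([60, 45], [5, 25])

# The naive average of the two class averages ignores class size.
naive = (60 + 45) / 2

print(overall, naive)
```

The overall average lies much closer to 45 than to 60 because 25 of the 30 students are in the larger class.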
24
Measures to use with nominal or ordinal data
  • When observations are measured on a nominal, or
    ordinal scale, the methods just discussed for
    describing the middle and the spread do not work.
  • Characteristics measured on nominal or ordinal
    scales do not have numerical values but are
    counts or frequencies of occurrence.

25
Example
  • Proportions and percentages
  • A proportion is the number (a) of observations
    with a given characteristic (such as dying)
    divided by the total number of observations,
    i.e. those who died plus those who survived
    (a + b)
  • Proportion: p = a/(a + b), e.g. 98/945 = 0.104.
  • A percentage is a proportion multiplied by 100.
  • Ratios
  • A ratio is the number (a) of observations in a
    given group with a given characteristic (such as
    dying) divided by the number (b) of observations
    without the given characteristic
  • ratio a/b
  • A ratio is always defined as a part divided by
    another part.
  • 98/847 = 0.116 or 152/787 = 0.193.

            Treatment groups
Survival    Timolol     Placebo
Died        98 (a)      152 (c)
Survived    847 (b)     787 (d)
Total       945         939
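The proportion, percentage and ratios quoted on this slide can be reproduced from the four cell counts. The sketch below uses the letters a, b, c, d as in the table; the other variable names are illustrative.

```python
# Cell counts from the table.
a, b = 98, 847    # timolol: died, survived
c, d = 152, 787   # placebo: died, survived

# Proportion: a part divided by the whole.
proportion = a / (a + b)
percentage = proportion * 100

# Ratio: a part divided by another part.
ratio_timolol = a / b
ratio_placebo = c / d

print(round(proportion, 3), round(percentage, 1))
print(round(ratio_timolol, 3), round(ratio_placebo, 3))
```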
26
Rates
  • Rates are similar to proportions except that a
    multiplier (e.g., 1000, 10,000, or 100,000) is
    used and they are computed over a specified
    period of time. The multiplier is called the base
    and the formula is
  • Rate = [a/(a + b)] × base
  • For example, if the timolol study lasted exactly
    one year, the rate of death per 10,000 patients
    taking timolol per year is (98/945) × 10,000 =
    1037 per 10,000 patients per year.
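A minimal sketch of the rate formula, using the timolol figures from the example. The function name rate and its parameter names are assumptions.

```python
def rate(events, total, base):
    """Events per `base` observations over the study period."""
    return events / total * base


# 98 deaths among 945 timolol patients in one year, per 10,000 patients.
timolol_rate = rate(98, 945, 10_000)
print(round(timolol_rate))
```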

27
Categorical Graphs (Nominal or Ordinal)
  • Pie Charts
  • Bar Graphs

28
Pie Charts and Nominal Data
  • Pie charts are commonly used to represent the
    frequency of scores for nominal data
  • Example: patients distributed according to grade
  • 20% of the patients have grade I, 70% have grade
    II and 10% have grade III.

29
Pie Charts (Counts and Percents)
30
Barcharts and Nominal Data
  • Barcharts are sometimes used to represent the
    frequency of scores for nominal data
  • Here, frequency is expressed as a percentage of
    the total number of males and females
  • (78 and 68)

31
Vertical Bar Graphs
32
Horizontal Bar Graphs
33
Numerical Graphs
  • Histograms
  • Frequency polygons
  • Boxplots

34
Example
  • What is the age of this group of children?
  • 4 7 7 7 8 8 7 8 9 4 7 3 6 9 10 5 7
    10 6 8
  • 7 8 7 8 7 4 5 10 10 0 9 8 3 7 9 7 9
    5 8 5
  • 0 4 6 6 7 5 3 2 8 5 10 9 10 6 4 8 8
    8 4 8
  • 7 3 7 8 8 8 7 9 7 5 6 3 4 8 7 5 7
    3 3 6
  • 5 7 5 7 8 8 7 10 5 4 3 7 6 3 9 7 8
    5 7 9
  • 9 3 1 8 6 6 4 8 5 10 4 8 10 5 5 4 9
    4 7 7
  • 7 6 6 4 4 4 9 7 10 4 7 5 10 7 9 2 7
    5 9 10
  • 3 7 2 5 9 8 10 10 6 8 3

35
Frequency Tables
  • A frequency table shows how often each value of
    the variable occurs.
  • Also called frequency distribution table

Age (years) Frequency
10 14
9 15
8 26
7 31
6 13
5 18
4 16
3 12
2 3
1 1
0 2
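Building such a table by hand is tedious, so here is a hedged sketch using collections.Counter. To keep it short it counts only the first 20 ages from the example data rather than the full list, so its counts are smaller than those in the table above; the function name frequency_table is an assumption.

```python
from collections import Counter

# First 20 ages from the example data (a subset, to keep the sketch short).
ages = [4, 7, 7, 7, 8, 8, 7, 8, 9, 4, 7, 3, 6, 9, 10, 5, 7, 10, 6, 8]


def frequency_table(values):
    """Return (value, frequency) pairs, largest value first as in the slide."""
    counts = Counter(values)
    return sorted(counts.items(), reverse=True)


print("Age (years)  Frequency")
for age, freq in frequency_table(ages):
    print(f"{age:>11}  {freq}")
```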
36
Histograms
  • A way of visually representing information
    contained in a frequency table
  • Histograms are kind of like bar charts: bars are
    used instead of connected points
  • The bars typically cover intervals of values.
    The first bar here covers scores > 0 and < 1.

37
Histogram
Note that these are analogous to counts and
percents with bar charts
38
Frequency Polygon
  • Another way of visual representation of
    information contained in a frequency table
  • Align all possible values on the bottom of the
    graph (the x-axis)
  • On the vertical axis (the y-axis), place a point
    denoting the frequency of scores for each value
  • Connect the points with lines
  • (Typically add an extra value above and below the
    actual range of values)
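The construction can be sketched as a list of (value, frequency) points, with one zero-frequency point added at each end so the polygon returns to the x-axis. The frequencies are those of the age table earlier in the lecture; the function name and the assumption that values are one unit apart are mine.

```python
# Frequencies from the age table (age in years -> number of children).
frequencies = {0: 2, 1: 1, 2: 3, 3: 12, 4: 16, 5: 18,
               6: 13, 7: 31, 8: 26, 9: 15, 10: 14}


def polygon_points(freq):
    """Points of a frequency polygon, padded with a zero at each end.

    Assumes the values are spaced one unit apart.
    """
    values = sorted(freq)
    points = [(values[0] - 1, 0)]              # extra value below the range
    points += [(v, freq[v]) for v in values]   # one point per observed value
    points.append((values[-1] + 1, 0))         # extra value above the range
    return points


points = polygon_points(frequencies)
print(points)
```

Joining consecutive points with straight lines gives the polygon; the padded value of -1 is not a real age, only an anchor for the left end of the line.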

39
Boxplots
  • Boxplots graphically represent the scores in a
    distribution
  • Made using the five-number summary (minimum, Q1,
    median, Q3, maximum)
  • Within the box are all scores that fall between
    the 25th and 75th percentile
  • The whiskers capture all scores within 1.5 IQRs
    of the box boundary
  • Outliers are between 1.5 and 3 IQRs from the box
    boundary
  • Extreme outliers are beyond 3 IQRs from the box
    boundary
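These rules can be sketched as a small classifier. The quartiles used below (Q1 = 172.5, Q3 = 237.5) come from the tips example earlier in the lecture; the test scores 300, 400 and 500 are hypothetical values chosen only to land in each category, and the function name classify is an assumption.

```python
def classify(score, q1, q3):
    """Place a score relative to the box using the 1.5 and 3 IQR rules."""
    iqr = q3 - q1
    if score < q1:
        distance = q1 - score      # distance below the box
    elif score > q3:
        distance = score - q3      # distance above the box
    else:
        return "inside box"
    if distance <= 1.5 * iqr:
        return "within whiskers"
    if distance <= 3 * iqr:
        return "outlier"
    return "extreme outlier"


q1, q3 = 172.5, 237.5   # IQR = 65, so 1.5 IQRs = 97.5 and 3 IQRs = 195
for score in (200, 300, 400, 500):
    print(score, classify(score, q1, q3))
```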

40
Shapes of Distributions
  • These representational aids all describe
    frequency distributions: the way score
    frequencies are distributed with respect to the
    values of the variable
  • Distributions can take on a number of shapes or
    forms

41
Unimodal Distributions
  • The mode of a distribution refers to the most
    frequently occurring score
  • In a unimodal distribution, one score occurs much
    more frequently than others

42
Multimodal Distributions
  • In multimodal distributions, more than one mode
    exists (or approximately so)
  • In a bimodal distribution, two modes exist

43
Rectangular or Uniform Distributions
  • In a uniform distribution, all values are
    observed equally often

44
Symmetrical and Skewed Distributions
  • A symmetrical distribution is balanced: if we cut
    it in half, the two sides would be mirror images
    of one another
  • Normal distribution: a particular kind of
    symmetrical distribution that resembles a bell
    (bell-shaped distribution)

45
Skewed Distributions
  • A skewed distribution is unbalanced: there may be
    a cluster of scores piling up at one end of the
    scale

46
Skewed
positively skewed distribution (skewed right)
negatively skewed distribution (skewed left)
47
Mean, median and mode
[Figure: positions of the mode, median and mean marked on two skewed distributions]
48
Using different measures of central tendency
  • Two factors are important in making the decision
    of which measure of central tendency should be
    used
  • Scale of measurement (ordinal or numerical)
  • Shape of the distribution of observations.
  • A distribution can be symmetric, skewed to the
    right (positively skewed) or skewed to the left
    (negatively skewed).

49
Using different measures of central tendency
The following guidelines help the researcher decide
which measure is best with a given set of data:
  • The mean is used for numerical data and for
    symmetric distributions.

50
Using different measures of central tendency
The following guidelines help the researcher decide
which measure is best with a given set of data:
  • The median is used for ordinal data or for
    numerical data whose distribution is skewed.

51
Using different measures of central tendency
The following guidelines help the researcher decide
which measure is best with a given set of data:
  • The mode is used primarily for nominal or ordinal
    data or for numerical data with a bimodal
    distribution.
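To illustrate why the choice of measure matters, the sketch below computes all three on a small, hypothetical right-skewed data set (the values are invented for illustration and do not come from the lecture). One large value pulls the mean upwards while the median and mode stay with the bulk of the data.

```python
from statistics import mean, median, multimode

# Hypothetical right-skewed data: most values are small, one is large.
values = [1, 2, 2, 3, 3, 3, 4, 20]

data_mean = mean(values)       # pulled towards the extreme value
data_median = median(values)   # middle of the ordered values
data_modes = multimode(values) # most frequent value(s)

print(data_mean, data_median, data_modes)
```

Here the mean (4.75) is larger than every value except one, which is why the median is preferred for skewed numerical data.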

52
Using different measures of dispersion
  • The following guidelines help investigators
    decide which measure of dispersion is most
    appropriate for a given set of data
  • The standard deviation is used when the mean is
    used, i.e., with symmetric distributions of
    numerical data.
  • Percentiles and the interquartile range are used
    in two cases:
  • When the median is used, i.e., with ordinal data
    or with skewed numerical data.
  • When the mean is used but the objective is to
    compare individual observations with a set of
    norms.
  • The interquartile range is used to describe the
    central 50% of the distribution, regardless of
    the shape.
  • The range is used with numerical data when the
    purpose is to emphasize extreme values.
  • The coefficient of variation is used when the
    intent is to compare two numerical distributions
    measured on different scales.

53
General principles concerning the construction of
tables
  • Tables should be fully self-explanatory.
  • Units should be stated for each numerical
    variable
  • Do not try to include too much information in a
    single table. Simplicity, with reduction of
    contents to the minimum is essential.

54
General principles concerning the construction of
tables (cont)
  • The function of ruling is to provide clarity of
    interpretation
  • Unnecessary ruling should be avoided.
  • Spacing can provide the same effect as ruling
  • As a general rule, ruling should be included to
    set off the title of the table, to divide major
    row and column headings, and to close the table
    bottom.



Id    age    sex    bp
1     23     m      124
2
3
4
5
55
General principles concerning the construction of
tables (cont)
  • Numerical entries of zero should be explicitly
    written rather than indicated by a dash or a
    dotted line.
  • --- or __
  • A dash or a dotted line should be reserved for
    data that are missing or unobserved.
  • Zero is a number, and numerical observations of
    zero should be explicitly presented as such.
  • E.g. If a survey shows no cases of poliomyelitis
    in a particular county in a particular year, the
    entry should indicate this fact. If the
    information from that particular county was
    incomplete or otherwise unavailable, a dash or a
    dotted line should be used

56
General principles concerning the construction of
tables (cont)
  • A numerical entry should not begin with a decimal
    point.
  • The reader runs some risk of interpreting a
    leading decimal point as a foreign object.
  • This misinterpretation can be avoided quite
    simply by showing a leading zero immediately to
    the left of the decimal point.
  • E.g. write 0.5 instead of .5.
  • Numbers indicating values of the same
    characteristic should be reported to the same
    number of decimal places.
  • E.g. don't write age = 21, 23.4, 27.65
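These formatting rules (explicit zeros, a dash only for missing data, a leading zero before the decimal point, a consistent number of decimal places) can be sketched with Python's format specifiers. The helper name format_cell and the use of None to stand for a missing observation are assumptions made for illustration.

```python
def format_cell(value, places=2):
    """Format one table entry with a fixed number of decimal places.

    None stands for a missing observation and is shown as a dash;
    a true zero is printed as a number.
    """
    if value is None:
        return "---"
    return f"{value:.{places}f}"


print(format_cell(0.5, 1))    # leading zero is kept: 0.5, not .5
print(format_cell(0, 0))      # an observed zero is written explicitly
print(format_cell(None))      # missing data gets the dash
print([format_cell(age) for age in (21, 23.4, 27.65)])
```

Formatting every age with the same specifier gives 21.00, 23.40 and 27.65 instead of a mix of precisions.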

57
General principles concerning the construction of
graphs
  • Graphs should be fully self-explanatory
  • Many readers don't read the detailed text, they
    just look at the graph.
  • The contents of the graph should be as complete
    as possible.
  • Title should include information concerning who
    or what the subjects or experimental material
    are,
  • what observations are abstracted from those
    subjects or material,
  • and what restrictions of time and place apply to
    the graph.
  • E.g. a presentation of birth rates in the state
    of Michigan
  • should never be headed merely "Birth Rates,"
  • but might well be modified to say "Birth Rates
    per 1,000 Population, White Race, Michigan,
    1920-1960."
  • If the length of title becomes a problem,
    additional essential material can frequently be
    included in a footnote.
  • In fact the graph should be as self-contained as
    possible, requiring as little outside information
    for clear interpretation as is feasible.

58
General principles concerning the construction of
graphs (cont)
  • Vertical and horizontal scales should be clearly
    labeled and units should be identified.
  • Most graphs present numerical information in
    scaled form.
  • Scales must be labeled in order to describe fully
    the variable presented on the scale, and for
    measurement variables the units of measurement
    should be identified.
  • e.g. weight (gms), age (years) etc.

59
General principles concerning the construction of
graphs (cont)
  • Do not try to include too much information in a
    single graph.
  • It is better to include several graphs than to
    compress information too much.
  • A device frequently used for the presentation of
    many curves or trends is the presentation of a
    series of small graphs.
  • A safe rule of thumb is to avoid graphs
    containing more than 3 curves.

60
General principles concerning the construction of
graphs (cont)
  • Graphs are intended to give an overview rather
    than a highly detailed picture of a set of data.
  • Do not include too much detail in a graph.
  • Detailed presentations should be reserved for
    tables.
  • Graphs condense detail to permit the reader to
    see the forest rather than the trees.
  • If your main interest is in the trees, use a
    table.
  • The inclusion of too much detail in a graph will
    tend to obscure the essential points.
  • Avoid inclusion of numbers within the body of a
    graph.