Statistics bootcamp - PowerPoint PPT Presentation

1
Statistics bootcamp
  • Laine Ruus
  • Data Library Service, University of Toronto
  • Rev. 2005-04-26

2
Outline
  • Describing a variable
  • Describing relationships among two or more
    variables

3
First, some vocabulary
  • Variable: In social science research, for each
    unit of analysis, each item of data (e.g., age of
    person, income of family, consumer price index)
    is called a variable.
  • Unit of analysis: The basic observable entity
    being analyzed by a study and for which data are
    collected in the form of variables. A unit of
    analysis is sometimes referred to as a "case" or
    "observation".

4
The variable Sex can be coded as
  • 1 = male, 2 = female, 3 = no response, or
  • 1 = female, 2 = male, or
  • 1 = male, 2 = female, or
  • M = male, F = female, or
  • male, female, or even
  • 1 = yes, 2 = no, 3 = maybe

5
The values a variable can take must be
  • exhaustive: include the characteristics of all
    cases, and
  • mutually exclusive: each case must have one and
    only one value or code for each variable

6
What's wrong with this coding scheme?
  • Under 3,000
  • 3,000-7,000
  • 8,000-12,000
  • 13,000-17,000
  • 18,000-22,000
  • 23,000-27,000
  • 28,000-32,000
  • 33,000-37,000
  • 38,000 and over
  • (Source: Census of Canada, 1961 user summary
    tapes)

7
Variables are normally coded numerically, because
  • arithmetic is easier with numbers than with
    letters
  • some characteristics are inherently numeric: age,
    weight in kilograms or pounds, number of children
    ever borne, income, value of dwelling, years of
    schooling, etc.
  • space/size of data sets has, until recently, been
    a major consideration

8
Three basic types of variables
  • Categorical
      • Nominal, aka nonorderable discrete, e.g.,
        gender, ethnicity, immigrant status, province
        of residence, marital status, labour force
        status, etc.
      • Ordinal, aka orderable discrete, e.g.,
        highest level of schooling, social class,
        left-right political identification, Likert
        scales, income groups, age groups, etc.
  • Continuous, aka interval, numeric, e.g., actual
    age, income, etc.

9
Descriptive statistics summarize the properties
of a sample of observations
  • how the units of observation are the same
    (central tendency)
  • how they are different (dispersion)
  • how representative the sample is of the
    population at large (significance)

10
Nominal variable
  • Central tendency
      • mode
  • Dispersion
      • frequency distribution
      • percentages, proportions, odds
      • index of qualitative variation (IQV)
  • Significance
      • coefficient of variation (CV)
  • Visualization
      • bar chart

11
Mode: the category with the largest number or
percentage of observations in a frequency
distribution
12
The frequencies can be visualized as a bar chart
(based on percentages)
13
The same distribution, from one of the Canadian
overview files
14
and showing percentages
15
Notice the differences
16
Why the differences?
  • What is the population in each table?
  • What is in the denominator in each table?
  • Which one is correct?

17
The most important thing to know about any
distribution, whether it is a rate, a
proportion, or a percent, is what is in the
denominator. And it must always be reported.
18
We can also derive the distribution information
from the 2001 individuals PUMF (public use
microdata file) using a statistical package
19
If we weight the distribution, both the
frequencies and the percentages will almost match
the distribution from the Profile file
20
Just a few words on weighting
  • The weight is the inverse of the chance that any
    member of the population (universe) had of being
    selected for the sample
  • In general
      • the weight can be used to produce estimates for
        the total population (population weight)
      • and/or the weight can adjust for known
        deficiencies in the sample (sample weight)

21
and a few more words on weighting
  • The 2001 census public use microdata file of
    individuals is a 2.7% sample of the population
  • The weight variable (weightp) ranges from
    35.545777 to 39.464996
  • Knowing who was excluded from the sample is as
    important as knowing who was included

22
And some final words on weighting
  • When to use the population weight variable
      • when you are producing frequencies to reflect
        the frequency in the population
  • When you don't need to use the population weight
    variable
      • when you are producing percentages, proportions,
        ratios, rates, etc.
  • When to use a sample weight variable
      • always
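
The effect of a population weight can be sketched in a few lines of Python. The records and weights below are invented for illustration (real weights would come from a variable such as weightp); the point is only that weighted counts estimate population frequencies, while percentages change very little when the weights are nearly constant.

```python
from collections import defaultdict

# Hypothetical sample records: (category, weight). These values are made up;
# they only mimic a weight variable whose values cluster around 37.
records = [
    ("married", 37.0), ("married", 36.5), ("single", 37.5),
    ("single", 38.0), ("married", 36.0), ("widowed", 37.0),
]

unweighted = defaultdict(int)
weighted = defaultdict(float)
for category, weight in records:
    unweighted[category] += 1       # number of sample cases
    weighted[category] += weight    # estimated number in the population

n_sample = sum(unweighted.values())
n_population = sum(weighted.values())

for category in unweighted:
    pct_unweighted = 100 * unweighted[category] / n_sample
    pct_weighted = 100 * weighted[category] / n_population
    print(f"{category:8s} n={unweighted[category]} "
          f"est. population={weighted[category]:.1f} "
          f"unweighted %={pct_unweighted:.1f} weighted %={pct_weighted:.1f}")
```

With these made-up weights the weighted and unweighted percentages differ by less than one point, while the weighted counts are roughly 37 times the sample counts.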

23
Proportions, percents, and odds
  • Percent: the percent married in the population
    15 years and over is
      • (married / population 15 years and over) × 100 = 49.47
  • Proportion: the proportion married in the
    population 15 years and over is
      • (married / population 15 years and over) = .4947
  • Odds: the odds on being married are
      • (married / not married in the population), or
      • proportion married / (1 - proportion married)
      • .4947 / (1 - .4947) = .4947 / .5053 = .9790
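
The same arithmetic can be checked in Python. This is a minimal sketch that starts from the proportion quoted on the slide, since the underlying counts are not given in the transcript.

```python
# Proportion married among the population 15 years and over (from the slide).
proportion = 0.4947

percent = proportion * 100              # a percent is a rate per 100
odds = proportion / (1 - proportion)    # married versus not married

print(f"proportion = {proportion:.4f}")
print(f"percent    = {percent:.2f}")
print(f"odds       = {odds:.4f}")
```

Odds below 1 mean that being married is slightly less common than not being married in this population.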

24
Coefficient of variation
  • measures how representative the variable in the
    sample is of the distribution in the population
  • computed as ((standard deviation / mean) × 100)
  • we will discuss these measures in the context
    of continuous variables
  • see Stats Can guidelines in user guides
  • CV < 16.6 is OK to publish; CV > 33.3: do not
    publish
  • SDA reports the CV when generating frequencies
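
As a rough illustration, the formula and the two publication thresholds quoted above could be coded as follows. The income values are hypothetical, and the function names are made up for this sketch; the slide does not say how to treat values between the two thresholds, so the code only labels that band.

```python
import statistics

def coefficient_of_variation(values):
    """Return (standard deviation / mean) * 100, as described on the slide."""
    return statistics.stdev(values) / statistics.mean(values) * 100

def publication_guideline(cv):
    # Thresholds quoted on the slide; the middle band is not described there.
    if cv < 16.6:
        return "ok to publish"
    if cv > 33.3:
        return "do not publish"
    return "between the two thresholds"

incomes = [28, 30, 31, 29, 32]   # hypothetical values, in thousands
cv = coefficient_of_variation(incomes)
print(f"CV = {cv:.1f} -> {publication_guideline(cv)}")
```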

25
(No Transcript)
26
Ordinal variable
  • Central tendency
      • median and mode
  • Dispersion
      • frequency distribution
      • range
      • percentages/quantiles, proportions, odds
      • index of qualitative variation (IQV)
  • Significance
      • coefficient of variation (CV)
  • Visualization
      • histogram

27
Example frequency distribution
28
Example relative percentages
29
The median is the value that divides an orderable
distribution exactly into halves.
  • Finding the median is easier if we compute
    cumulative percentages, e.g., in Excel

30
Age group                %     Cum. %
0-4 years              5.65      5.65
5-9 years              6.59     12.24
10-14 years            6.84     19.08
15-24 years           13.36     32.44
25-34 years           13.31     45.75
35-44 years           17.00     62.75
45-54 years           14.73     77.48
55-64 years            9.56     87.04
65-74 years            7.14     94.18
75-84 years            4.43     98.61
85 years and over      1.39    100
31
So how can we describe this distribution, using
the vocabulary we have so far?
  • What is the mode of this distribution?
  • What is the median?
  • What is the range?
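
One way to work through the first two questions is sketched below in Python, using the age-group percentages from the preceding distribution. The short group labels are introduced here for convenience. Note that the groups are not all the same width (5-year and 10-year bands), which affects which category comes out as the mode.

```python
from itertools import accumulate

# Age groups and percentages from the distribution shown earlier.
groups = ["0-4", "5-9", "10-14", "15-24", "25-34", "35-44",
          "45-54", "55-64", "65-74", "75-84", "85+"]
percents = [5.65, 6.59, 6.84, 13.36, 13.31, 17.00,
            14.73, 9.56, 7.14, 4.43, 1.39]

cumulative = list(accumulate(percents))

# Mode: the category with the largest percentage.
mode_group = groups[percents.index(max(percents))]

# Median: the first category whose cumulative percentage reaches 50.
median_group = next(g for g, c in zip(groups, cumulative) if c >= 50)

print("mode:", mode_group)
print("median category:", median_group)
```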

32
(No Transcript)
33
A statistical package can also report these
measures
34
Percentiles/quantiles
  • percentiles/quantiles are the values below which a
    given percentage of the cases fall
  • the median = the 50th percentile
  • quartiles are the values that break a distribution
    into 4 even intervals, each containing 25% of the
    cases

35
Interquartile range
  • The interquartile range is the difference between
    the value at 75% of cases, and the value at 25%
    of cases
  • What is the interquartile range for the following
    distribution?

36
Age group                %     Cum. %
0-4 years              5.65      5.65
5-9 years              6.59     12.24
10-14 years            6.84     19.08
15-24 years           13.36     32.44
25-34 years           13.31     45.75
35-44 years           17.00     62.75
45-54 years           14.73     77.48
55-64 years            9.56     87.04
65-74 years            7.14     94.18
75-84 years            4.43     98.61
85 years and over      1.39    100
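
For grouped data like this, the quartiles can only be located to within an age group. The sketch below finds the groups in which the cumulative percentage reaches 25 and 75; the helper name `group_at` and the short labels are introduced here for illustration.

```python
from itertools import accumulate

groups = ["0-4", "5-9", "10-14", "15-24", "25-34", "35-44",
          "45-54", "55-64", "65-74", "75-84", "85+"]
percents = [5.65, 6.59, 6.84, 13.36, 13.31, 17.00,
            14.73, 9.56, 7.14, 4.43, 1.39]
cumulative = list(accumulate(percents))

def group_at(percentile):
    """Return the first age group whose cumulative percentage reaches the percentile."""
    return next(g for g, c in zip(groups, cumulative) if c >= percentile)

q1_group = group_at(25)
q3_group = group_at(75)
print(f"25% of cases is reached in the {q1_group} group")
print(f"75% of cases is reached in the {q3_group} group")
```

The interquartile range therefore runs from somewhere in the first of these groups to somewhere in the second; a single-number answer would require interpolating within the groups.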
37
Continuous variable
  • Central tendency
      • mean, median and mode
  • Dispersion
      • range
      • variance or standard deviation
      • quantiles/percentiles
      • interquartile range
  • Significance
      • standard error
      • coefficient of variation
  • Visualization
      • polygon (line graph)

38
Means, variances, and standard deviations
  • Mean: the average, computed by adding up the
    values of all observations and dividing by the
    number of observations (cases)
  • Variance: computed by taking each value and
    subtracting the value of the mean from it,
    squaring the results, adding them up, and
    dividing by the number of cases (actually N-1)
  • Standard deviation: the distance on either side
    of the mean within which about 68% of the cases
    fall, in a normal distribution. It is the square
    root of the variance, in the same metric as the
    variable.
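
The three definitions translate directly into code. The following is a minimal sketch on made-up observations, with a cross-check against Python's standard `statistics` module, which also divides by N-1 for the sample variance.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations

n = len(values)
mean = sum(values) / n                                  # add up, divide by N
squared_deviations = [(v - mean) ** 2 for v in values]  # (value - mean) squared
variance = sum(squared_deviations) / (n - 1)            # divide by N-1
std_dev = math.sqrt(variance)                           # same metric as the variable

print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {std_dev:.2f}")

# Cross-check against the standard library.
assert math.isclose(variance, statistics.variance(values))
assert math.isclose(std_dev, statistics.stdev(values))
```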

39
Availability of continuous variables in Stats Can
products
  • Stats Can rarely publishes truly continuous
    variables in its aggregate statistics products
  • Some exceptions are
  • age by single years (census)
  • estimates of population by single years of age
    (Annual demographic statistics)

40
Statistics Canada generally reports the
distribution of continuous variables as
  • Measures of central tendency
  • an ordinal (categorical) variable
  • an average (mean) and standard error, variance,
    or standard deviation
  • a median
  • rates (other than per 100), e.g., children ever
    born per 1,000 women in the 1991 census 2B profile
  • Measures of dispersion
  • percentage: essentially, a rate per 100
    population. E.g., incidence of low income as a %
    of economic families, in the census profiles.
    This includes employment and unemployment rates
  • quantiles (e.g., quintiles in Income trends in
    Canada)
  • Gini coefficients (computed from the coefficient
    of variation) (e.g., Income trends in Canada)
  • Measures of significance
  • standard error

41
In the following distribution
  • What is the median?
  • What is the mean?
  • What is the range?

42
(No Transcript)
43
Using the percentages and cumulative percentages
  • What is your best estimate of the interquartile
    range?
  • What is your best estimate of the standard
    deviation?
  • Why is the average (mean) income so much higher
    than the median?
  • See pages 18-19 of your handout

44
As a percentage distribution
45
The polygon produced by Beyond 20/20 isn't very
useful
46
Using the standard error to describe more of the
distribution
  • standard error of the mean is a measure of how
    likely it is that the mean in the data we are
    looking at is the same as or similar to the mean
    in the population at large
  • computed as the standard deviation divided by
    the square root of N
  • the larger the N, the smaller the standard error,
    and the more confidence we can have in the
    distribution in the sample as representative of
    the population
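
A short sketch of that computation, using hypothetical income values, shows how the standard error shrinks as N grows even when the spread of the data stays about the same.

```python
import math
import statistics

def standard_error(values):
    """Standard deviation divided by the square root of N."""
    return statistics.stdev(values) / math.sqrt(len(values))

small_sample = [20, 25, 30, 35, 40]   # hypothetical incomes, in thousands
large_sample = small_sample * 20      # same values repeated: 100 cases

print(f"N={len(small_sample)}: SE = {standard_error(small_sample):.3f}")
print(f"N={len(large_sample)}: SE = {standard_error(large_sample):.3f}")
```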

47
For example.
48
Confidence intervals
  • The standard error makes most sense when we use
    it to compute a confidence interval around the
    mean
  • For Canada, in the previous example
  • 95% upper confidence limit (UCL)
      • (mean) + 1.96 × (standard error)
      • 29,769 + 1.96 × 19 = 29,769 + 37.24 = 29,806.24
  • 95% lower confidence limit (LCL)
      • (mean) - 1.96 × (standard error)
      • 29,769 - 1.96 × 19 = 29,769 - 37.24 = 29,731.76
  • 1.96 is the Z-statistic that represents 95% of a
    normal distribution
  • The handout contains computed confidence
    intervals for each of the provinces (p.22)
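
The Canada figures can be reproduced with a few lines of Python. The mean and standard error are the values quoted on the slide; the variable names are chosen here for illustration.

```python
mean = 29769          # mean total income for Canada (from the slide)
standard_error = 19   # standard error of that mean (from the slide)
z = 1.96              # Z value covering 95% of a normal distribution

margin = z * standard_error
lower = mean - margin   # lower confidence limit (LCL)
upper = mean + margin   # upper confidence limit (UCL)

print(f"margin = {margin:.2f}")
print(f"95% confidence interval: {lower:,.2f} to {upper:,.2f}")
```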

49
How do we interpret this?
  • if we draw repeated random samples from the same
    population and compute a 95% confidence interval
    from each, about 95% of those intervals will
    contain the population mean; for this sample the
    interval runs from 29,732 to 29,806
  • this is not the same as saying that there is a
    95% probability that the population mean falls
    within those two particular limits.

50
Using microdata
  • Statistical packages such as SAS, SPSS, etc. will
    compute the mean, the standard deviation,
    standard error, etc. for continuous variables
  • SDA will only report these measures for variables
    with fewer than 8,000 values

51
(No Transcript)
52
For the following example
53
How many values?
  • What is the maximum number of values for the
    wagesp variable? For the totincp variable?
  • If you were to do a frequency distribution of
    wagesp, how many rows might you have in the
    distribution?
  • How would you go about finding out what the modal
    category for wagesp is?

54
Some final points about variables
  • they are not immutable or un-changeable
  • continuous variables can be changed into ordinal,
    or even nominal variables
  • ordinal variables can be changed into nominal
    variables, and
  • nominal variables can be collapsed still further
  • nominal and ordinal variables can be combined to
    create indices or scales
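
Recoding of this kind is straightforward in most software. The sketch below, in Python, turns a continuous age into an ordinal age group and then collapses it into two categories; the cut points, labels, and function names are all hypothetical choices made for this example.

```python
from bisect import bisect_right

# Hypothetical cut points: the lower bound of each group after the first.
cut_points = [15, 25, 65]
labels = ["0-14", "15-24", "25-64", "65+"]

def age_group(age):
    """Recode a continuous age into an ordinal age group."""
    return labels[bisect_right(cut_points, age)]

def is_senior(age):
    """Collapse further into a two-category variable."""
    return "senior" if age >= 65 else "non-senior"

ages = [3, 15, 24, 40, 65, 90]
print([age_group(a) for a in ages])
print([is_senior(a) for a in ages])
```

Recoding in this direction loses information: the original ages cannot be recovered from the groups.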

55
Describing relationships among two or more
variables
  • Objectives of describing multivariate
    relationships
      • description
      • improved ability to predict the value of a
        variable for a case, by using the value of
        another variable (or variables)
      • examine causation, which requires
          • covariation
          • the causal variable must occur before the
            outcome variable in time (temporal
            precedence)
          • a non-spurious relationship

56
Variables are either
  • Dependent (aka outcome variables) or
  • Independent (e.g., causal variables)

57
The same variable can be an independent variable
in one hypothesis, and a dependent variable in
another hypothesis
  • Does gender make a difference in level of
    education?
      • level of education is dependent; gender is
        independent
  • Does level of education make a difference to
    earned income?
      • earned income is dependent; level of
        education is independent
  • Is the effect of education on earned income the
    same for men and women?
      • earned income is dependent; level of
        education is independent, and gender is the
        control variable

58
Statistical measures describe
  • The strength of the relationship between two (or
    more) variables
  • The direction of the relationship between two (or
    more) variables
  • The significance of the relationship between two
    (or more) variables in the sample vis-à-vis the
    population

59
The choice of appropriate statistical technique
is dependent on
  • whether the dependent variable is nominal,
    ordinal, or continuous
  • whether the dependent variable is a dichotomy or
    a polytomy
  • whether the independent variable(s) is a
    dichotomy, a polytomy, or continuous

60
Relationships between two or more nominal or
ordinal variables
  • cross-tabulations
      • a common census output product (all the
        Topic-based tabulations in 2001)
  • strength and direction
      • odds ratios
  • significance
      • chi-square statistic

61
A cross-tabulation
62
Computing the odds ratio
63
Graphed in Excel, it looks like this
64
The chi-square statistic
  • Tells us how likely we are to be wrong if we say
    there is a relationship between these two
    variables
  • Significance (probability of being wrong)
  • Evaluated based on
      • Critical value (found in statistics texts)
      • Degrees of freedom (df = 1)
      • Level of significance (a 5% possibility of
        being wrong is normally acceptable in the
        social sciences)

65
So we know
  • Direction and strength: from graphing the odds
  • Significance: from a chi-square statistic
      • Can be computed in, e.g., Excel
      • For the previous table, chi-square = 2.97 (see
        handout page 30)
      • Critical value = 3.84 with 1 degree of freedom,
        at .05 significance (i.e., probability of being
        wrong)
      • This table is not statistically significant, so
        there is a good chance that the apparent
        relationship is the result of chance or, e.g.,
        measurement error
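
The chi-square computation itself is short enough to sketch in Python. The counts below are hypothetical (the handout's table is not reproduced in this transcript); only the critical value of 3.84 and the statistic of 2.97 come from the slide.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical 2x2 counts; not the table from the handout.
table = [[10, 20], [30, 40]]
CRITICAL_VALUE = 3.84   # df = 1 at the .05 level, as quoted on the slide

statistic = chi_square(table)
significant = statistic > CRITICAL_VALUE
print(f"chi-square = {statistic:.2f}, significant at .05: {significant}")
```

By the same rule, the slide's value of 2.97 falls below 3.84 and is not significant. For a table with r rows and c columns the degrees of freedom are (r - 1) × (c - 1), which is 1 for a 2×2 table.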

66
Adding a control variable in Beyond 20/20
67
Computing the conditional odds
68
And graphed in Excel, it looks like this
69
The original 1996 2-way table looked like this
70
But when we added visible minority as a control
variable
  • it ended up looking like this

71
(No Transcript)
72
Did you notice?
  • In the 1996 data, when we looked only at
    immigrants versus non-immigrants, the
    non-immigrants were more likely to be employed
  • When we break it out by visible minority status,
      • Non-visible-minority immigrants are more likely
        to be employed than non-immigrants
        (11.73/9.16 = 1.28)
      • Visible minority immigrants are also more
        likely to be employed than non-immigrants
        (6.30/5.52 = 1.14)
  • This is an example of Simpson's paradox
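
The two conditional odds ratios can be recomputed from the odds quoted on the slide. This sketch assumes, from the wording of the slide, that the first number in each pair is the odds of employment for immigrants and the second is the odds for non-immigrants.

```python
# Conditional odds of being employed, as quoted on the slide:
# (immigrants, non-immigrants) within each visible minority status.
conditional_odds = {
    "not a visible minority": (11.73, 9.16),
    "visible minority": (6.30, 5.52),
}

odds_ratios = {}
for status, (immigrant_odds, non_immigrant_odds) in conditional_odds.items():
    odds_ratios[status] = immigrant_odds / non_immigrant_odds
    print(f"{status}: odds ratio = {odds_ratios[status]:.2f}")
```

Both ratios are above 1, so within each group immigrants have the higher odds of employment, even though the combined table pointed the other way.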

73
Using a statistical package to examine these
relationships
  • Automatically computes a chi-square
  • Computes degrees of freedom and probability
  • Is sensitive to sample size (so we need to take a
    subsample)

74
(No Transcript)
75
And produces these results
76
In the table on the previous slide
  • How many cases are there in this subsample?
  • What percentage overall are unemployed?
  • What percentage of immigrants are unemployed?
  • Is this different from the percentage of
    non-immigrants that are unemployed?
  • What percentage are immigrants? How would you
    find out?

77
Now, you compute the tables for the visible
minority groups
78
And tell me what you found
  • How many tables are produced?
  • What is the difference between them?
  • How many of these tables are statistically
    significant (based on the chi-square)?
  • In which two groups is the relationship the
    opposite of the majority of the groups?

79
Comparison of means
  • Relationship between a continuous dependent
    variable and a dichotomous independent variable
  • Significance
      • Student's t-statistic (similar to Z-statistic)
  • Example: average employment income for visible
    minorities versus all others (2001 census
    Dimensions series)
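
For readers who want to see the arithmetic, here is a minimal sketch of the pooled-variance Student's t-statistic for two independent groups. The income values are invented, and a statistical package would also report the degrees of freedom and a probability, which this sketch omits.

```python
import math
import statistics

def students_t(group_a, group_b):
    """Pooled-variance Student's t-statistic for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error

# Hypothetical employment incomes, in thousands, for two groups.
group_a = [30, 32, 34, 36, 38]
group_b = [28, 30, 32, 34, 36]

t = students_t(group_a, group_b)
print(f"t = {t:.2f}")
```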

80
(No Transcript)
81
Visible minority status and employment income
  • In the 1996 census, a Dimensions series table
    showed visible minority status and employment
    income
  • This table is not available in the 2001 census
  • The relationship can only be examined using
    microdata, or by requesting a custom tabulation

82
(No Transcript)
83
And the result is
84
Computing 95% confidence intervals
85
ANOVA (analysis of variance)
  • When the dependent variable is a continuous
    variable, and the independent variable is a
    polytomy
  • Significance
      • F-ratio
      • Eta-squared (amount of explained variance)
  • Example: mean earned income for individual
    visible minorities
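
The F-ratio and eta-squared both come from splitting the total variation into a between-group and a within-group part. The sketch below shows that split for three small hypothetical groups; it is an illustration of the arithmetic, not of any particular package's output.

```python
def one_way_anova(groups):
    """Return (F-ratio, eta-squared) for a list of groups of values."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the cases around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_ratio, eta_squared

# Hypothetical incomes for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, eta_squared = one_way_anova(groups)
print(f"F = {f_ratio:.2f}, eta-squared = {eta_squared:.4f}")
```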

86
(No Transcript)
87
And the result is
88
And the associated statistical measures
89
Rerunning with highest level of schooling
90
And the associated statistical tests
91
Pairwise relationships among continuous and
dichotomous variables
  • Correlations
  • Strength and direction
      • Pearson correlation coefficient
  • Significance
      • Z-statistics for each pair
      • (r-to-Z transformation table)
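
The Pearson correlation coefficient can be written out by hand in a few lines. The schooling and income values below are hypothetical, and the function name is chosen for this sketch; the sign of r gives the direction of the relationship and its size (between -1 and 1) gives the strength.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(ss_x * ss_y)

years_of_schooling = [8, 10, 12, 14, 16]   # hypothetical
income = [20, 24, 31, 38, 42]              # hypothetical, in thousands

r = pearson_r(years_of_schooling, income)
print(f"r = {r:.3f}")
```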

92
The correlation matrix
93
Combined effects of one or more independent
variables on a continuous dependent variable
  • Regression analysis
  • Categorical variables must be recoded to dummy
    variables
  • Strength and direction
      • t-statistic for each independent variable
  • Significance
      • F-ratio
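
Dummy coding means replacing one categorical variable with a set of 0/1 variables, one per category, leaving out a reference category. A minimal sketch follows; the function name, the `is_` prefix, and the marital status values are all made up for illustration.

```python
def dummy_code(values, reference):
    """Recode a categorical variable into 0/1 dummies, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

marital_status = ["single", "married", "widowed", "married"]   # hypothetical
rows = dummy_code(marital_status, reference="single")

for status, row in zip(marital_status, rows):
    print(status, row)
```

A case in the reference category has 0 on every dummy, so the regression coefficients for the dummies are read as differences from that category.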

94
(No Transcript)
95
(No Transcript)
96
(No Transcript)
97
Techniques we've covered
98
What you need to remember
  • To examine the relationships between 2 or more
    variables in Beyond 20/20, the variables must be
    in the same file and in different dimensions
  • If no table exists with those variables, the
    alternative is to use a relevant microdata file
      • to generate the table (or request a custom
        tabulation from Statistics Canada)
      • to compute the more complicated measures of
        association and significance

99
What you need to remember (cont'd)
  • A user who needs to do correlations or regression
    analysis needs continuous outcome (dependent)
    variables
  • Information about statistical techniques is
    readily available on the WWW