Graphical Methods for Complex Surveys - PowerPoint PPT Presentation

1 / 99
About This Presentation
Title:

Graphical Methods for Complex Surveys

Description:

Body Mass Index. measured by. weight in kilograms divided by square of height ... actual body mass index, age, gender, marital status, smoking habits, drinking ... – PowerPoint PPT presentation

Number of Views:94
Avg rating:3.0/5.0
Slides: 100
Provided by: drB67
Category:

less

Transcript and Presenter's Notes

Title: Graphical Methods for Complex Surveys


1
(No Transcript)
2
Types of Surveys
  • Cross-sectional
  • surveys a specific population at a given point in
    time
  • will have one or more of the design components
  • stratification
  • clustering with multistage sampling
  • unequal probabilities of selection
  • Longitudinal
  • surveys a specific population repeatedly over a
    period of time
  • panel
  • rotating samples

3
Cross Sectional Surveys
  • Sampling Design Terminology

4
Methods of Sample Selection
  • Basic methods
  • simple random sampling
  • systematic sampling
  • unequal probability sampling
  • stratified random sampling
  • cluster sampling
  • two-stage sampling

5
Simple Random Sampling
  • Why?
  • basic building block of sampling
  • sample from a homogeneous group of units
  • How?
  • physically make draws at random of the units
    under study
  • computer selection methods R, Stata

6
Systematic Sampling
  • Why?
  • easy
  • can be very efficient depending on the structure
    of the population
  • How?
  • get a random start in the population
  • sample every kth unit for some chosen number k

7
Additional Note
  • Simplifying assumption
  • in terms of estimation a systematic sample is
    often treated as a simple random sample
  • Key assumption
  • the order of the units is unrelated to the
    measurements taken on them

8
Unequal Probability Sampling
  • Why?
  • may want to give greater or lesser weight to
    certain population units
  • two-stage sampling with probability proportional
    to size at the first stage and equal sample sizes
    at the second stage provides a self-weighting
    design (all units have the same chance of
    inclusion in the sample)
  • How?
  • with replacement
  • without replacement

9
With or Without Replacement?
  • in practice sampling is usually done without
    replacement
  • the formula for the variance based on without
    replacement sampling is difficult to use
  • the formula for with replacement sampling at the
    first stage is often used as an approximation
  • Assumption the population size is large and the
    sample size is small sampling fraction is less
    than 10

10
Stratified Random Sampling
  • Why?
  • for administrative convenience
  • to improve efficiency
  • estimates may be required for each stratum
  • How?
  • independent simple random samples are chosen
    within each stratum

11
Example Survey of Youth in Custody
  • first U.S. survey of youths confined to
    long-term, state-operated institutions
  • complemented existing Children in Custody
    censuses.
  • companion survey to the Surveys of State Prisons
  • the data contain information on criminal
    histories, family situations, drug and alcohol
    use, and peer group activities
  • survey carried out in 1989 using stratified
    systematic sampling

12
SYC Design
  • strata
  • type (a) groups of smaller institutions
  • type (b) individual larger institutions
  • sampling units
  • strata type (a)
  • first stage institution by probability
    proportional to size of the institution
  • second stage individual youths in custody
  • strata type (b)
  • individual youths in custody
  • individuals chosen by systematic random sampling

13
Cluster Sampling
  • Why?
  • convenience and cost
  • the frame or list of population units may be
    defined only for the clusters and not the units
  • How?
  • take a simple random sample of clusters and
    measure all units in the cluster

14
Two-Stage Sampling
  • Why?
  • cost and convenience
  • lack of a complete frame
  • How?
  • take either a simple random sample or an unequal
    probability sample of primary units and then
    within a primary take a simple random sample of
    secondary units

15
Synthesis to a Complex Design
  • Stratified two-stage cluster sampling
  • Strata
  • geographical areas
  • First stage units
  • smaller areas within the larger areas
  • Second stage units
  • households
  • Clusters
  • all individuals in the household

16
Why a Complex Design?
  • better cover of the entire region of interest
    (stratification)
  • efficient for interviewing less travel, less
    costly
  • Problem estimation and analysis are more complex

17
Ontario Health Survey
  • carried out in 1990
  • health status of the population was measured
  • data were collected relating to the risk factors
    associated with major causes of morbidity and
    mortality in Ontario
  • survey of 61,239 persons was carried out in a
    stratified two-stage cluster sample by Statistics
    Canada

18
OHSSample Selection
  • strata public health units divided into rural
    and urban strata
  • first stage enumeration areas defined by the
    1986 Census of Canada and selected by pps
  • second stage dwellings selected by SRS
  • cluster all persons in the dwelling

19
Longitudinal Surveys
  • Sampling Design

20
Schematic Representation
21
Schematic Representation
22
British Household Panel Survey
  • Objectives of the survey
  • to further understanding of social and economic
    change at the individual and household level in
    Britain
  • to identify, model and forecast such changes,
    their causes and consequences in relation to a
    range of socio-economic variables.

23
BHPS Target Population and Frame
  • Target population
  • private households in Great Britain
  • Survey frame
  • small users Postcode Address File (PAF)

24
BHPS Panel Sample
  • designed as an annual survey of each adult (16)
    member of a nationally representative sample
  • 5,000 households approximately
  • 10,000 individual interviews approximately.
  • the same individuals are re-interviewed in
    successive waves
  • if individuals split off from original
    households, all adult members of their new
    households are also interviewed.
  • children are interviewed once they reach the age
    of 16
  • 13 waves of the survey from 1991 to 2004

25
BHPS Sampling Design
  • Uses implicit stratification embedded in
    two-stage sampling
  • postcode sector ordered by region
  • within a region postcode sector ordered by
    socio-economic group as determined from census
    data and then divided into four or five strata
  • Sample selection
  • systematic sampling of postcode sectors from
    ordered list
  • systematic sampling of delivery points (
    addresses or households)

26
BHPS Schema for Sampling
27
Survey Weights
28
Survey Weights Definitions
  • initial weight
  • equal to the inverse of the inclusion probability
    of the unit
  • final weight
  • initial weight adjusted for nonresponse,
    poststratification and/or benchmarking
  • interpreted as the number of units in the
    population that the sample unit represents

29
Interpretation
  • Interpretation
  • the survey weight for a particular sample unit is
    the number of units in the population that the
    unit represents

30
  • Effect of the Weights
  • Example age distribution, Survey of Youth in
    Custody

31
Unweighted Histogram
32
Weighted Histogram
33
Weighted versus Unweighted
34
Observations
  • the histograms are similar but significantly
    different
  • the design probably utilized approximate
    proportional allocation
  • the distribution of ages in the unweighted case
    tends to be shifted to the right when compared to
    the weighted case
  • older ages are over-represented in the dataset

35
Survey Data Analysis
  • Issues and Simple Examples from Graphical Methods

36
Basic Problem in Survey Data Analysis
37
Issues
  • iid (independent and identical distribution)
    assumption
  • the assumption does not not hold in complex
    surveys because of correlations induced by the
    sampling design or because of the population
    structure
  • blindly applying standard programs to the
    analysis can lead to incorrect results

38
Example Rank Correlation Coefficient
  • Pay equity survey dispute Canada Post and PSAC
  • two job evaluations on the same set of people
    (and same set of information) carried out in 1987
    and 1993
  • rank correlation between the two sets of job
    values obtained through the evaluations was 0.539
  • assumption to obtain a valid estimate of
    correlation pairs of observations are iid

39
Scatterplot of Evaluations
  • Rank correlation is 0.539

40
A Stratified Design with Distinct Differences
Between Strata
  • the pay level increases with each pay category
    (four in number)
  • the job value also generally increases with each
    pay category
  • therefore the observations are not iid

41
Scatterplot by Pay Category
42
Correlations within Level
  • Correlations within each pay level
  • Level 2 0.293
  • Level 3 0.010
  • Level 4 0.317
  • Level 5 0.496
  • Only Level 4 is significantly different from 0

43
Graphical Displays
  • first rule of data analysis
  • always try to plot the data to get some initial
    insights into the analysis
  • common tools
  • histograms
  • bar graphs
  • scatterplots

44
Histograms
  • unweighted
  • height of the bar in the ith class is
    proportional to the number in the class
  • weighted
  • height of the bar in the ith class is
    proportional to the sum of the weights in the
    class

45
Body Mass Index
  • measured by
  • weight in kilograms divided by square of height
    in meters
  • 7.0 lt BMI lt 45.0
  • BMI lt 20 health problems such as eating
    disorders
  • BMI gt 27 health problems such as hypertension
    and coronary heart disease

46
BMI Women
47
BMI Men
48
BMI Comparisons
49
Bar Graphs
  • Same principle as histograms
  • unweighted
  • size of the ith bar is proportional to the number
    in the class
  • weighted
  • size of the ith bar is proportional to the sum of
    the weights in the class

50
Ontario Health Survey
51
Scatterplots
  • unweighted
  • plot the outcomes of one variable versus another
  • problem in complex surveys
  • there are often several thousand respondents

52
(No Transcript)
53
Solution
  • bin the data on one variable and find a
    representative value
  • at a given bin value the representative value for
    the other variable is the weighted sum of the
    values in the bin divided by the sum of the
    weights in the bin

54
(No Transcript)
55
Bubble Plots
  • size of the circle is related to the sum of the
    surveys weights in the estimate
  • more data in the BMI range 17 to 29 approximately

56
Computing Packages
  • STATA and R

57
Available Software for Complex Survey Analysis
  • commercial Packages
  • STATA
  • SAS
  • SPSS
  • Mplus
  • noncommercial Package
  • R

58
STATA
  • defining the sampling design svyset
  • example
  • svyset pweightindiv_wt, strata(newstrata)
    psu(ea) vce(linear)
  • output

pweight indiv_wt VCE linearized Strata
1 newstrata SU 1 ea FPC 1 ltzerogt
59
R survey package
  • define the sampling design svydesign
  • wk1delt- svydesign(idea,stratanewstrata,weight
    indiv_wt,nestT,datawork1)
  • output

gt summary(wk1de) Stratified 1 - level Cluster
Sampling design With (1860) clusters. svydesign(id
ea, strata newstrata, weight indiv_wt,
nest T, data work1)
60
Syntax
  • STATA
  • svy estimate
  • Example least squares estimation
  • svyset pweightindiv_wt, strata(newstrata)
    psu(ea)
  • svy regress dbmi bmi
  • R
  • svy(, design, data, ...)
  • Example least squares estimation
  • wk2delt-svydesign(idea,stratanewstrata,weight
    indiv_wt,nestT,datawork2)
  • svyglm(dbmibmi, datawork2,designwk2de)

61
Available Survey Commands
62
Survey Data Analysis
  • Contingency Tables
  • and
  • Issues of Estimation of Precision

63
General Effect of Complex Surveys on Precision
  • stratification decreases variability (more
    precise than SRS)
  • clustering increases variability (less precise
    than SRS)
  • overall, the multistage design has the effect of
    increasing variability (less precise than SRS)

64
Illustration Using Contingency Tables
  • two categorical variables that can be set out in
    I rows and J columns
  • can get a survey estimate of the proportion of
    observations in the cell defined by the ith row
    and jth column

65
Example Ontario Health Survey
  • rows five levels describing levels of happiness
    that people feel
  • columns four levels describing the amount of
    stress people feel
  • Is there an association between stress and
    happiness?

66
STATA Commands
67
STATA Output
  • table on stress and happiness
  • estimated proportions in the table with test
    statistic

68
Possible Test Statistics
  • adapt the classical test statistic
  • need the sampling distribution of the statistic
  • Wald Test
  • need an estimate of the variance-covariance matrix

69
Estimation of Variance or Precision
  • variance estimation with complex multistage
    cluster sample design
  • exact formula for variance estimation is often
    too complex use of an approximate approach
    required
  • NOTE taking account of the design in variance
    estimation is as crucial as using the sampling
    weights for the estimation of a statistic

70
Some Approximate Methods
  • Taylor series methods
  • Replication methods
  • Balanced Repeated Replication (BRR)
  • Jackknife
  • Bootstrap

71
Replication Methods
  • you can estimate the variance of an estimated
    parameter by taking a large number of different
    subsamples from your original sample
  • each subsample, called a replicate, is used to
    estimate the parameter
  • the variability among the resulting estimates is
    used to estimate the variance of the full-sample
    estimate
  • covariance between two different parameter
    estimates is obtained from the covariance in
    replicates
  • the replication methods differ in the way the
    replicates are built

72
Assumptions
  • The resulting distribution of the test statistic
    is based on having a large sample size with the
    following properties
  • the total number of first stage sampled clusters
    (or primary sampling units) is assumed large
  • the primary sample size in each stratum is small
    but the number of strata is large
  • the number of primary units in a stratum is large
  • no survey weight is disproportionately large

73
Possible Violations of Assumptions
  • the complex survey (stratified two-sample
    sampling, for example) was done on a relatively
    small scale
  • a large-scale survey was done but inferences are
    desired for small subpopulations
  • stratification in which a few strata (or just
    one) have very small sampling fractions compared
    to the rest of the strata
  • The sampling design was poor resulting in large
    variability in the sampling weights

74
Survey Data Analysis
  • Linear and Logistic
  • Regression

75
General Approach
  • form a census statistic (model estimate or
    expression or estimating equation)
  • for the census statistic obtain a survey estimate
    of the statistic
  • the analysis is based on the survey estimate

76
Regression
  • Use of ordinary least squares can lead to
  • badly biased estimates of the regression
    coefficients if the design is not ignorable
  • underestimation of the standard errors of the
    regression coefficient if clustering (and to a
    lesser extent the weighting) is ignored

77
Example Ontario Health Survey
  • Regress desired body mass index (DBMI) on body
    mass index (BMI)

78
Simple Linear Regression Model
  • typical regression model
  • linear relationship plus random error
  • errors are independent and identically
    distributed

79
Census Statistic
  • census estimate of the slope parameter ?
  • Problem the assumption of independent errors in
    the population does not hold
  • Solution the least squares estimate is a
    consistent estimate of the slope ?

80
Survey Estimate
  • the census estimate B is now the parameter of
    interest
  • the survey estimate is given by
  • estimate obtained from an estimating equation
  • the estimate of variance cannot be taken from the
    analysis of variance table in the regression of y
    on x using either a weighted or unweighted
    analysis

81
Variance Estimation
  • Again, estimate of the variance of b is obtained
    from one of the following procedures
  • Taylor linearization
  • Jackknife
  • BRR
  • Bootstrap

82
Issues in Analysis
  • application of the large sample distributional
    results
  • small survey
  • regression analysis on small domains of interest
  • multicollinearity
  • survey data files often have many variables
    recorded that are related to one another

83
Multicollinearity Example Ontario
Health Survey
  • Two regression models regress desired body mass
    index on
  • actual body mass index, age, gender, marital
    status, smoking habits, drinking habits, and
    amount of physical activity
  • all of the above variables plus interaction
    terms marital status by smoking habits, marital
    status by drinking habits, physical activity by
    age

84
Partial STATA Output
  • No interaction terms
  • Interaction terms present

85
Comparison of Domain Means
  • Domains and Strata
  • both are nonoverlapping parts or segments of a
    population
  • usually a frame exists for the strata so that
    sampling can be done within each stratum to
    reduce variation
  • for domains the sample units cannot be separated
    in advance of sampling
  • Inferences are required for domains.

86
Regression Approach
  • use the regression commands in STATA and declare
    the variables of interest to be categorical
  • example DBMI relative to BMI related to sex and
    happiness index
  • STATA commands

87
STATA Output
88
Logistic Regression
  • probability of success pi for the ith individual
  • vector of covariates xi associated with ith
    individual
  • dependent variable must be 0 or 1, independent
    variables xi can be categorical or continuous
  • Does the probability of success pi depend on the
    covariates xi and in what way?

89
Census Parameter
  • Obtained from the logistic link function
  • and the census likelihood equation for the
    regression parameters
  • Note it is the log odds that is being modeled in
    terms of the covariate

90
Example Ontario Health Survey
  • How does the chance of suffering from
    hypertension depend on
  • body mass index
  • age
  • gender
  • smoking habits
  • stress
  • a well-being score that is determined from
    self-perceived factors such as the energy one
    has, control over emotions, state of morale,
    interest in life and so on

91
STATA Commands
92
STATA Output part I
93
STAT Output part II
94
GEE Generalized Estimating Equations
  • Dependent or response variable
  • well-being measured on a 0 to 10 scale
  • focus is on women only
  • Independent or explanatory variables
  • has responsibility for a child under age 12 (yes
    1, no 2)
  • marital status (married 1, separated 2,
    divorced 3, never married 5 widowed removed
    from the dataset)
  • employment status (employed 1, unemployed 2,
    family care 3)
  • STATA syntax
  • tsset pid year, yearly
  • xi xtgee wellbe i.mlstat i.job i.child i.sex
    pweight axrwght, family(poisson)
    link(identity) corr(exchangeable)

95
GEE Results
96
For each type of initial marital status
  • Married
  • Separated or divorced
  • Never married

97
Cox Proportional Hazards Model
  • Dependent or outcome variable
  • time to breakdown of first marriage
  • Independent or explanatory variables
  • gender
  • race (white/non-white)
  • Age in 1991 (restricted to 18 60)
  • financial position comfortable1, doing
    alright2, just about getting by3, quite
    difficult4, very difficult 5

98
STATA Commands
  • Command for survival data set up
  • Command for Cox proportional hazards mode

99
STATA Output
Write a Comment
User Comments (0)
About PowerShow.com