Title: Correcting for Self-Selection Bias Using the Heckman Selection Correction in a Valuation Survey Using Knowledge Networks
1 Correcting for Self-Selection Bias Using the Heckman Selection Correction in a Valuation Survey Using Knowledge Networks
- Presented at the 2005 Annual Meeting of the American Association for Public Opinion Research
- Trudy Cameron, University of Oregon
- J. R. DeShazo, UCLA
- Mike Dennis, Knowledge Networks
2 Motivating Insights
- It is possible to have a 20% response rate, yet still have a representative sample
- It is possible to have an 80% response rate, yet still have a very non-representative sample
- Need to know:
  - What factors affect response propensity
  - Whether response propensity is correlated with the survey outcome of interest
3 Research Questions
- Can the inferences from our two samples of respondents from the Knowledge Networks Inc. (KN) consumer panel be generalized to the population?
- Do observed/unobserved factors that affect the odds of a respondent being in the sample also affect the answers that he/she gives on the survey?
4 What we find
- Insignificant selectivity using one method
- Significant, but very tiny, selectivity using an alternative method
- A bit disappointing for us (no sensational results, reduced publication potential)
- Probably reassuring for Knowledge Networks, since our samples appear to be reasonably representative
5 Heckman Selectivity Correction Intuition: Example 1
- Suppose your sample matches the US population on observables: age, gender, income
- Survey is about government regulation
- Suppose liberals are more likely to fill out surveys, but there are no data on political ideology for non-respondents (unobserved heterogeneity)
- Sample will have a disproportionate number of liberals
- Your sample is likely to overstate sympathy for government regulation (sample selection bias)
6 Heckman Selectivity Correction Intuition: Example 2
- Suppose your sample matches the US population on observables: age, gender, income
- Survey is about WTP (willingness to pay) for health programs
- People who are fearful about their future health are more likely to respond, but there are no data for anyone on fearfulness (the salience of health programs)
- Sample will tend to overestimate WTP for health programs
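The intuition in the two examples can be checked with a small simulation. This is only a sketch under stated assumptions: the outcome scale, the 50% response rule, and the error correlation `rho` are all hypothetical and are not taken from the study. An unobserved trait `u` drives the decision to respond, and the outcome error is correlated with `u`.

```python
import random
import statistics

def simulate_selection_bias(n=20000, rho=0.5, seed=0):
    """Simulate a survey where an unobserved trait drives both response and outcome.

    All values are hypothetical; rho is the assumed correlation between the
    response-equation error (u) and the outcome-equation error (e).
    """
    rng = random.Random(seed)
    population, respondents = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unobserved trait, e.g. ideology or fearfulness
        e = rho * u + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        outcome = 5.0 + e        # hypothetical attitude score
        population.append(outcome)
        if u > 0.0:              # roughly half of the population responds
            respondents.append(outcome)
    return statistics.fmean(population), statistics.fmean(respondents)

pop_mean, resp_mean = simulate_selection_bias(rho=0.5)
no_corr_pop, no_corr_resp = simulate_selection_bias(rho=0.0)
print(f"rho=0.5: population mean {pop_mean:.2f}, respondent mean {resp_mean:.2f}")
print(f"rho=0.0: population mean {no_corr_pop:.2f}, respondent mean {no_corr_resp:.2f}")
```

With correlated errors the respondent mean overstates the population mean; with `rho=0.0` the same response rate produces essentially no bias, which is the point of the 20%-versus-80% response rate comparison.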
7 Level curves,
8 Level curves,
9 The selection process
- Several phases of attrition (transitions) in KN samples. Assume RDD (random-digit dialing) = random
  - RDD → recruited
  - Recruited → profiled
  - Profiled → active at time of sampling
  - Active → drawn for survey sample
  - Drawn → member of estimating sample
- Can explore a different selection process for each of these transitions
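The chain of transitions can be summarized as a conditional rate at each step and a cumulative rate relative to the initial RDD pool. The sketch below uses made-up counts purely for illustration; they are not the study's dispositions, and the function name is hypothetical.

```python
# Hypothetical counts at each stage; NOT the study's actual dispositions.
stages = [
    ("RDD contact", 10000),
    ("recruited", 4000),
    ("profiled", 3000),
    ("active at time of sampling", 2400),
    ("drawn for survey sample", 1200),
    ("member of estimating sample", 900),
]

def transition_rates(stages):
    """Return (label, conditional rate, cumulative rate) for each transition."""
    start = stages[0][1]
    rows = []
    for (prev_name, prev_n), (name, n) in zip(stages, stages[1:]):
        # Conditional rate: share of the previous stage that survives this step.
        # Cumulative rate: share of the initial RDD pool still present.
        rows.append((f"{prev_name} -> {name}", n / prev_n, n / start))
    return rows

rates = transition_rates(stages)
for label, conditional, cumulative in rates:
    print(f"{label}: conditional {conditional:.3f}, cumulative {cumulative:.3f}")
```

The cumulative rate at the last step equals the product of the conditional rates, so a separate selection model can be fitted to each transition and the pieces combined.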
10 Exploit RDD telephone numbers
- Recruits can be asked their addresses
- Some non-recruit numbers matched to addresses using reverse directories
- Phone numbers with no street address? Matched (approximately) to best census tract using the geographic extent of the telephone exchange
- Link to other geocoded data
11 Panel protection during geocoding
- Using dummy identifiers, match street addresses to relevant census tract (or telephone exchange to tract, courtesy Dale Kulp at MSG), return data to KN
- Get back from KN the pool of initial RDD contacts, minus all confidential addresses, with our respondent case IDs restored
- Merge with auxiliary data about census tract attributes and voting behaviors
12 Cameron and Crawford (2004)
- Data for each of the 65,000 census tracts in the year 2000 U.S. census
- 95 count variables for different categories
- Convert to proportions of population (or households, or dwellings)
- Factor analysis: 15 orthogonal factors that together account for 88% of variation in sociodemographic characteristics across tracts
13 Categories of Census Variables
- Population density
- Ethnicity, Gender, Age distribution
- Family structure
- Housing occupancy status, Housing characteristics
- Urbanization, Residential mobility
- Linguistic isolation
- Educational Attainment
- Disabilities
- Employment Status
- Industry, Occupation, Type of income
14 Labels: 15 orthogonal factors
"well-to-do prime"          "elderly disabled"
"well-to-do seniors"        "rural farm., self-employ."
"single renter twenties"    "low mobil., stable neigh."
"unemployed"                "Native American"
"minority single moms"      "female"
"thirty-somethings"         "health-care workers"
"working-age disabled"      "asian-hispanic-multi, language isolation"
"some college, no grad"
15 2000 Presidential Voting Data
- Leip (2004), Atlas of U.S. Presidential Elections: vote counts for each county
- Use percentages of county votes for Gore and for Nader, versus Bush and others (omitted category)
- These vote shares will not be orthogonal to our 15 census factors
16 Empirical Illustrations
- 1. Analysis of government question in public interventions sample, by naïve OLS and via preferred Heckman model
- 2. Analysis of selection processes leading to private interventions sample
  - Marginal selection probabilities
  - Conditional selection probabilities
  - Allow marginal utilities to depend on propensity to respond to survey
17 Analysis 1: Heckman Selectivity Model
- Public intervention sample
- Find an outcome variable that:
  - Measures an attitude that may be relevant to other research questions
  - Can be treated as cardinal and continuous (although it is actually discrete and ordinal)
  - Can be modeled (naively) by OLS methods
  - Can be generalized into a two-equation FIML (full-information maximum likelihood) selectivity model
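For reference, the two-equation selectivity model can be written in its standard textbook form. The notation below is an assumption for illustration (the slides do not define symbols): $z_i$ holds the selection covariates, $x_i$ the outcome covariates, and $\rho$ is the error correlation.

```latex
\begin{aligned}
\text{Selection:}\quad & s_i^* = z_i'\gamma + u_i, \qquad s_i = \mathbf{1}[\,s_i^* > 0\,] \\
\text{Outcome:}\quad & y_i = x_i'\beta + \varepsilon_i, \qquad y_i \text{ observed only if } s_i = 1 \\
& \begin{pmatrix} u_i \\ \varepsilon_i \end{pmatrix} \sim
  N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^2 \end{pmatrix} \right) \\
& E[\,y_i \mid x_i, z_i, s_i = 1\,] = x_i'\beta + \rho\sigma\,\frac{\phi(z_i'\gamma)}{\Phi(z_i'\gamma)}
\end{aligned}
```

Naive OLS on respondents omits the last term, so it is biased unless $\rho = 0$. FIML estimates $\gamma$, $\beta$, $\sigma$ and $\rho$ jointly, and a test of $\rho = 0$ is a test for selectivity.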
18 Government Involvement in Regulating Env., Health, Safety?
19 Heckman correction model: bias not statistically significant
- Fail to reject zero error correlation at the 5% level and at the 10% level (but close)
- Point estimate of error correlation: 0.10
- May be more likely to respond if approve of govt reg
- Interpretation: insufficient signal-to-noise to conclude that there is non-random selection
- Reassuring, but could stem from noise due to:
  - Census tract factors, county votes rather than individual characteristics
  - Treating ordinal ratings as cardinal and continuous
20 Implications for govt variable
21 Analysis 2: Conditional Logit Choice Models
- Very attractive properties for analyzing multiple discrete choices, but
- No established methods for joint modeling of selection propensity and outcomes in the form of 3-way choices
22 Testable Hypothesis
- Do the marginal utilities of key attributes depend upon the fitted selection index (or the fitted selection probability)?
- If yes, then observable heterogeneity in the odds of being in the sample contributes to heterogeneity in the apparent preferences in the estimating sample
- If no, greater confidence in representativeness of the estimating sample (although still no certainty)
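One way to write this hypothesis down is as a systematically varying parameter in a conditional logit model. The notation is an assumption for illustration: $x_{ijk}$ is attribute $k$ of alternative $j$ for respondent $i$, $\hat{P}_i$ is the fitted selection probability, and $\bar{P}$ is its mean.

```latex
\begin{aligned}
V_{ij} &= \sum_k \beta_{k,i}\, x_{ijk}, \qquad
\Pr(i \text{ chooses } j) = \frac{\exp(V_{ij})}{\sum_{m=1}^{3} \exp(V_{im})} \\
\beta_{k,i} &= \beta_{k0} + \beta_{k1}\,(\hat{P}_i - \bar{P}) \\
H_0 &: \beta_{k1} = 0 \quad \text{(marginal utility of attribute } k \text{ does not vary with response propensity)}
\end{aligned}
```

Under this sketch, rejecting $H_0$ for an attribute corresponds to the "if yes" case, and $\beta_{k0}$ is the marginal utility at the average response propensity.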
23 RDD contact dispositions
24 Results: Response propensity models
- Use response propensity as a shifter on the parameters of conditional logit choice models
- Only one marginal utility parameter (related to the disutility of a sick-year) appears robustly sensitive to selection propensity
- Baseline coefficient is -50 units; average shift coefficient is on the order of 3 units, times a deviation in fitted response probabilities that averages about 0.004
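To see how small this effect is, the figures quoted above can be multiplied out. The sketch uses only the rounded magnitudes from the slide (the sign of the shift coefficient is not stated, so only its size is used); the variable names are illustrative.

```python
baseline = -50.0           # baseline marginal (dis)utility of a sick-year, from the slide
shift_coef = 3.0           # order of magnitude of the shift coefficient, from the slide
typical_deviation = 0.004  # average deviation in fitted response probability

# Typical size of the propensity-driven shift in the coefficient.
typical_shift = shift_coef * typical_deviation
# Same shift expressed as a fraction of the baseline coefficient's magnitude.
relative_shift = typical_shift / abs(baseline)

print(f"typical shift: {typical_shift:.3f} units")
print(f"relative to baseline: {relative_shift:.4%}")
```

The typical shift is about 0.012 units against a baseline of 50 units in magnitude, or roughly 0.02% of the baseline, which is why the selectivity is described as statistically significant but tiny.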
25 Conclusions 1
- For our samples, it is hard to find convincing and robust evidence of substantial sample selection bias in models for outcome variables in two KN samples
- Good news for Knowledge Networks; not so good for us, as researchers, in terms of the publication prospects for this dimension of our work
26 Conclusions 2
- Analysis 1: Insignificant point estimate of bias in distribution of attitudes toward regulation, on the order of 10% too much in favor
- Analysis 2: Statistically significant (but tiny) heterogeneity in key parameters across response propensities in systematically varying parameters models
27 Guidance
- High response rates do not necessarily eliminate biases in survey samples
- Weights help (so sample matches population on observable dimensions), but weights are not necessarily a fix if unobservables that affect response are correlated with the survey outcome
- Cannot tell if correlated unobservables are a problem without doing this type of analysis
- Need to model the selection process explicitly
  - Need to explain differences in response propensities
  - Need analogous data for respondents and for people who do not respond
    - E.g. Census tract factors and county voting percentages
    - Anything else that might capture salience of survey topic (e.g. county mortality rates from same diseases covered in survey, hospital densities)