1. The Care, Feeding, and Training of Survey Statisticians
Sharon L. Lohr
2. Care and Feeding of Iguanas
- Iguana iguana
- Natural sunlight
- Variety of fruits and vegetables
- Water
- Bathing is a good habit
Source: drexotic.com/care_iguanas. Picture from Wikipedia.
3. Care and Feeding of Puppies
- Canis lupus familiaris
- Balanced diet
- Exercise
- Socialization
- Bathing is a good habit
4. Care and Feeding of Survey Statisticians
- Statisticus exemplus repræsentativus
- Balanced diet
- Exercise
- Natural sunlight
- Socialization
- Bathing is a good habit
5. Survey Sampling
[Diagram: survey sampling draws on ethnography, psychology, statistics, management, geography, and lots more]
6. Balanced Diet
- Mathematical and statistical nutrients at university
- Sampling courses
- Other aspects of training and care
7. Essence of Survey Sampling
- How to generalize from seen to unseen?
- Quantify uncertainty about population
- 18th, 19th Century
- Immanuel Kant
- Charles Peirce
- John Venn
- Adolphe Quetelet
- What is P(sun will rise tomorrow)?
8. 1920s and 1930s
- Convenience, judgment samples
- Models (usually not explicitly stated)
- Faith
- Famous example: Literary Digest Survey
- Correct winner, every election 1912-1932
- "Uncanny accuracy"; n = 2.3 million
- 1936: predicted Landon with 55%
- 1936: Roosevelt won with 61%
9. 1940: Probability sampling
- Revolutionary idea: inference is based on random variables for sample inclusion
- Fisher, Neyman, Mahalanobis, Hansen
- Robust, nonparametric approach
10. Probability Sampling
[Figure: population units y1, ..., y9; sampled units have Zi = 1, units not sampled have Zi = 0]
The y's are fixed; the random variables are the Zi.
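The design-based idea on this slide can be illustrated with a minimal sketch. The nine population values below are hypothetical (the slide gives no numbers), and the design is assumed to be an SRS of size 3 without replacement. Because the y's are fixed, the only randomness is which units get Zi = 1, so the expectation of the sample mean can be computed exactly by enumerating every possible sample.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical fixed population values y1..y9 (illustration only, not from the slides).
y = [12, 7, 9, 15, 4, 10, 8, 11, 14]
N, n = len(y), 3


def sample_mean(sample_idx):
    """Mean of y over the units with Z_i = 1 (the sampled units)."""
    return Fraction(sum(y[i] for i in sample_idx), len(sample_idx))


# Under SRS without replacement every subset of size n is equally likely,
# so the design expectation is a plain average over all possible samples.
all_samples = list(combinations(range(N), n))
design_expectation = sum(sample_mean(s) for s in all_samples) / len(all_samples)
population_mean = Fraction(sum(y), N)

# Inclusion probability P(Z_i = 1) for the first unit, found by enumeration.
pi_first = Fraction(sum(1 for s in all_samples if 0 in s), len(all_samples))

print("number of possible samples:", len(all_samples))
print("design expectation equals population mean:", design_expectation == population_mean)
print("inclusion probability equals n/N:", pi_first == Fraction(n, N))
```

No model for the y values is needed anywhere in this calculation, which is the sense in which the approach is nonparametric.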
11. 1960s: Predictive approach
- Use stochastic model about quantity y to predict the values of y not in the sample
- Brewer, Royall, Dorfman, Valliant
- Balanced sampling
- Can model nonresponse
12. Model-based inference
Predict values of y not in sample: Y = f(x) + e
[Figure: population units y1, ..., y9, with the sampled units marked]
Inference depends on the stochastic model
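A minimal sketch of the predictive approach follows. It assumes, purely for illustration, that f(x) = beta * x with constant error variance, that an auxiliary value x is known for every population unit, and that y is observed only for four sampled units. The data, the choice of f, and all variable names are hypothetical.

```python
# Auxiliary variable x, assumed known for all 9 population units (hypothetical data).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# y is observed only for the sampled units (index -> observed value).
y_sample = {0: 2.1, 3: 7.8, 4: 10.3, 7: 15.9}

# Assumed working model: Y = beta * x + e, errors with mean 0 and constant variance.
# Least squares through the origin gives beta_hat = sum(x*y) / sum(x^2) over the sample.
sxy = sum(x[i] * yi for i, yi in y_sample.items())
sxx = sum(x[i] ** 2 for i in y_sample)
beta_hat = sxy / sxx

# Predict y for every unit not in the sample.
unsampled = [i for i in range(len(x)) if i not in y_sample]
y_pred = {i: beta_hat * x[i] for i in unsampled}

# Predicted population total = observed part + predicted part.
total_hat = sum(y_sample.values()) + sum(y_pred.values())

print("beta_hat:", round(beta_hat, 3))
print("predicted population total:", round(total_hat, 3))
```

If the assumed model is wrong for the unsampled units, the predicted total can be biased no matter how the sample was drawn, which is the dependence on the stochastic model that the slide points out.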
13. HHM Volume I (1953)
- Sampling Principles
- Biases, Nonsampling Errors
- Sample Designs
- SRS, Stratified, One- and Two-Stage Cluster Sampling, Stratified Multistage
- Control of Variation in Cluster Size
- Estimating Variances
- Regression Estimates, Double Sampling, Other
- Case Studies
14. HHM Volume II (1953)
- Fundamental Theory of Probability
- Derivations for Chapters of Volume 1
- Response Errors in Surveys
15. What diet are students getting?
- SRS of 80 university programs that offer MS or PhD in Statistics or Biostatistics
- Exclude JPSM, Iowa State, UNC, UNL
- Sampling frame: www.amstat.org listings
- Thank you, Burcu Eke!
16. Basic syllabus
HHM Vol. 1
- Sampling Principles
- Biases, Nonsampling Errors
- Sample Designs
- SRS, Stratified, One- and Two-Stage Cluster Sampling, Stratified Multistage
- Control of Variation in Cluster Size
- Estimating Variances
- Regression Estimates, Double Sampling, Other
- Case Studies
- SRS
- Stratified
- Cluster
- Multistage
- Ratio, regression estimation
17. Beyond Basics
- Replication variance estimation
- Nonresponse models, calibration
- Regression, categorical data
- Spatial sampling
- Adaptive sampling
- Model-based inference
18. SRS of 80 Grad Programs
[Chart: No class: 21; Not offered: 9]
19. Exercise: Analyze Survey Data
- Download data from fedstats.gov
- Codebook, SAS code
- Investigate topics of interest to students
- Graph data
- Multivariate analyses
- Regression, logistic regression, categorical
- Discuss nonsampling errors
- Variance estimation
20. Exercise: Analyze Survey Data
- Cholesterol, obesity (NHANES)
- Predicting number of friends (Add Health)
- Energy-saving systems consumption (Commercial Buildings Energy Consumption Survey)
- Math scores, sex, calculator use (TIMSS)
- Jackknife macros
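The jackknife macros mentioned above are not reproduced here. As a rough idea of what a replication variance estimator does, the following is a minimal Python sketch of the delete-one jackknife, assuming an equal-weight simple random sample and ignoring the finite population correction. The data and function names are hypothetical; jackknife methods for complex surveys typically delete whole primary sampling units within strata and adjust the weights.

```python
import math
import statistics


def jackknife_variance(data, estimator):
    """Delete-one jackknife variance estimate for estimator(data)."""
    n = len(data)
    # Recompute the estimate n times, each time leaving out one observation.
    replicates = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(replicates) / n
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)


# Hypothetical sample values (illustration only).
sample = [3.2, 4.8, 5.1, 2.9, 6.0, 4.4, 5.5, 3.7]

v_jack = jackknife_variance(sample, statistics.mean)
v_textbook = statistics.variance(sample) / len(sample)  # s^2 / n

print("jackknife and textbook variances agree:", math.isclose(v_jack, v_textbook))
```

For the sample mean the jackknife reproduces s^2/n algebraically, which makes it a convenient check before applying the same function to statistics such as ratios, where no simple closed-form variance is available.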
21. Exercise: Design
- Work on all steps of a survey
- Survey center helpful, not necessary
- Take sample from Internet data
- amazon.com
- Treat large data set as population
- IPUMS, baseball
- Compare sampling designs
- Generate nonresponse
22. Exercise: Inferential Framework
- Population: N = 100
- Take SRS of size 30
- X1 = mean of first sample
- Put them back
- Take a second SRS of size 30
- X2 = mean of second sample
- Are X1 and X2 independent?
- Model-, design-based simulations in R
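The slide suggests R for these simulations. The sketch below uses Python's standard library instead and reflects one possible reading of the exercise, with assumptions that are not on the slide: population values drawn from a standard normal model, 4000 replications, and a fixed seed. In the design-based run the population is generated once and held fixed, so the two samples are independent draws from the same list. In the model-based run a new population is generated for every replication, so X1 and X2 share the same realized population. Under this iid model the covariance of X1 and X2 is sigma^2/N and each variance is sigma^2/n, giving a correlation of n/N = 0.3.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


def simulate(fixed_population, reps=4000, N=100, n=30, seed=1):
    """Correlation between the means of two SRSs drawn from the same population."""
    rng = random.Random(seed)
    population = [rng.gauss(0, 1) for _ in range(N)]
    x1s, x2s = [], []
    for _ in range(reps):
        if not fixed_population:
            # Model-based view: the y values themselves are random each time.
            population = [rng.gauss(0, 1) for _ in range(N)]
        s1 = rng.sample(population, n)  # first SRS without replacement
        s2 = rng.sample(population, n)  # units were "put back" before the second SRS
        x1s.append(sum(s1) / n)
        x2s.append(sum(s2) / n)
    return pearson(x1s, x2s)


r_design = simulate(fixed_population=True)
r_model = simulate(fixed_population=False)
print("design-based correlation (fixed y's):", round(r_design, 2))
print("model-based correlation (random y's):", round(r_model, 2))
```

The contrast between the two runs is the point of the exercise: whether X1 and X2 are independent depends on which quantities are treated as random.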
23. Socialization
- Students need to work with people outside statistics
- Socialize with other statisticians
- Exposure to new ideas
- Integrate sampling with other classes
24. Bathing
- Need to cleanse old, crusted concepts
- What are main goals?
- Would I teach this material if starting over?
- Do students really need to work out small samples by hand?
- Want data-centric training
- Problem solvers
25. Sunlight
- Instead of preparing statisticians for survey problems of 1950, look at
- What a survey statistician actually does
- What a survey statistician might need to do in the future
26. Current Research Topics
- Weighting and weight smoothing / trimming
- Computer-intensive variance estimation
- Visualization
- Multi-mode, multi-frame
- Small area, disease mapping
- Nonparametric, robust models for surveys
- Time series / spatial methods
- Record linkage, administrative data
- Confidentiality
- Nonresponse, calibration, imputation
27. Technology and Sampling
- 1940s: Errors in surveys
- Depression, war: need for data
- Sampling: lower cost, fewer errors
- Computing
- 1960s: Telephone, errors, computing
- Measurement error
- Model-based inference
- 1980s: Computing leads to replication variance estimation methods, data analysis
28. 2000s: Internet
- Inexpensive data collection
- But
- Coverage problems
- Nonresponse
- Measurement error
- Opportunity for ingenuity in sample design
HHM, V1, p. 456
29. 1920s and 1930s
- Convenience, judgment samples
- Models (usually not explicitly stated)
- Faith
- Literary Digest Survey
- Claimed accuracy
- Predicted correct winner, 1912-1932
30. 2000s
- ACS, other govt surveys: high quality data
- Volunteer (or paid) online panel polls
- Convenience, judgment samples
- Models (usually not explicitly stated)
- Faith
- Claim accuracy because predicted correct winner in last few elections
- But give margin of error
31. From pollster.com blogs
- September 10, 2009
- Justification of convenience samples for estimating population values
- Use model-based inference
- See Sharon Lohr's Sampling: Design and Analysis
- But what is the model, and how do you know it fits non-volunteers?
32. 2000s
- Coverage
- Nonresponse
- Measurement error
- Massive amounts of data available
- Networked data
- Multiple sources, linking
- Data fusion
33. Danger
- Ready availability of data
- Wilkinson (2008) structural equation software
- Correlational studies ?
- Designed experiments ?
- Designed surveys are important
- Careful data collection
- Inference to population
34. New uses for survey data
- Detecting anomalies
- False discovery rates
- Forecasting
- Better survey design
- Combining information from surveys
- From data sampling to data integration
- ????
35. New uses for survey methods
- Relationships in massive data sets
- SRS sometimes used, but rarely other designs
- Dynamic data collection
- Data dispersed on servers
- Microarray data
- Effectiveness of medical treatments
- Value added by teachers
36. Better connections
- Tukey (1962), The Future of Data Analysis:
- It is, incidentally, both surprising and unfortunate that those concerned with statistical theory and statistical mathematics have had so little contact with the recent developments of sophisticated procedures of empirical sampling.
37. Better connections
- Efron (2007), The Future of Statistics:
- Statistics is in a period of rapid expansion and change. During such times, it pays to concentrate on basics and not tie oneself too closely to any one technology or analysis fad.
38. Training for the Future
- Balanced diet: mathematical and statistical background that will give flexibility
- Variety of backgrounds
- Parallels with 1930s
- Economic
- Need for more survey theory, expertise
- Who foresaw probability sampling in 1920?
39. Statistics Curriculum
[Diagram: Mathematical Theory; Methodology (Regression, Categorical, Time Series, etc.)]
40. Statistics Curriculum
[Diagram: Mathematical Theory; Methodology (Regression, Categorical, Time Series, etc.); Sampling]
41. Training for the Future
- Still need
- mathematical theory for statistics
- methodology
- probability and model-based sampling
- But these need to be updated
- Solve problems using statistical thinking
- Integrate theory and practice
- Emphasize data collection
42. Socialization
- Better integration of survey sampling with other courses
- Asymptotics, probability
- Computing
- See stat.berkeley.edu/users/statcur
- Some students should learn about
- Machine learning
- Graph and social network theory
- Spatial statistics, bioinformatics,
43. Statistics Curriculum
Data Mining
Sampling
Data and Statistical Thinking
Mathematical Theory
DOE
Statistical Methodology
44. Species Survival
- Groves Senate Confirmation Hearing, May 15
- Sen. Akaka: The federal government is facing major human capital challenges; 45% of current Census employees will be eligible to retire next year.
- Bob Groves: I am terribly worried about this problem; the number of programs in the country training people that have the requisite skills for the Census Bureau is way below the need.
45. SRMS Distribution
46. SRMS Members per Million People
47. Morris Hansen
- Born Thermopolis, WY, 1910
- Univ. Wyoming (Deming, Bryant)
- Bachelor's degree, accounting, 1934
- Why did he become a survey statistician?
Interview with I. Olkin in Statistical Science, 1987
48. Morris Hansen
In accounting I was exposed to courses in
economic statistics by a professor in the
Commerce Department. He was a really fascinating
teacher and got me interested in statistics. When
I finished those courses, I thought I knew
something about statistics and learned later that
was a misconception. But I knew a little and
decided that I would like to go into statistics.
49. Teacher: Forest R. Hall
- Asst prof, 1927
- Depression: Regional Director of Dept of Labor
- 4-state Study of Consumer Purchases
50. Propagating the Species
- Data is not the plural of anecdote
- But recruitment is anecdotal, personal
- Activities that allow students to experience importance, excitement of subject
- Great teaching
- Sampling in intro stat, graduate curriculum
- Work with survey investigations
- Numerical detectives (B. Joiner)
51. Adult Care
- Balanced diet
- Exercise
- Natural sunlight
- Socialization
- Bathing
- Reproduction
- Good teaching
- Collateral reproduction
- High pay