Title: The Statistical Administrative Records System and Administrative Records Experiment 2000: System Design, Successes, and Challenges
1 The Statistical Administrative Records System and Administrative Records Experiment 2000: System Design, Successes, and Challenges
Dean H. Judson
Planning, Research and Evaluation Division
U.S. Census Bureau
2 Outline of Presentation
- General principles for using administrative records properly
- Overview of StARS/AREX history, goals and design
- Applications and evaluations: StARS 1999 and StARS 2000 versus Census 2000
3 General Principles for Using Administrative Records Properly
4 How Administrative Records Are Created and Used
Policy changes which change the definition of events and objects
Ontologies and thresholds for observation
Data collection
Data entry errors and coding schemes
Data management issues
Query structure and spurious structure
5 Some Important Principles
- Database ≠ Population!
- Database ≠ Truth!
- The true Data exist in the real world, as does the true Population.
- But the database gives us information that points to the Truth, and points to the Population.
6 Oops! Accidentally included contractors!
7 Ontologies and Data Quality
[Figure: four panels mapping real-world states to database states (States 1 to 4): Proper Representation, Incomplete Representation, Ambiguous Representation, Meaningless States]
Data Quality: The function that maps from real world to database allows one to reconstruct the real world from the database values. Source: Wand and Wang, 1996: 90
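The four panel titles on this slide can be read as properties of a mapping from real-world states to database states. The following is a minimal sketch of that reading, not code from the presentation: the function name `classify_mapping`, the dictionary encoding of the mapping, and the exact test for each defect are assumptions made for illustration.

```python
def classify_mapping(real_states, db_states, mapping):
    """Label a real-world -> database mapping with the defects it shows.

    `mapping` sends each real-world state to a database state, or to None
    when the state cannot be recorded at all (hypothetical encoding).
    """
    problems = set()

    # Incomplete: some real-world state has no database state.
    if any(mapping.get(r) is None for r in real_states):
        problems.add("incomplete")

    used = [mapping[r] for r in real_states if mapping.get(r) is not None]

    # Ambiguous: two real-world states land on the same database state,
    # so the real world cannot be reconstructed from the database value.
    if len(used) != len(set(used)):
        problems.add("ambiguous")

    # Meaningless: a database state that no real-world state maps to.
    if set(db_states) - set(used):
        problems.add("meaningless")

    return problems or {"proper"}


print(classify_mapping([1, 2, 3], ["a", "b", "c"], {1: "a", 2: "b", 3: "c"}))
print(classify_mapping([1, 2, 3], ["a", "b"], {1: "a", 2: "a", 3: "b"}))
```

Under this reading, only a mapping labeled "proper" lets one reconstruct the real world from the database values, which is the sense of data quality quoted on the slide.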
8 Coverage versus Intensity/Content: How can we get the best of both?
9 A Model for Borrowing Strength
[Diagram labels:]
Original DW Database (X)
Representative Sample of X
Carefully Collected Data (Y): Ground Truth
Estimated Model Y = f(X)
Augmented DW Database, with X and estimated Ys
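The diagram's idea can be illustrated with a toy calculation: estimate Y = f(X) on a small sample where Y was carefully collected, then attach estimated Ys to every record in the large database. This is only a sketch under stated assumptions. The slide does not say what form f takes, so a straight line fitted by least squares stands in for it, and all numbers and names (`fit_line`, `database_x`, `sample_y`) are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Original database: X is known for every record (made-up values).
database_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Representative sample of X for which Y was carefully collected.
sample_idx = [0, 2, 4]
sample_y = [3.0, 7.0, 11.0]

# Estimate the model on the sample only.
intercept, slope = fit_line([database_x[i] for i in sample_idx], sample_y)

# Augmented database: every record now carries X and an estimated Y.
augmented = [(x, intercept + slope * x) for x in database_x]
print(augmented)
```

The large database supplies coverage and the small sample supplies content; the fitted model is what carries content from one to the other.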
10 Statistical Administrative Records System and Administrative Records Experiment
11 Background and History
- Statistical Administrative Records System
  - Six large Federal input files: IRS 1040, IRS 1099, Selective Service, Medicare, Indian Health Service, HUD-TRACS/MTCS
  - One lookup file: SSA/Census NUMIDENT
- AREX 2000
  - Attempt to use StARS data to simulate an administrative records census
12 What Was the Purpose of StARS 1999 and AREX 2000?
- Test the feasibility of an administrative records census
  - StARS: Nationwide
  - AREX: two counties in Maryland, three in Colorado
    - MD: 1.4M persons in 558K households
    - CO: 1.2M persons in 459K households
- Test two methods for conducting an administrative records census
  - top-down method
  - bottom-up method (match to address list, additional operations)
13 Can We Do This?
- Title 13, U.S. Code (§6, (a)-(c), abridged)
  - "The Secretary ... may call upon any other department ... of the Federal Government ... for information pertinent to the work provided for in this title ... To the maximum extent possible, the Secretary ... shall use such information instead of conducting direct inquiries"
- Privacy Act, 1974 (Title 5, §552a, abridged)
  - "No agency shall disclose any record ... unless ... to the Bureau of the Census for purposes of planning or carrying out a census or survey or related title 13 activity"
  - "Each agency that maintains a system of records shall ... publish in the Federal Register upon establishment ... the existence and character of the system of records" (StARS published in the Federal Register, January 1999)
14 The Statistical Administrative Records System-1999
[Process flow diagram; recoverable labels: Research; Extraction of AREX Test Site Records: 1,459,760 in Baltimore Site, 1,229,274 in Colorado Site]
15 Statistical Administrative Records System-2000 (DRAFT)
[Process flow diagram; no labels recoverable]
16 Administrative Records Experiment in 2000 (AREX 2000)
- Five selected sites in Maryland and Colorado
  - MD: Baltimore city, Baltimore county
  - CO: El Paso county, Douglas county, Jefferson county
- Attempt to simulate an Administrative Records Census
- Not all aspects of an Administrative Records Census are simulated
  - Group Quarters survey
  - Coverage measurement survey
- Special operations not included in StARS
  - Request for physical address (PO boxes/Rural Routes)
  - Clerical hand geocoding
  - Field verification of addresses not matched to DMAF
17 AREX 2000 Evaluations
- Process: Analyzing selected components of the AREX implementation processing
- Outcomes: Block level analysis, Age/Race/Sex/Hispanicity comparisons to Census 2000
- Household level analysis
  - Comparing household distributions for matched addresses
  - Assessing the feasibility of using administrative records in lieu of a field interview to obtain data on nonresponding households
- Available at www.census.gov/pred/www/rpts.html#AREX
  - (Synthesis of results from the Administrative Records Experiment in 2000)
18 Characteristics of Files Included in the StARS System
- IRS Individual Master 1040 File
  - Tax year data: April 2000 refers to tax year 1999
  - TY 99 file arrives October 2000
  - Business entities, estates, other institutions included
  - 120 million return records/year; maximum of six person records per return
  - Households below the filing threshold do not need to file
  - Late filers systematically different from early filers
  - Tax Filing Unit ≠ Housing Unit: 10-20% of addresses are PO Boxes, business addresses, tax preparers (Czajka, 2000)
  - TY95: SSNs of dependents requested, recorded
  - 0.5% of primary filer, 1.6% of secondary filer, 3.4% of dependents' SSNs in error (Czajka, 1987)
  - Age, race, sex, Hispanic origin microdata not available
19 Characteristics of Files Included in the StARS System, cont.
- IRS Information Returns Master File
  - Tax year data: April 2000 refers to tax year 1999
  - TY 99 file arrives October 2000
  - Business entities, estates, other institutions included
  - 700 million records/year
  - Recipient address ≠ Housing Unit
  - 10-20% of addresses are PO Boxes, business addresses, tax preparers
  - Extremely limited microdata content: Age, race, sex, Hispanic origin microdata not available; name information often truncated
  - Possible source of information on undocumented persons
20 Characteristics of Files Included in the StARS System, cont.
- Selective Service File
  - Requested 4/1/99(00) file cut date
  - 13 million records
  - Registration required in 1940, suspended in 1975, resumed in 1980
  - Presumably, males 18-25 are required to inform SSS when they move
  - Females, non-immigrant aliens, hospitalized, incarcerated, and institutionalized males, and members of the armed forces are exempt
  - Limited microdata content: Race, Hispanic origin microdata not available
  - Address information may not be current
21 Characteristics of Files Included in the StARS System, cont.
- Medicare Enrollment Database (EDB)
  - Requested 4/1/99(00) file cut date: current and historical Medicare enrollment (Active and Inactive cases)
  - 40 million records at any one point in time
  - Recipient Address ≠ Housing Unit
  - Proxy recipients listed on the file (e.g., John Doe's benefits c/o Jane Doe; John Doe's benefits c/o nursing home)
  - Used in population estimates system for 65+ household population estimates
  - A small portion of records at any point in time are almost certainly for deceased persons (Kim and Sater, 2000)
  - Coverage is high (93-102%) but not perfect and unevenly distributed geographically
  - Snowbird states appear to have lower ratios of Medicare to 65+ population than non-snowbird states (Kim and Sater, 2000)
22 Characteristics of Files Included in the StARS System, cont.
- Indian Health Service patient file
  - Requested 4/1/99(00) file cut date
  - 10 million patient/transaction records
  - Transaction record ≠ person record
  - Unduplication
    - about 10 million patient records, 2 million unduplicated SSNs
  - Many missing SSNs (about 20%)
  - Integral part of our race model
23 Characteristics of Files Included in the StARS System, cont.
- Housing and Urban Development Tenant Rental Assistance Certification System (HUD-TRACS/MTCS)
  - Requested 4/1/99(00) file cut date
  - HUD subsidy payments
  - TRACS 1999: 3.3 million records
  - TRACS 2000: 2 million records
  - Short form data for all members of household (Race/Hispanic only for head of household)
  - Address information may represent project or landlord address
24 Characteristics of Files Included in the StARS System, cont.
- Census NUMIDENT File
  - 700 million transaction records → 400 million individual SSN records
  - Post 1985: Enumeration at birth
  - For each SSN: Date of birth, gender, race, place of birth
  - About 50-60 million persons on the file are deceased but not identified as such
  - No current residence information on the file
  - Taxpayer ID Numbers (TINs) not on the file
- Demographic properties
  - About 35% of SSNs on file have alternate names (marriage, divorce, etc.)
  - About 6% missing gender
  - Race coding has changed (prior to 1980, 3 races: White, Black, Other); 20% either unknown or other
  - About 25% of SSNs have transactions with different race codes
25 Creating Final StARS Database
- Select best address and demographics based on
  - geocodability
  - currency
  - quality
- Impute missing demographics (from NUMIDENT/PERSON CHARACTERISTICS FILE)
- Flag records for deceased people
- Final database is like the census
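The "select best address" step can be pictured as ranking a person's candidate addresses on the three listed criteria. The sketch below is illustrative only: the slide does not give the actual decision rule, so the priority order (geocodability first, then currency, then quality), the field names, and the sample records are all assumptions.

```python
from datetime import date


def best_address(candidates):
    """Pick one address per person by (geocodable, as_of date, quality).

    Tuples compare left to right, so a geocodable address always beats a
    non-geocodable one, ties go to the more recent record, and remaining
    ties go to the higher (hypothetical) source quality score.
    """
    return max(
        candidates,
        key=lambda a: (a["geocodable"], a["as_of"], a["quality"]),
    )


# Invented candidate addresses for a single person.
candidates = [
    {"source": "Medicare", "address": "BOX 2 RURAL ROUTE 37",
     "geocodable": False, "as_of": date(1998, 10, 14), "quality": 2},
    {"source": "IRS 1040", "address": "486 MAIN STREET",
     "geocodable": True, "as_of": date(1998, 4, 15), "quality": 3},
]

chosen = best_address(candidates)
print(chosen["source"], chosen["address"])
```

Here the older but geocodable address wins over the newer rural-route address, which shows how much the outcome depends on the order in which the criteria are applied.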
26 Address Processing Results (StARS 1999)
- Almost 800 million addresses at start
- About 6 percent identified as potential businesses
- 136 million address records after unduplication
- About 75 percent geocoded
- 85 percent geocoding rate for city-style addresses
27 Person Processing Results (StARS 1999)
- 875 million records at start
- 845 million have valid SSN record (96.5%)
- 280 million after unduplication by SSN
- 261 million after removal of known deceased
- 257 million after removal of known deceased and persons residing in outlying territories
- StARS 2000: 266 million after removal of known deceased before April 1, 2000 and persons residing in outlying territories
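The StARS 1999 person counts above form a processing funnel, and a short calculation makes the size of each step explicit. This is simple arithmetic on the rounded figures from the slide, so the shares are approximate; for example, 845/875 comes out near 96.6%, while the slide's 96.5% was presumably computed from unrounded counts.

```python
# Rounded counts, in millions, as listed on the slide.
steps = [
    ("records at start", 875),
    ("valid SSN record", 845),
    ("after unduplication by SSN", 280),
    ("after removal of known deceased", 261),
    ("after removal of deceased and outlying territories", 257),
]

# Compare each step with the one before it.
for (_, prev), (label, cur) in zip(steps, steps[1:]):
    print(f"{label}: {cur}M ({cur / prev:.1%} of previous, {prev - cur}M removed)")

valid_share = steps[1][1] / steps[0][1]
records_per_ssn = steps[1][1] / steps[2][1]
print(f"valid SSN share: {valid_share:.1%}")
print(f"input records per unduplicated SSN: {records_per_ssn:.2f}")
```

On these figures, each unduplicated SSN is backed by roughly three input records, which is why the unduplication and "best record" decisions matter so much.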
28 Additional Operations of AREX 2000
- Clerical geocoding
- Request for physical address (for P.O. Boxes, etc.)
- Match to Decennial Master Address File
- Field address verification
29 Major Analytic Issues with StARS Processing
- Ontologies
  - The way in which an administrative agency defines the world may not match the way the Census Bureau defines the world, e.g.,
    - A delivery address suitable for receiving a payment check may not suffice for putting individuals at a street address
    - Difficult to distinguish individual units within the Basic Street Address
    - Race coding: Hispanic Origin is a separate race on NUMIDENT
    - Transaction data ≠ person data
    - How many names does a person have (and in what order)?
- Proxies: IRS and Medicare records
    JOHN WILSON
    C/O MARY SMITH
    1004 LAUREL LANE
    ROCKMONT, MD 22345
  - The address is (presumably) for Mary Smith. John Wilson may or may not live there.
30 Major Analytic Issues with StARS Processing, cont.
- Addresses that are difficult to place on the ground
  - About 10% of addresses are rural style
  - PO Boxes: 45% for IHS, 9.5% for Medicare, 7.5% for IRS 1040, 6.8% for SSS, 3.8% for IRS 1099, 0.4% for HUD-TRACS (Huang and Kim, 2000)
  - 1995 IRS/CPS match: 86.5% of tax return cases had the same address as residence address, 94% coded to same county (Sater, 1995)
    John Smith
    H&R BLOCK
    P.O. BOX 12
    GREENWAY, MD 29752
- Addresses with both business and residential components
    Dean H. Judson
    JUDSON OLD GROWTH LOGGING SERVICES
    45850 BACKWOODS HIGHWAY
    BOONDOCKS, OR 96432
31 Major Analytic Issues with StARS Processing, cont.
- Unduplication and matching
  - Addresses and personal characteristics are measured with substantial variation
  - Often not obvious whether a particular pair of records represents a duplicate or not
  - Yet, with multiple files, unduplication decisions must be made
- Address matching
  - 101 Elm Rd, 1 97132
  - 101 Elm St, apt 1 97701
  - Versus
  - 101 Elm Rd, 1 97132
  - 101 Elm St, apt 1 97132
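The two address pairs above differ only in whether the ZIP codes agree, and a field-by-field comparison shows why that can change the match decision. The sketch below is a toy illustration, not the StARS matching procedure: the regular expression only handles addresses shaped like these examples, and the rule in `likely_same` (tolerate a differing street suffix when number, name, unit, and ZIP all agree) is a hypothetical choice.

```python
import re

# Assumed shape: "<number> <street name> <suffix>, [apt |#]<unit> <zip>"
ADDRESS_RE = re.compile(
    r"(\d+)\s+(\w+)\s+(\w+),\s*(?:apt\s*|#)?(\w+)\s+(\d{5})$", re.I
)


def parse(addr):
    """Split an example-style address into comparable fields."""
    m = ADDRESS_RE.match(addr.strip())
    if m is None:
        raise ValueError(f"unrecognized address format: {addr!r}")
    number, name, suffix, unit, zip5 = m.groups()
    return {"number": number, "name": name.upper(),
            "suffix": suffix.upper(), "unit": unit, "zip": zip5}


def agreement(a, b):
    """Report, field by field, whether two addresses agree."""
    pa, pb = parse(a), parse(b)
    return {field: pa[field] == pb[field] for field in pa}


def likely_same(a, b):
    """Hypothetical rule: ignore the suffix if everything else agrees."""
    agree = agreement(a, b)
    return all(agree[f] for f in ("number", "name", "unit", "zip"))


print(likely_same("101 Elm Rd, 1 97132", "101 Elm St, apt 1 97701"))
print(likely_same("101 Elm Rd, 1 97132", "101 Elm St, apt 1 97132"))
```

Under this rule the first pair is rejected because the ZIP codes differ, while the second is accepted with "Rd" versus "St" treated as recording variation. A production matcher would weigh such agreements probabilistically rather than with a fixed rule.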
32 Major Analytic Issues with StARS Processing, cont.
- Variations in data from different sources
  - Of the 50% of SSNs found on multiple files,
    - about 1% have more than one gender recorded
    - about 32% have multiple addresses
    - about 2% have multiple races (Huang and Kim, 2000)
- Imputation from the NUMIDENT
  - Many files have limited microdata. For those that are found on the NUMIDENT, we can impute microdata from the approximately equivalent NUMIDENT fields.
  - Race Model (Bye, 1998, 1999)
  - Gender Model (Thompson, 1999)
  - Mortality Model (Falkenstein, Resnick, and Judson, 2000)
- StARS 2002 NUMIDENT Race Enhancement
  - Match NUMIDENT to Census 2000
  - Use Census 2000 race response to improve imputation model
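Two of the operations described on this slide, finding SSNs whose sources disagree and filling missing fields from a lookup file, can be sketched in a few lines. This is a simplified illustration: the record layout, field names, and SSNs are invented, and a direct lookup stands in for the race, gender, and mortality models cited above, which are more elaborate than a simple copy.

```python
from collections import defaultdict


def conflicting_ssns(records, field):
    """Return SSNs whose source records carry more than one value for `field`."""
    values = defaultdict(set)
    for rec in records:
        if rec.get(field) is not None:
            values[rec["ssn"]].add(rec[field])
    return {ssn for ssn, seen in values.items() if len(seen) > 1}


def impute_from_lookup(record, lookup, fields=("sex", "race")):
    """Fill missing fields from a NUMIDENT-style lookup keyed by SSN."""
    filled = dict(record)
    known = lookup.get(record["ssn"], {})
    for field in fields:
        if filled.get(field) is None and field in known:
            filled[field] = known[field]
    return filled


# Invented records from several source files.
records = [
    {"ssn": "000-00-0001", "source": "IRS 1040", "sex": None},
    {"ssn": "000-00-0001", "source": "Medicare", "sex": "F"},
    {"ssn": "000-00-0002", "source": "SSS", "sex": "M"},
    {"ssn": "000-00-0002", "source": "HUD-TRACS", "sex": "F"},
]
lookup = {"000-00-0001": {"sex": "F", "race": "White"}}

print(conflicting_ssns(records, "sex"))
print(impute_from_lookup(records[0], lookup))
```

The first function flags cases like the roughly 1% of multi-file SSNs with more than one gender recorded; the second shows why a file with no demographic microdata, such as the IRS 1040, becomes usable once it can be matched to the lookup file.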
33 Major Analytic Issues with StARS Processing, cont.
- Changing information states
  - Distinct problem from point in time data collection
  - Information states change over time/over databases
  - Address information ages over time and varies over databases
    SAM SMITH
    BOX 2 RURAL ROUTE 37
    WESTPORT, VA 32784
    (Dated 10/14/98 from Medicare)

    SAM SMITH
    486 MAIN STREET
    FAIRFIELD, VA 33412
    (From TY97 IRS file, filed sometime in 1998)
  - Mortality information ages over time and varies over databases
  - One database provides information about the other, provided that matching can be performed
- Data processing requires complex, and substantively important, decision logic at each step
34 Applications and Evaluations
35 Applications
- SSN search and validation with GEOkey
  - Earlier: 90% found in validation step, 5% in search step
  - 2001 Evaluation: 92% found in search (with GEOkey) alone
  - Apparently, our computer search outperforms SSA manual system
- CPS/NHIS/ACS to Census matching evaluations
  - Compare different race responses
  - Compare survey and Census coverage
  - Compare variations in Poverty estimates
- Evaluation of synthetic estimation methods (Popoff, Judson and Fadali, 2001)
- Multiple-system Estimation for coverage evaluation
  - Additional information to aid dual-system estimation (Asher and Fienberg, 2001)
  - Erroneous enumerations (Biemer, Brown, Wiesen, and Judson, 2001)
36 Applications
- Nonresponse follow up (NRFU) substitution ('04 simulation test)
- Imputation methods improvement ('04 simulation test)
- Master Address File (MAF) targeting
- Census unduplication confirmation
- Population estimation (postcensal estimates)
- Survey improvement (noninterview adjustments)
37 Evaluations
- Numident/PCF 1998 versus 1998 National estimates (Miller, Judson and Sater, 2000)
- State level comparisons of StARS 2000 versus Census 2000
- County StARS-synthetic methods versus county ratio estimates and Census 2000
- Detailed comparison by (fully crossed) age, race, sex, and Hispanic origin counts versus Census 2000, at the county level
- AREX tract, block, household evaluations on February 19th
38 Numident/PCF 1998 versus 1998 National Estimates
39 Numident/PCF 1998 versus 1998 National Estimates
40 State Level Comparisons of Census 2000 to StARS 2000
41 County StARS-synthetic Methods versus 1999 Estimates
42 County StARS-synthetic methods versus 1999 Estimates versus Census 2000
[Figure: Hispanic (StARS 99 vs. 99 Estimates vs. Census 2000), selected counties in Colorado where StARS and Estimates deviate by more than 4 percentage points. Vertical axis: 0 to 90. Series: StARS 99, Census 2000, 99 Estimates. Counties shown: Bent, Otero, Kiowa, Chaffee, Morgan, Pueblo, Costilla, Garfield, Lincoln, Mineral, Phillips, Conejos, Crowley, Fremont, La Plata, Alamosa, Huerfano, Archuleta, San Juan, Saguache, Las Animas. Counties in which StARS 99 is closer to Census 2000 are marked with a star.]
43 Fully crossed age, race, sex, and Hispanic Origin array (ARSH array)
- For every county in the U.S., count the number of nondeceased persons by
  - Single year of age (0 to 101)
  - Race (four groups)
  - Sex (two groups)
  - Hispanic origin (Hispanic/non)
- Potentially 102 x 4 x 2 x 2 = 1632 cells per county, 3141 x 1632 = 5,126,112 in the U.S.
- Error Measures
  - Simple difference (C-S)
  - Algebraic percent error (S-C)/C
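The cell counts and the two error measures on this slide can be checked with a short calculation. The sketch assumes that C is the Census 2000 count and S the StARS count for a cell, that the percent error is expressed by multiplying (S-C)/C by 100, and that cells with a zero census count are left undefined; the slide states none of these explicitly, and the sample counts are invented.

```python
ages = range(0, 102)  # single years of age, 0 through 101
n_race, n_sex, n_hispanic = 4, 2, 2
n_counties = 3141

cells_per_county = len(ages) * n_race * n_sex * n_hispanic
total_cells = n_counties * cells_per_county


def simple_difference(census, stars):
    """C - S: positive when StARS counts fewer people than the census."""
    return census - stars


def algebraic_percent_error(census, stars):
    """100 * (S - C) / C, keeping the sign; undefined when C is zero."""
    if census == 0:
        return None
    return 100.0 * (stars - census) / census


print(cells_per_county, total_cells)
print(simple_difference(200, 190), algebraic_percent_error(200, 190))
```

Note that the two measures carry opposite signs for the same cell: a StARS undercount gives a positive simple difference and a negative algebraic percent error.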
44 Note: Each data point is a single county's ARSH cell.
45 Note: Each data point is a single county's ARSH cell.
46 Age/Sex distributions, selected counties in Texas
[Figure panels: Anderson County (N of Houston); Andrews County (Far west, NM border); Brazos County (W of Houston); Atascosa County (Southern part of state)]
47 Concluding Thoughts
- Historians of science will say that there was an explosion of research into Administrative Records and Data Warehousing in the late 20th/early 21st century
- Using these databases in a statistically principled way requires a new statistical paradigm
  - Not survey sampling per se
  - Not econometric modeling per se
  - Not coverage measurement per se
  - Something new
- These databases share some data quality issues with the usual survey or census data, but many of their issues are different
- We are attacking these issues with real Census applications
48 For Further Reading
- Alvey, W., and Scheuren, F. (1982). Background for an Administrative Records Census. Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association.
- Asher, J., and Fienberg, S. (2001). Statistical Variations on an Administrative Records Census. Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association.
- Biemer, P., Brown, G., Wiesen, C., and Judson, D.H. (2001). Triple system estimation in the presence of erroneous enumerations. Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association. Under review at the Journal of Official Statistics.
- Bye, B. (1997). Administrative Record Census for 2010 Design Proposal, Final Report. Rockville, MD: Westat, Inc.
- Bye, B. (1998). Race and ethnicity modeling with SSA Numident Data. Interim report: File development and tabulations. Unpublished document available from the U.S. Bureau of the Census.
- Bryant, C. (1995). Comparing the LUCA address list to local records. Paper presented at the 1995 State Data Center Meeting, San Francisco, CA, April 4, 1995.
- Czajka, J., Moreno, L., and Schirm, A.L. (1997). On the Feasibility of Using Internal Revenue Service Records to Count the U.S. Population. Washington, DC: Mathematica Policy Research, Inc.
- Czajka, J. (1999). Can we count on administrative records in future U.S. Censuses? Presentation at the Bureau of the Census, December 15, 1999.
- Falkenstein, Matthew, Resnick, Dean R., and Judson, Dean H. (2000). The Mortality Module of the Statistical Administrative Records System. Administrative Records Memorandum Series, U.S. Census Bureau.
- Farber, Jim, and Shaw, Kevin M. (2002). Dual System Estimates of Housing Units Based on Administrative Records. To appear in the 2002 Proceedings of the American Statistical Association, Government Statistics Section, CD-ROM. Alexandria, VA: American Statistical Association.
- Heimovitz, Harley K. (2002). Administrative Records Experiment 2000 Outcomes. To appear in the 2002 Proceedings of the American Statistical Association, Government Statistics Section, CD-ROM. Alexandria, VA: American Statistical Association.
- Huang, E., and Kim, J. (2000). One Percent Sample Study Report (SRD-DRAFT). Unpublished document available from the U.S. Bureau of the Census, February 10, 2000.
49 For Further Reading
- Judson, D.H., and Popoff, C.L. (2000). Research Use of Administrative Records. University of Nevada: Nevada State Demographer's Office.
- Judson, D.H. (2000). The Statistical Administrative Records System: System Design, Successes, and Challenges. Paper presented at the 2000 Data Quality Workshop, Morristown, NJ, Nov 30-Dec 1.
- Judson, D.H., Popoff, Carole L., and Batutis, Michael (2001). An Evaluation of the Accuracy of U.S. Census Bureau County Population Estimation Methods. Statistics in Transition, 5: 185-215.
- Judson, D.H. (2001). A Partial Order Approach to Record Linkage. Paper presented at the Federal Committee on Statistical Methodology, Washington, DC, November 14, 2001.
- Judson, D.H. (2002). Adventures in Bayesian Record Linkage. Paper presented at the Classification Society of North America, June 11, 2002.
- Judson, Dean H. (2002). Merging Administrative Records Databases in the Absence of a Register: Data Quality Concerns and Outcomes of an Experiment in Administrative Records Use. Paper presented at the UNECE-EUROSTAT work session on registers and administrative records in social and demographic statistics, Geneva, Switzerland, 9-11 December 2002.
- Kim, M.O., and Sater, D. (2000). Defining the Medicare Data Universe for the U.S. Census Bureau's Population Estimates Program. Paper presented at the Southern Demographic Association meetings, New Orleans, LA, August 29, 2000.
- Leggieri, Charlene, and Prevost, Ron (1999). Expansion of Administrative Records Uses at the Census Bureau: A Long-Range Research Plan. Paper presented at the November 1999 Meeting of the Federal Committee on Statistical Methodology, Washington, DC.
- Miller, E., Judson, D.H., and Sater, D. (2000). The 100% Census NUMIDENT: Demographic Analysis of Modeled Race and Hispanic Origin Estimates Based Exclusively on Administrative Records Data. Paper presented at the Southern Demographic Association meetings, New Orleans, LA, August 29, 2000.
- Popoff, C.L., Judson, D.H., and Fadali, Betsy (2001). Measuring the Number of People Without Health Insurance: A Test of a Synthetic Estimates Approach for Small Area Estimates using SIPP Microdata. Paper presented at the Federal Committee on Statistical Methodology, Washington, DC, November 14, 2001.
50 For Further Reading
- Sailer, P., Weber, M., and Yau, E. (1993). How Well Can IRS Count the Population? 1993 Proceedings of the Survey Research Methods Section. Alexandria, VA: American Statistical Association.
- Sater, D. (1995). Differences in Location of Households and Tax Filing Units. Paper presented at the 1995 meeting of the Population Association of America, San Francisco, CA, April 6, 1995.
- Stuart, E., and Zaslavsky, A.M. (2002). Using administrative records to predict census day residency. In Constantine Gatsonis, Robert E. Kass, Alicia Carriquiry, Andrew Gelman, David Higdon, Donna K. Pauler, Isabella Verdinelli (Eds.), Case Studies in Bayesian Statistics, Volume VI. New York, NY: Springer.
- Thompson, Herbert (1999). The Development of a Gender Model with SSA Numident Data. Administrative Records Research Memorandum Series 32, U.S. Census Bureau.
- Wand, Y., and Wang, R.Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39: 86-95.
- Zanutto, Elaine, and Zaslavsky, Alan M. (2001). Using Administrative Records to Impute for Nonresponse. In R. Groves, R.J.A. Little, and J. Eltinge (Eds.), Survey Nonresponse. New York: John Wiley.
51 Glossary of Terms
- Administrative records: Data collected wherein the primary purpose is to administer a regulation or record a transaction rather than data collection per se.
- Administrative Records Census: A Census of Population and Housing in which a predominant component of the census-taking is performed by using administrative records databases. In practice, field operations (for example, for coverage measurement or for Group Quarters enumeration) often coincide.
- AREX2000: Administrative Records Experiment in 2000, an experimental attempt to simulate an Administrative Records Census in two sites in the U.S.
- Basic Street Address: The primary street number and street name, omitting apartment numbers or other within-structure identifiers.
- CPS: Current Population Survey, an ongoing survey administered by the U.S. Census Bureau.
- Data Quality: The ability to construct a mapping from the ontological representation of a data item in a database to its appropriate ontological representation in the real world.
- Master Address File (MAF): A file of addresses maintained by the U.S. Census Bureau for the purpose of taking its decennial census, and acting as a frame for ongoing sample surveys. The Decennial Master Address File is referred to as the DMAF.
- Master Housing File: A file of addresses developed by the Statistical Administrative Records System.
- Microdata: Data on individual person or housing characteristics, e.g., race, sex, age, street address, zip code.
- Ontology: The study of what is, that is, the categories by which we understand the world.
- StARS: Statistical Administrative Records System, an experimental database that combines information from several major Federal databases into one database that can be used for census-taking purposes.