Title: Record matching for census purposes in the Netherlands
1 Record matching for census purposes in the Netherlands
- Eric Schulte Nordholt
- Senior researcher and project leader of the Census
- Statistics Netherlands
- Division Social and Spatial Statistics
- Department Support and Development
- Section Research and Development
- ESLE_at_CBS.NL
- Joint UNECE/Eurostat Meeting on Population and Housing Censuses in Astana
- 4-6 June 2007
2 Contents
- History of the Dutch Census
- Data sources
- Micro linkage
- Micro integration
- Social Statistical Database
- Estimation aspects
- Statistical confidentiality
- Conclusions
3 History of the Dutch Census
- TRADITIONAL CENSUS
- Ministry of Home Affairs
- 1829, 1839, 1849, 1859, 1869, 1879 and 1889
- Statistics Netherlands
- 1899, 1909, 1920, 1930, 1947, 1960 and 1971
- Unwillingness (nonresponse) and reduction of expenses → no more traditional censuses
- ALTERNATIVE: VIRTUAL CENSUS
- 1981 and 1991: Population Register and surveys
- Development in the 1990s: more registers →
- 2001: integrated set of registers and surveys, SSD
4 Data sources
- Registers
- Population Register (PR), 16 million records: demographic variables (sex, age, household status, etc.)
- Jobs file: employees, 6.5 million records, and self-employed persons, 790 thousand records; dates of job, branch of economic activity
- Fiscal administration (FIBASE): jobs, 7.2 million records, and pensions and life insurance benefits, 2.7 million records
- Social Security administrations, 2 million records: auxiliary information for the integration process
- Surveys
- Survey on Employment and Earnings (SEE), 3 million records: working hours, place of work
- Labour Force Survey (LFS), 2 years, 230,000 records: education, occupation, (economic) activity
5 Matching process
- Matching of registers and datasets to a self-constructed Central Matching File
- Records are identified by a surrogate identifier (RIN)
- One unique table RIN-Social Security Number
- Minimal set of identifying variables
- Every step in the process is a deterministic match
6 Statistics Netherlands backbone of persons
The Central Matching File (April 2007): 46,436,060 records, 16,334,210 unique persons
- Social security number (sofi): < 0.03 unknown for 1995-2007
- Date of birth: < 0.5 unknown month and/or day
- Gender: always
- Postal code: < 0.05 unknown
- House number: < 0.05 unknown
- RIN Person: always
- RIN Address: always
- Time frame of variable validity: always
7 Matching process
- Social security number matching. Check on date of birth and gender. A valid match when no more than one of the variables year, month, day of birth and gender differs
- else
- Matching using other variables like postal code, house number, date of birth, gender. All keys must match
- else
- Match on social security number without any control on other variables
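The three steps above can be sketched as a deterministic cascade. This is only an illustration under stated assumptions: the `Person` class, its field names and the `match` function are hypothetical, and the rule that step 2 needs exactly one candidate is an added assumption, not something stated on the slide.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Person:
    sofi: Optional[str]          # social security number, None if unknown
    year: int
    month: int
    day: int
    gender: str
    postal_code: Optional[str] = None
    house_number: Optional[str] = None
    rin: Optional[int] = None    # surrogate identifier, set only in the backbone


CHECK_FIELDS = ("year", "month", "day", "gender")


def differences(a: Person, b: Person) -> int:
    """Number of check variables on which two records disagree."""
    return sum(getattr(a, f) != getattr(b, f) for f in CHECK_FIELDS)


def match(record: Person, backbone: List[Person]) -> Tuple[Optional[int], Optional[int]]:
    """Return (RIN, step) for the first step that yields a match, else (None, None)."""
    same_sofi = [p for p in backbone if record.sofi is not None and p.sofi == record.sofi]

    # Step 1: sofi number, with at most one of year, month, day, gender differing.
    for p in same_sofi:
        if differences(record, p) <= 1:
            return p.rin, 1

    # Step 2: postal code, house number, date of birth, gender; all keys must match.
    if record.postal_code is not None and record.house_number is not None:
        same_keys = [
            p for p in backbone
            if (p.postal_code, p.house_number) == (record.postal_code, record.house_number)
            and differences(record, p) == 0
        ]
        if len(same_keys) == 1:  # assumption: ambiguous candidates are left unmatched
            return same_keys[0].rin, 2

    # Step 3: sofi number alone, without any control on other variables.
    if same_sofi:
        return same_sofi[0].rin, 3

    return None, None


# Made-up backbone records, for illustration only.
backbone = [
    Person("111", 1970, 1, 1, "F", "1234AB", "10", rin=1),
    Person("222", 1980, 5, 20, "M", "5678CD", "3", rin=2),
]
```

For example, `match(Person("111", 1970, 1, 2, "F"), backbone)` differs from the backbone only in the day of birth, so it is accepted in step 1.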
8 Micro data with Surrogate Identifier
[Diagram. Labels: production environment SN; Municipal Population Register; Registers; Surveys; Direct Identifier; Surrogate Identifier (RIN); de-identified micro data; Micro data Preparation and documentation; Micro data Services; Social Statistics Database]
9 Example
                                                          Records       %
Employment and Wages survey 2003                        3,801,246   100.0
Total matched                                           3,747,976    98.6
  1  Sofi number, year of birth, month, day, gender     3,577,090    94.1
  2  Postal code, year of birth, month, day, gender       164,267     4.3
  3  Sofi number                                            6,619     0.2
Not matched                                                53,270     1.4
  Valid sofi number                                        21,194     0.6
    valid postal code                                       5,799     0.2
    invalid postal code                                    10,294     0.3
    non-resident                                            5,101     0.1
  Unknown or invalid sofi number                           32,076     0.8
    valid postal code                                       8,718     0.2
    invalid postal code                                    20,052     0.5
    non-resident                                            3,306     0.1
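The shares in this example follow directly from the record counts. A small check, using only the counts from the table (the variable and function names are illustrative):

```python
# Record counts from the Employment and Wages survey 2003 example.
total = 3_801_246
matched_by_step = {1: 3_577_090, 2: 164_267, 3: 6_619}
not_matched = 53_270


def share(n: int) -> float:
    """Percentage of all survey records, rounded to one decimal."""
    return round(100 * n / total, 1)


matched = sum(matched_by_step.values())
step_shares = {step: share(n) for step, n in matched_by_step.items()}

print(matched, share(matched))   # total matched and its share
print(step_shares)               # share matched in each step
print(share(not_matched))        # share not matched
```

The three matching steps add up to the total matched, and matched plus not matched add up to all survey records.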
10 Micro integration (1)
- The aim of micro integration is
- To check the linked data and modify incorrect records,
- In such a way that the results that are to be published are of higher quality than the original sources
11 Micro integration (2)
- To fulfil this demand an integrated process of
- data editing,
- derivation of statistical variables,
- and imputation
- is executed
12 Micro integration (3)
- Constraints and limitations
- Only variables that are to be published are micro integrated
- Identity rules are necessary, e.g. the same variable in two sources or a relationship between two or more variables in one or more sources
- No mass imputation
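The two kinds of identity rule named above can be illustrated with a minimal sketch. Everything in it is hypothetical: the record layout, the choice of gender as the shared variable, and the rule that a job cannot end before it starts are assumptions made for illustration, not rules taken from the presentation.

```python
from datetime import date


def conflicts_between_sources(source_a, source_b, variable):
    """Identity rule of the first kind: the same variable in two sources
    should agree for every person (RIN) present in both."""
    return sorted(
        rin for rin in source_a.keys() & source_b.keys()
        if source_a[rin][variable] != source_b[rin][variable]
    )


def jobs_violating_date_rule(jobs):
    """Identity rule of the second kind: a relationship between variables,
    here (as an assumption) that a job does not end before it starts."""
    return [
        job["rin"] for job in jobs
        if job["end"] is not None and job["end"] < job["start"]
    ]


# Made-up records keyed by RIN.
register = {101: {"gender": "F"}, 102: {"gender": "M"}, 103: {"gender": "F"}}
survey = {101: {"gender": "F"}, 102: {"gender": "F"}}
jobs = [
    {"rin": 101, "start": date(2000, 1, 1), "end": date(2001, 6, 30)},
    {"rin": 103, "start": date(2001, 3, 1), "end": date(2001, 2, 1)},
    {"rin": 102, "start": date(2001, 1, 1), "end": None},
]

flagged_gender = conflicts_between_sources(register, survey, "gender")
flagged_jobs = jobs_violating_date_rule(jobs)
```

The sketch only flags records; how a flagged record is then edited or imputed is a separate decision that it does not cover.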
13 Social Statistical Database (SSD)
- Social Statistical Database (SSD): set of integrated microdata files with coherent and detailed demographic and socio-economic data on persons, households, jobs and benefits
- No remaining internally conflicting information
- SSD set
- Population Register (backbone)
- Integrated jobs file
- Integrated file of (social and other) benefits
- Surveys, e.g. LFS
- Combining element: RIN-person
14 Core and satellites (1)
[Diagram: SSD-core with eight satellites]
15 Core and satellites (2)
- Core
- contains only integral register information
- contains the most important demographic and socio-economic information
- contains only information that is used in at least two satellites
16 Core and satellites (3)
- Satellites are produced in two steps
- Copying and derivation of the relevant information from the core SSD
- Adding of the unique information on a specific theme from registers and surveys
17 Conclusions SSD
- The SSD diminishes the administrative burden
- The SSD increases
- The efficiency of statistics production
- The accuracy of statistical outputs
- The relevance of social statistics
- The possibilities for social policy research
18 Estimation aspects
- Surveys are samples from the population
- If surveys are enriched with register information, estimates of the register part of the enriched survey will be inconsistent with the counts from the entire register
- Statistics Netherlands developed the method of consistent and repeated weighting to solve these inconsistencies
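The inconsistency described above, and the general idea of removing it by reweighting, can be shown with a toy example. This is a simplified post-stratification sketch and not the repeated weighting method itself; all counts, weights and names are made up for illustration.

```python
def weighted_counts(sample, weights, key):
    """Weighted survey estimate of the counts per category of `key`."""
    totals = {}
    for unit, w in zip(sample, weights):
        totals[unit[key]] = totals.get(unit[key], 0.0) + w
    return totals


def poststratify(sample, weights, key, register_counts):
    """Rescale weights so the survey reproduces the register counts for `key`."""
    current = weighted_counts(sample, weights, key)
    return [
        w * register_counts[unit[key]] / current[unit[key]]
        for unit, w in zip(sample, weights)
    ]


# Made-up register counts (known for the whole population).
register_counts = {"F": 520, "M": 480}

# Made-up survey of 10 persons; "employed" is observed in the survey only.
sample = (
    [{"gender": "F", "employed": e} for e in (True, True, True, False, False, False)]
    + [{"gender": "M", "employed": e} for e in (True, True, True, False)]
)
design_weights = [100.0] * len(sample)

# With the design weights, the survey estimates 600 women and 400 men,
# which contradicts the register counts of 520 and 480.
before = weighted_counts(sample, design_weights, "gender")

# After reweighting, the gender counts agree with the register, and the
# survey-only variable is estimated with the same, consistent weights.
new_weights = poststratify(sample, design_weights, "gender", register_counts)
after = weighted_counts(sample, new_weights, "gender")
employed_estimate = sum(w for unit, w in zip(sample, new_weights) if unit["employed"])
```

Any table estimated with the adjusted weights has gender margins that agree with the register, which is the kind of consistency the slide refers to.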
19 Statistical confidentiality
[Diagram. Labels: Administrative sources; Household surveys; IDs; Variables; Characteristics; Identifiers (PINs, sex, date of birth, address); PERSONS BACKBONE: full range of all persons as from 1995]
IDs in sources are replaced by random Record Identification Numbers (RINs)
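Replacing direct identifiers by random RINs can be sketched as follows. The `RinTable` class, the 9-digit RIN range and the `sofi` field name are assumptions for illustration and do not describe the actual production system.

```python
import secrets


class RinTable:
    """Lookup table from a direct identifier to a random surrogate identifier."""

    def __init__(self):
        self._by_direct_id = {}
        self._used = set()

    def rin_for(self, direct_id):
        """Return the RIN for a direct identifier, drawing a new one if needed."""
        if direct_id not in self._by_direct_id:
            rin = secrets.randbelow(10**9)      # random, carries no information
            while rin in self._used:            # redraw on the rare collision
                rin = secrets.randbelow(10**9)
            self._used.add(rin)
            self._by_direct_id[direct_id] = rin
        return self._by_direct_id[direct_id]


def de_identify(records, table, id_field="sofi"):
    """Drop the direct identifier from each record and attach the RIN instead."""
    result = []
    for record in records:
        clean = {k: v for k, v in record.items() if k != id_field}
        clean["rin"] = table.rin_for(record[id_field])
        result.append(clean)
    return result


# Made-up records from two sources that share one person.
table = RinTable()
register_out = de_identify([{"sofi": "111", "age": 37}, {"sofi": "222", "age": 52}], table)
survey_out = de_identify([{"sofi": "111", "education": "higher"}], table)
```

Because both sources go through the same table, the shared person receives the same RIN in each, so the de-identified files can still be linked without the original identifier.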
20 Conclusions
- Matching is relatively cheap
- Matching is relatively quick (short production time)
- Micro integration remains important
- The SSD has found its place in the organisation
- Repeated weighting method guarantees consistent estimates
- Statistical confidentiality aspects have become very important
21 Time for questions and discussion
Thank you for your attention!