Record matching for census purposes in the Netherlands


Title: Record matching for census purposes in the Netherlands


1
Record matching for census purposes in the
Netherlands
  • Eric Schulte Nordholt
  • Senior researcher and project leader of the
    Census
  • Statistics Netherlands
  • Division Social and Spatial Statistics
  • Department Support and Development
  • Section Research and Development
  • ESLE_at_CBS.NL
  • Joint UNECE/Eurostat Meeting on Population and
    Housing Censuses in Astana
  • 4-6 June 2007

2
Contents
  • History of the Dutch Census
  • Data sources
  • Micro linkage
  • Micro integration
  • Social Statistical Database
  • Estimation aspects
  • Statistical confidentiality
  • Conclusions

3
History of the Dutch Census
  • TRADITIONAL CENSUS
  • Ministry of Home Affairs
  • 1829, 1839, 1849, 1859, 1869, 1879 and 1889
  • Statistics Netherlands
  • 1899, 1909, 1920, 1930, 1947, 1960 and 1971
  • Unwillingness (nonresponse) and reduction of
    expenses → no more traditional censuses
  • ALTERNATIVE VIRTUAL CENSUS
  • 1981 and 1991: Population Register and surveys
  • Development in the 1990s: more registers →
  • 2001: integrated set of registers and surveys,
    SSD

4
Data sources
  • Registers
  • Population Register (PR): 16 million records;
    demographic variables such as sex, age and
    household status
  • Jobs file: employees, 6.5 million records, and
    self-employed persons, 790 thousand records;
    dates of job, branch of economic activity
  • Fiscal administration (FIBASE): jobs, 7.2
    million records, and pensions and life
    insurance benefits, 2.7 million records
  • Social Security administrations: 2 million
    records; auxiliary information for the
    integration process
  • Surveys
  • Survey on Employment and Earnings (SEE): 3
    million records; working hours, place of work
  • Labour Force Survey (LFS): 2 years, 230,000
    records; education, occupation, (economic)
    activity

5
Matching process
  • Matching of registers and datasets to a
    self-constructed Central Matching File
  • Records are identified by a surrogate identifier
    (RIN)
  • One unique table RIN-Social Security Number
  • Minimal set of identifying variables
  • Every step in the process is a deterministic
    match

6
Statistics Netherlands backbone of persons
The Central Matching File (April 2007): 46,436,060 records, 16,334,210 unique persons
  • Social security number (sofi): < 0.03 unknown for 1995-2007
  • Date of birth: < 0.5 unknown month and/or day
  • Gender: always
  • Postal code: < 0.05 unknown
  • House number: < 0.05 unknown
  • RIN Person: always
  • RIN Address: always
  • Time frame of variable validity: always
7
Matching process
  • Social security number matching, with a check on
    date of birth and gender. A match is valid when
    no more than one of the variables year, month,
    day of birth and gender differs
  • else
  • Matching using other variables such as postal
    code, house number, date of birth and gender.
    All keys must match
  • else
  • Match on social security number without any
    check on other variables
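
The three-step cascade above can be sketched in Python. This is a minimal illustration, not the production procedure of Statistics Netherlands: the `Person` layout, the field names and the sample data are hypothetical, and accepting a step 2 match only when exactly one candidate remains is an assumption added here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Person:
    """Hypothetical record layout; field names are illustrative only."""
    rin: Optional[int]            # surrogate identifier, None on incoming records
    sofi: Optional[str]           # social security number, None if unknown
    birth_year: int
    birth_month: int
    birth_day: int
    gender: str
    postal_code: Optional[str]
    house_number: Optional[str]


CHECK_FIELDS = ("birth_year", "birth_month", "birth_day", "gender")


def differing_fields(a: Person, b: Person) -> int:
    """Count how many of year, month, day of birth and gender differ."""
    return sum(getattr(a, f) != getattr(b, f) for f in CHECK_FIELDS)


def match(record: Person, backbone: List[Person]) -> Tuple[Optional[int], Optional[int]]:
    """Return (RIN, step) for the first step that matches, else (None, None)."""
    by_sofi = [p for p in backbone
               if record.sofi is not None and p.sofi == record.sofi]

    # Step 1: sofi number with check; at most one check variable may differ.
    for p in by_sofi:
        if differing_fields(record, p) <= 1:
            return p.rin, 1

    # Step 2: postal code, house number, date of birth, gender; all must match.
    # Assumption: accept only if exactly one backbone person qualifies.
    if record.postal_code is not None and record.house_number is not None:
        candidates = [p for p in backbone
                      if p.postal_code == record.postal_code
                      and p.house_number == record.house_number
                      and differing_fields(record, p) == 0]
        if len(candidates) == 1:
            return candidates[0].rin, 2

    # Step 3: sofi number alone, without any check on other variables.
    if by_sofi:
        return by_sofi[0].rin, 3
    return None, None


# Hypothetical backbone and incoming records.
backbone = [
    Person(1, "123456782", 1970, 5, 17, "F", "1234AB", "10"),
    Person(2, "987654321", 1980, 1, 2, "M", "5678CD", "3"),
]
day_typo = Person(None, "123456782", 1970, 5, 18, "F", None, None)
no_sofi = Person(None, None, 1980, 1, 2, "M", "5678CD", "3")
sofi_only = Person(None, "123456782", 1971, 5, 17, "M", "9999ZZ", "1")
unknown = Person(None, None, 1990, 9, 9, "F", "0000XX", "1")
```

Every step is a deterministic comparison of keys; no similarity scores or probabilities are involved, in line with the process described on slide 5.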

8
Micro data with Surrogate Identifier
[Diagram labels: production environment SN; Municipal Population Register; Registers; Surveys; Direct Identifier; Surrogate Identifier (RIN); de-identified micro data; Micro data Preparation and documentation; Micro data Services; Social Statistics Database]
9
Example
Employment and Wages survey 2003
                                                        Records       %
Total                                                 3,801,246   100.0
Total matched                                         3,747,976    98.6
  1  Sofi number, year of birth, month, day, gender   3,577,090    94.1
  2  Postal code, year of birth, month, day, gender     164,267     4.3
  3  Sofi number                                          6,619     0.2
Not matched                                              53,270     1.4
  Valid sofi number                                      21,194     0.6
    valid postal code                                     5,799     0.2
    invalid postal code                                  10,294     0.3
    non-resident                                          5,101     0.1
  Unknown or invalid sofi number                         32,076     0.8
    valid postal code                                     8,718     0.2
    invalid postal code                                  20,052     0.5
    non-resident                                          3,306     0.1
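
As a quick check of the arithmetic in this example, the short Python sketch below recomputes the totals and percentages from the record counts on the slide. The variable names are illustrative; only the counts come from the presentation.

```python
# Record counts taken from the slide.
total = 3_801_246
matched_by_step = {1: 3_577_090, 2: 164_267, 3: 6_619}
not_matched = 53_270

matched = sum(matched_by_step.values())

# The three matching steps and the unmatched records add up to the total.
assert matched + not_matched == total


def pct(n: int) -> float:
    """Share of all survey records, in percent, rounded to one decimal."""
    return round(100 * n / total, 1)


print("matched:", matched, pct(matched))
for step, n in matched_by_step.items():
    print("step", step, n, pct(n))
print("not matched:", not_matched, pct(not_matched))
```

The first step alone accounts for about 94 percent of the records; the two fallback steps together add roughly 4.5 percentage points.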
10
Micro integration (1)
  • The aim of micro integration is
  • To check the linked data and modify incorrect
    records,
  • In such a way that the results that are to be
    published are of higher quality than the original
    sources

11
Micro integration (2)
  • To fulfil this demand an integrated process of
  • data editing,
  • derivation of statistical variables,
  • and imputation
  • is executed

12
Micro integration (3)
  • Constraints and limitations
  • Only variables that are to be published are micro
    integrated
  • Identity rules are necessary, e.g. the same
    variable in two sources or a relationship between
    two or more variables in one or more sources
  • No mass imputation
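
An identity rule of the kind "the same variable in two sources" can be illustrated with a small Python sketch. The data, the variable name `job_start` and the rule that one source is preferred in case of conflict are all assumptions made for illustration; the presentation does not specify how conflicts are resolved.

```python
def integrate_variable(source_a, source_b, variable, preferred="a"):
    """Compare one variable across two sources linked by RIN.

    source_a and source_b map RIN -> dict of variables. Returns the
    integrated values and the list of conflicts that were found.
    Assumption: on conflict the value of the preferred source is kept.
    """
    integrated = {}
    conflicts = []
    for rin in sorted(source_a.keys() & source_b.keys()):
        value_a = source_a[rin][variable]
        value_b = source_b[rin][variable]
        if value_a != value_b:
            conflicts.append((rin, value_a, value_b))
        integrated[rin] = value_a if preferred == "a" else value_b
    return integrated, conflicts


# Hypothetical linked records from a register and a survey.
jobs_register = {
    101: {"job_start": "2003-01-01"},
    102: {"job_start": "2003-03-15"},
}
survey = {
    101: {"job_start": "2003-01-01"},
    102: {"job_start": "2003-04-01"},
    103: {"job_start": "2003-06-01"},
}

integrated, conflicts = integrate_variable(jobs_register, survey, "job_start")
```

Only records present in both sources are compared, and no values are imputed for persons missing from one source, in keeping with the "no mass imputation" constraint.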

13
Social Statistical Database (SSD)
  • Social Statistical Database (SSD): set of
    integrated microdata files with coherent and
    detailed demographic and socio-economic data on
    persons, households, jobs and benefits
  • No remaining internal conflicting information
  • SSD set
  • Population Register (backbone)
  • Integrated jobs file
  • Integrated file of (social and other) benefits
  • Surveys, e.g. LFS
  • Combining element: RIN-person

14
Core and satellites (1)
[Diagram: SSD-core surrounded by eight satellites]
15
Core and satellites (2)
  • Core
  • contains only integral register information
  • contains the most important demographic and
    socio-economic information
  • contains only information that is used in at
    least two satellites

16
Core and satellites (3)
  • Satellites are produced in two steps
  • Copying and derivation of the relevant
    information from the core SSD
  • Adding of the unique information on a specific
    theme from registers and surveys
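
The two production steps can be sketched as follows. This is a simplified illustration with hypothetical data and variable names; restricting the satellite to persons that occur in the theme data is an assumption, not something the presentation states.

```python
def build_satellite(core, theme_data, core_variables):
    """Build one satellite keyed by RIN.

    Step 1: copy the relevant variables from the core.
    Step 2: add the theme-specific information from registers and surveys.
    Assumption: only persons present in the theme data are included.
    """
    satellite = {}
    for rin, theme_variables in theme_data.items():
        if rin not in core:
            continue  # cannot be linked to the backbone
        row = {name: core[rin][name] for name in core_variables}
        row.update(theme_variables)
        satellite[rin] = row
    return satellite


# Hypothetical core and theme data.
core = {
    1: {"sex": "F", "age": 34, "household_status": "partner"},
    2: {"sex": "M", "age": 51, "household_status": "single"},
}
education_theme = {
    1: {"education_level": "higher"},
    3: {"education_level": "primary"},  # not in the core, dropped
}

education_satellite = build_satellite(core, education_theme, ["sex", "age"])
```

Because every satellite copies its shared variables from the same core, tables from different satellites agree on those variables.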

17
Conclusions SSD
  • The SSD diminishes the administrative burden
  • The SSD increases:
  • The efficiency of statistics production
  • The accuracy of statistical outputs
  • The relevance of social statistics
  • The possibilities for social policy research

18
Estimation aspects
  • Surveys are samples from the population
  • If surveys are enriched with register
    information, estimates of the register variables
    from the enriched survey will be inconsistent
    with the counts from the entire register
  • Statistics Netherlands developed the method of
    consistent and repeated weighting to solve these
    inconsistencies
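
The inconsistency can be shown with a toy example. The sketch below is not the repeated weighting method itself; it uses a single post-stratification step as a simplified stand-in, and all counts and sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical register counts by sex for a population of 1000 persons.
register_counts = {"F": 520, "M": 480}

# Hypothetical survey sample: (sex from the register, employed from the survey).
sample = [("F", True), ("F", False), ("F", True), ("M", True), ("M", False)]


def weighted_counts(rows, weights, index):
    """Sum the weights per category of the variable at position `index`."""
    totals = defaultdict(float)
    for row, weight in zip(rows, weights):
        totals[row[index]] += weight
    return dict(totals)


# Equal design weights: population size divided by sample size.
population = sum(register_counts.values())
design_weights = [population / len(sample)] * len(sample)

# Estimating the register variable from the survey: F 600, M 400,
# which disagrees with the register counts of 520 and 480.
design_estimate = weighted_counts(sample, design_weights, 0)

# Post-stratification: rescale weights so the sample reproduces the register.
sample_sizes = weighted_counts(sample, [1.0] * len(sample), 0)
calibrated_weights = [register_counts[sex] / sample_sizes[sex] for sex, _ in sample]
calibrated_estimate = weighted_counts(sample, calibrated_weights, 0)

# Survey variable estimated with the calibrated weights.
employed_estimate = weighted_counts(sample, calibrated_weights, 1)
```

With the calibrated weights, the survey-based table by sex reproduces the register counts, so tables that combine register and survey variables no longer contradict the register margins.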

19
Statistical confidentiality
  • Administrative sources: IDs, variables,
    characteristics
  • Household surveys: IDs, variables
  • Identifiers: PINs, sex, date of birth, address
  • Persons backbone: full range of all persons
    from 1995 onwards
  • IDs in sources are replaced by random Record
    Identification Numbers (RINs)
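
Replacing source IDs by random RINs can be sketched as below. The class, the function, the field names and the nine-digit RIN range are hypothetical choices for illustration; the presentation only states that the RINs are random and that one table links RIN and social security number.

```python
import secrets


class RinRegistry:
    """Hypothetical lookup table: direct identifier -> random RIN."""

    def __init__(self):
        self._rin_by_id = {}
        self._used = set()

    def rin_for(self, direct_id):
        """Return the RIN for a direct identifier, drawing a new one if needed."""
        if direct_id not in self._rin_by_id:
            rin = secrets.randbelow(10**9)
            while rin in self._used:  # keep RINs unique
                rin = secrets.randbelow(10**9)
            self._used.add(rin)
            self._rin_by_id[direct_id] = rin
        return self._rin_by_id[direct_id]


def de_identify(record, registry, id_field="sofi"):
    """Return a copy of the record with the direct identifier replaced by a RIN."""
    out = {key: value for key, value in record.items() if key != id_field}
    out["rin"] = registry.rin_for(record[id_field])
    return out


registry = RinRegistry()
job = de_identify({"sofi": "123456782", "branch": "retail"}, registry)
benefit = de_identify({"sofi": "123456782", "benefit": "pension"}, registry)
other = de_identify({"sofi": "987654321", "branch": "health"}, registry)
```

Records of the same person from different sources receive the same RIN and stay linkable, while the RIN itself carries no information about the person. Access to the lookup table would need to be restricted for this to protect confidentiality.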
20
Conclusions
  • Matching is relatively cheap
  • Matching is relatively quick (short production
    time)
  • Micro integration remains important
  • The SSD has found its place in the organisation
  • Repeated weighting method guarantees consistent
    estimates
  • Statistical confidentiality aspects have become
    very important

21
Time for questions and discussion
Thank you for your attention!