Title: Sharing Solution for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences
NTTS 2009, Brussels, 18-20 February 2009
- Nicoletta Cibella (1), Gervasio-Luis Fernandez (2), Marco Fortini (1), Miguel Guigò (2), Francisco Hernandez (2), Monica Scannapieco (1), Laura Tosco (1), Tiziana Tuoto (1)
- (1) Italian National Statistical Institute (ISTAT), Italy
- (2) Spanish National Statistical Institute (INE), Spain
Theory and Practice in Developing a Record Linkage Software
Outline
- The Record Linkage
- The ESSnet on ISAD
- The Idea and the Features of the RELAIS Software
- The Italian and Spanish Experiences in Using RELAIS
- Towards RELAIS 2.0
- Conclusions
Nicoletta Cibella, Brussels, 19th February 2009
Record Linkage
The purpose of record linkage is to identify records that refer to the same real-world entity, which may be represented differently in different data sources.
Different approaches to record linkage:
- Exact RL
- Deterministic RL
- Probabilistic RL (Fellegi and Sunter theory)
- Bayesian RL
- Machine Learning
- Knowledge Representation
No particular technique has emerged as the best solution for all cases (possibly because no such solution exists).
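To make the probabilistic approach concrete: in the Fellegi and Sunter theory, a candidate pair is scored by summing, over the matching variables, the log-ratio of the probability of agreement among true matches (m) to the probability of agreement among non-matches (u). The sketch below is only an illustration under stated assumptions: the variables are treated as conditionally independent, and the m and u values and the two thresholds are invented for the example; they are not RELAIS estimates.

```python
import math

# Hypothetical m- and u-probabilities for three matching variables:
# m = P(values agree | pair is a match), u = P(values agree | pair is a non-match).
M_U = {
    "surname": (0.95, 0.01),
    "name": (0.90, 0.05),
    "gender": (0.98, 0.50),
}


def match_weight(agreement):
    """Sum of log2 likelihood ratios over the matching variables.

    `agreement` maps each variable name to True (agree) or False (disagree).
    """
    total = 0.0
    for var, (m, u) in M_U.items():
        if agreement[var]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=6.0):
    """Three-way decision using two illustrative thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


full = match_weight({"surname": True, "name": True, "gender": True})
partial = match_weight({"surname": True, "name": False, "gender": True})
none = match_weight({"surname": False, "name": False, "gender": False})
print(classify(full), classify(partial), classify(none))
```

Pairs that fall between the two thresholds are the ones that would typically go to clerical review.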
Record Linkage Complexity
Record linkage techniques are a multidisciplinary set of methods and practices:
- DECISION MODEL CHOICE
  - Fellegi-Sunter
  - Deterministic
  - Bayesian
  - Knowledge based
  - Mixed
- SEARCH SPACE REDUCTION
  - Sorted Neighbourhood Method
  - Blocking
  - Hierarchical Grouping
- COMPARISON FUNCTION CHOICE
  - Exact
  - Edit distance
  - Smith-Waterman
  - Q-grams
  - Jaro string comparator
  - Soundex code
  - TF-IDF
- PRE-PROCESSING
  - Conversion of upper/lower cases
  - Replacement of null strings
  - Standardization
  - Parsing
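Two of the comparison functions listed above can be sketched in a few lines. This is a minimal, standard-library illustration rather than the RELAIS implementation: `edit_distance` is the classic Levenshtein distance, and `qgram_similarity` uses the Jaccard overlap of q-gram sets, which is one common choice among several definitions.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def qgrams(s, q=2):
    """Set of overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Jaccard overlap of the two q-gram sets, between 0 and 1."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


print(edit_distance("ROSSI", "ROSI"))     # 1
print(qgram_similarity("ROSSI", "ROSI"))  # 0.75
```

Both functions tolerate the typing errors that make an exact comparison fail, at the price of more computation per pair.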
The Record Linkage Phases
Record linkage should be decomposed into its constituent phases as far as possible:
- 1. Pre-processing of the input files
- 2. Creation and reduction of the search space of candidate link pairs
- 3. Choice of the matching variables
- 4. Choice of the comparison function
- 5. Choice of the decision model
- 6. Selection of unique links
- 7. Record linkage evaluation
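The first five phases can be sketched as small functions chained together. Everything here is a hypothetical illustration: the function names, the record layout and the simple "agree on at least k variables" rule (standing in for a real decision model) are assumptions, and phases 6 and 7 are left out for brevity.

```python
from collections import defaultdict


def preprocess(record):
    """Phase 1: upper-case and trim strings, turn empty strings into None."""
    clean = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip().upper() or None
        clean[key] = value
    return clean


def candidate_pairs(file_a, file_b, block_key):
    """Phase 2: blocking; only records sharing the key value are paired."""
    blocks = defaultdict(list)
    for rec in file_b:
        if rec[block_key] is not None:
            blocks[rec[block_key]].append(rec)
    for rec in file_a:
        for other in blocks.get(rec[block_key], []):
            yield rec, other


def compare(rec_a, rec_b, variables):
    """Phases 3-4: equality comparison on the chosen matching variables."""
    return tuple(rec_a[v] is not None and rec_a[v] == rec_b[v] for v in variables)


def decide(pattern, min_agreements):
    """Phase 5: a deterministic rule standing in for a decision model."""
    return sum(pattern) >= min_agreements


census = [preprocess(r) for r in [
    {"id": "A1", "name": " maria ", "surname": "Rossi", "birth_month": "05"},
    {"id": "A2", "name": "Luca", "surname": "Bianchi", "birth_month": "11"},
]]
pes = [preprocess(r) for r in [
    {"id": "B1", "name": "MARIA", "surname": "ROSSI", "birth_month": "05"},
    {"id": "B2", "name": "Anna", "surname": "Verdi", "birth_month": "05"},
]]

links = [
    (a["id"], b["id"])
    for a, b in candidate_pairs(census, pes, "birth_month")
    if decide(compare(a, b, ("name", "surname")), min_agreements=2)
]
print(links)  # [('A1', 'B1')]
```

Keeping each phase in its own function is what makes it possible to swap, say, the equality comparison for a string distance without touching the rest.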
The ESSnet ISAD: Integration of Surveys and Administrative Data
The ESSnet and its focus: the aim of the project is to promote, throughout the ESS, knowledge and understanding of the statistical methodologies for the integration of two (or more) data sources.
Partners: the ESSnet ISAD, co-financed by Eurostat, started in December 2006 and ended in June 2008. The project involved 5 countries:
- ISTAT, Italy (scientific coordinator)
- STAT, Austria
- CZSO, Czech Republic
- CBS, Netherlands
- INE, Spain
RELAIS: The Idea
- There is no unique optimal solution to record linkage problems: for each phase, the most appropriate technique should be chosen
  - depending on application and data requirements, not only on the practitioner's skill
- An ad-hoc record linkage process (workflow) should be built dynamically
- RELAIS (REcord Linkage At IStat) is a toolkit serving this purpose
Record Linkage Workflows
[Diagram: two example workflows, RecLink WF Appl1 and RecLink WF Appl2, each assembled by choosing techniques for every phase from the options below]
- Preprocessing: UpperLowerCase, Normalization, Schema reconciliation
- Search Space Reduction: Blocking, SNM
- Comparison Function: Equality, Edit Distance, Jaro
- Decision Model: Probabilistic, Empirical
RELAIS Features
- Modular structure: each phase is planned as a module of the toolkit, with an explicit interface to the other modules
- Top-down design: this allows modules (phases) of the record linkage process to be omitted and/or iterated
- Advantages
  - dynamic composition of record linkage processes
  - parallel development of various techniques
  - design for Web service encapsulation, to permit remote invocation
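One way to picture the modular structure is to give every module the same interface, so that a workflow becomes a plain list of modules that can be reordered, omitted or repeated. The sketch below is an assumption-laden toy, not the RELAIS design: the shared `state` dictionary, the module names and the `run_workflow` helper are all hypothetical.

```python
from typing import Callable, Dict, List

# Common interface: a module takes the shared state and returns it updated.
Module = Callable[[Dict], Dict]


def upper_case(state):
    """Pre-processing module."""
    state["a"] = [s.upper() for s in state["a"]]
    state["b"] = [s.upper() for s in state["b"]]
    return state


def cross_product(state):
    """Search space module: every pair is a candidate."""
    state["pairs"] = [(x, y) for x in state["a"] for y in state["b"]]
    return state


def first_letter_blocking(state):
    """Search space module: pair only strings sharing the first letter."""
    state["pairs"] = [(x, y) for x in state["a"] for y in state["b"]
                      if x[:1] == y[:1]]
    return state


def equality(state):
    """Comparison and decision collapsed into one exact-match module."""
    state["links"] = [(x, y) for x, y in state["pairs"] if x == y]
    return state


def run_workflow(modules: List[Module], state: Dict) -> Dict:
    for module in modules:
        state = module(state)
    return state


def make_state():
    return {"a": ["Rossi", "bianchi"], "b": ["ROSSI", "Bianchi"]}


# Workflow 1 includes pre-processing; workflow 2 omits it.
wf1 = run_workflow([upper_case, first_letter_blocking, equality], make_state())
wf2 = run_workflow([cross_product, equality], make_state())
print(len(wf1["links"]), len(wf2["links"]))  # 2 0
```

Because modules only meet through the shared interface, a new technique can be developed and tested without changes to the others.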
RELAIS: An Open Source Project
- Results produced by the scientific community in recent years can be gathered and made available
  - 175 000 papers mentioning record linkage (Google Scholar)
- Techniques for each phase can be implemented and maintained very rapidly by relying on a community of developers
- RELAIS Implementation Choices
  - Java
  - R statistical language
RELAIS: the First Release
RELAIS 1.0
- SEARCH SPACE REDUCTION
  - Cross Product
  - Sorted Neighbourhood Method
  - Blocking
- 1:1 REDUCTION
  - Optimised Transportation Problem
- COMPARISON FUNCTION CHOICE
  - Equality
- DECISION MODEL CHOICE
  - Fellegi-Sunter
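The 1:1 reduction can be read as an assignment problem, a special case of the transportation problem: among the many-to-many candidate links, keep at most one partner per record so that the total matching weight is as large as possible. The sketch below solves a tiny instance by exhaustive search. It is a stand-in for illustration only, not the solver RELAIS uses, and the weights are invented.

```python
from itertools import permutations


def one_to_one(weights):
    """Pick at most one partner per record so the total weight is maximal.

    `weights[i][j]` is the matching weight of record i in file A with
    record j in file B; None marks a pair that is not a candidate link.
    Exhaustive search: only usable for a handful of records.
    """
    n_a, n_b = len(weights), len(weights[0])
    best_total, best_links = float("-inf"), []
    # Extra slots (index >= n_b) mean "no partner", so A records may stay unlinked.
    slots = range(n_b + n_a)
    for perm in permutations(slots, n_a):
        links = [(i, j) for i, j in enumerate(perm)
                 if j < n_b and weights[i][j] is not None]
        total = sum(weights[i][j] for i, j in links)
        if total > best_total:
            best_total, best_links = total, links
    return best_links


# Many-to-many candidates: A0-B0, A0-B1, A1-B0, A2-B1, A2-B2.
weights = [
    [9.0, 7.5, None],
    [8.0, None, None],
    [None, 6.0, 5.0],
]
print(one_to_one(weights))  # [(0, 1), (1, 0), (2, 2)]
```

A greedy rule would take the heaviest pair A0-B0 first and leave A1 unlinked; solving the problem globally links all three records with a higher total weight.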
RELAIS in the Italian and Spanish Experiences
- Common ideas and needs about the software (no ad-hoc solutions)
- Knowledge sharing and cooperation started in the ESSnet
- Evaluation of the adaptability of RELAIS to Spanish data integration problems as well
RELAIS in the Italian Tests
- A Scenario: the Data
  - Individual data from the 2001 Italian Census and the Post-Enumeration Survey (PES), about 180 000 records each.
  - Capture-recapture model to estimate the Census coverage rate, which assumes no matching errors in linking Census and PES records.
- Linkage was a very complex operation
  - deterministic and probabilistic approaches and clerical review
  - almost 15 matching variables
  - several working months.
- Due to the accuracy of the matching procedures adopted, we know the true linkage status of all candidate pairs.
RELAIS in the Italian Tests
A focus on Rome
- Size of the PES and Census (CEN) files: about 8 000 units each
- Cartesian product CEN x PES: more than 72 250 000 pairs (expected link probability 0.0001)
- 1st linkage pass
  - Blocking variable: month of birth of the household head
  - Matching variables: name, surname, gender, day, month and year of birth
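The size of the search space can be checked with a little arithmetic. The figure of 72 250 000 pairs equals 8 500 squared, so the sketch assumes files of 8 500 records each. It further assumes, purely for illustration, that every record has one true match and that months of birth are spread evenly over 12 blocks; real blocks are uneven, so the blocked figure is only an order of magnitude.

```python
def cross_product_pairs(n_a, n_b):
    """Number of candidate pairs when every record is compared with every other."""
    return n_a * n_b


def blocked_pairs(n_a, n_b, n_blocks):
    """Pairs left if both files split evenly over n_blocks blocks (idealised)."""
    return n_blocks * (n_a / n_blocks) * (n_b / n_blocks)


n = 8_500  # assumed file size: 8 500 ** 2 reproduces the figure on the slide
total = cross_product_pairs(n, n)
print(total)                               # 72250000
print(round(n / total, 6))                 # share of pairs that are true links
print(round(blocked_pairs(n, n, 12)))      # pairs left after blocking on month
```

Under these assumptions blocking on month of birth cuts the comparisons by a factor of 12, and the share of true links among all pairs is of the order of 0.0001, in line with the expected link probability quoted above.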
RELAIS in the Italian Tests
Results of the 1st linkage step
- Match rate: 88%
- False match rate: 0.5%
- False non-match rate: 12%
The software also provides results at the block level.
MATCH RATE TOO LOW IN A COVERAGE CONTEXT
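When the true linkage status of every pair is known, rates like these can be computed by comparing the declared links with the true matches. The definitions used below are assumptions, since the slides do not spell them out: match rate as the share of true matches that were linked, false non-match rate as its complement, and false match rate as the share of declared links that are wrong. The pairs are toy data.

```python
def linkage_rates(declared_links, true_matches):
    """Quality rates from declared links and known true matches.

    Both arguments are non-empty collections of (id_a, id_b) pairs.
    """
    declared, truth = set(declared_links), set(true_matches)
    return {
        "match_rate": len(declared & truth) / len(truth),
        "false_match_rate": len(declared - truth) / len(declared),
        "false_non_match_rate": len(truth - declared) / len(truth),
    }


truth = [("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")]
declared = [("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A5", "B9")]
rates = linkage_rates(declared, truth)
print(rates)
```

With these definitions the match rate and the false non-match rate always add up to 1, as the 88% and 12% reported for the first step do.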
RELAIS in the Italian Tests
2nd linkage pass
- Residuals of the 1st step: about 1 500 units in each file, mainly records with a missing value in the blocking variable of the 1st step
- Expected link probability: 0.0003
- Cartesian product again not recommended
- Blocking by means of the Sorted Neighbourhood Method
  - Sorting variable: first letter of surname
  - Window size: 450 (frequency of the most common first letter: 250)
- Matching variables: name, surname, day, month and year of birth
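The Sorted Neighbourhood Method can be sketched as follows: pool the two files, sort on the sorting variable, slide a fixed-size window over the sorted list, and keep only the pairs that fall in the same window and come from different files. This is a minimal illustration with toy surnames and a window of 2, not the RELAIS code; the function name and arguments are hypothetical.

```python
def sorted_neighbourhood_pairs(file_a, file_b, key, window):
    """Candidate pairs (a, b) lying within `window` positions of each other
    after sorting the pooled files on `key`."""
    pooled = [("A", rec) for rec in file_a] + [("B", rec) for rec in file_b]
    pooled.sort(key=lambda item: key(item[1]))
    pairs = set()
    for i, (src_i, rec_i) in enumerate(pooled):
        # Each record is compared with the next window - 1 records.
        for src_j, rec_j in pooled[i + 1:i + window]:
            if src_i != src_j:
                a, b = (rec_i, rec_j) if src_i == "A" else (rec_j, rec_i)
                pairs.add((a, b))
    return pairs


census = ["ROSSI", "BIANCHI", "VERDI"]
pes = ["RUSSO", "BRUNO", "ZANELLA"]
pairs = sorted_neighbourhood_pairs(census, pes, key=lambda s: s[0], window=2)
print(len(pairs), "pairs instead of", len(census) * len(pes))
```

A window of 450 against a most common first letter occurring 250 times presumably ensures that records sharing a first letter can always meet inside one window.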
RELAIS in the Italian Tests
Results of the overall linkage procedure (1st plus 2nd step)
- Match rate: 98.5%
- False match rate: 0.8%
- False non-match rate: 2.3%
- Working time: less than 2 hours
RELAIS in the Italian Tests
Rome PES Workflow (RELAIS 1.0)
[Diagram: options shown for each phase, with the path followed in step 2]
- Search Space Reduction: Blocking, SNM, Cross Product
- Comparison Function: Edit Distance, Jaro-Winkler, Equality
- Decision Model: Probabilistic
- Linking Type: 1:1, Many:Many
- Step 2: SNM, Equality, Probabilistic, 1:1
RELAIS in the Spanish Tests
- A Scenario: the Data
  - Individual data from the Living Conditions Survey (LCS) and the Central Population Register (CPR)
  - 1st main objective: obtain ID numbers for the LCS
  - 2nd main objective: compare the RELAIS results with ad-hoc procedures
- Linkage was a very complex operation
  - only name and geographical variables were available
  - large amount of data.
- Blocking on geographical area variables
RELAIS in the Spanish Tests
- Weaknesses of RELAIS 1.0
  - difficulties in managing a large number of blocks
  - difficulties in dealing with different probability estimates in each block
  - difficulties in writing the largest output files
- Strengths of RELAIS 1.0
  - efficacy of the implemented probabilistic method
  - noticeable flexibility in modifying and adapting the implemented functionalities (reduction from M:N to 1:1)
Towards RELAIS 2.0
- A relational database architecture, to optimize performance in managing huge amounts of data through the whole record linkage process (input, intermediate phases and output).
- Several distance functions for string and numerical comparisons (not only equality).
- Exact and deterministic decision models, to be used either as alternatives to or in conjunction with the probabilistic model.
- A data profiling phase to help the user in the critical task of choosing the best blocking or matching variables.
- One-shot execution to deal with a large number of blocks.
- RELAIS 2.0 is now being tested and will be available from May 2009.
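As an illustration of what a data profiling step might report, the sketch below computes two simple indicators per variable: completeness (share of non-missing values) and the number of distinct values. Which indicators RELAIS 2.0 actually computes is not stated in the slides, so the `profile` function and its output are hypothetical; the records are toy data.

```python
def profile(records, variables):
    """Completeness and distinct-value count for each variable."""
    report = {}
    for var in variables:
        values = [rec.get(var) for rec in records]
        present = [v for v in values if v not in (None, "")]
        report[var] = {
            "completeness": len(present) / len(values),
            "distinct": len(set(present)),
        }
    return report


records = [
    {"surname": "ROSSI", "gender": "F", "birth_month": "05"},
    {"surname": "BIANCHI", "gender": "M", "birth_month": ""},
    {"surname": "VERDI", "gender": "F", "birth_month": "11"},
    {"surname": "RUSSO", "gender": "M", "birth_month": "05"},
]
report = profile(records, ["surname", "gender", "birth_month"])
for var, stats in report.items():
    print(var, stats)
```

A variable with many missing values makes a poor blocking key, because records without a value never reach the right block; a variable with very few distinct values produces blocks that are too large to help.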
Concluding Remarks
- Profitable experiences in cooperation between NSIs.
- The open-source philosophy and the move away from ad-hoc approaches proved a winning choice.
- Common nature of the problems and needs of NSIs in data integration projects.
- New challenge
  - Add to RELAIS methods for evaluating record linkage quality.
RELAIS Availability and Contacts
RELAIS 1.0 is available on the website www.istat.it
RELAIS 2.0 will be available in May 2009
RELAIS Contacts
- Nicoletta Cibella, Statistician, e-mail: cibella_at_istat.it
- Tiziana Tuoto, Statistician, e-mail: tuoto_at_istat.it