1
NTTS 2009 Brussels 18-20 February 2009
Sharing Solution for Record Linkage: the RELAIS
Software and the Italian and Spanish Experiences
  • Nicoletta Cibella (1), Gervasio-Luis Fernandez (2),
    Marco Fortini (1),
  • Miguel Guigò (2), Francisco Hernandez (2), Monica
    Scannapieco (1),
  • Laura Tosco (1), Tiziana Tuoto (1)
  • (1) Italian National Statistical Institute (ISTAT),
    Italy
  • (2) Spanish National Statistical Institute (INE),
    Spain

2
Theory and Practice in Developing a Record
Linkage Software
Outline
  • The Record Linkage
  • The ESSnet on ISAD
  • The Idea and the Features of the RELAIS Software
  • The Italian and Spanish Experiences in using
    RELAIS
  • Throughout RELAIS 2.0
  • Conclusions

Nicoletta Cibella, Brussels, 19th February
2009
3
Record Linkage
The purpose of record linkage is to identify the
same real-world entity, which can be represented
differently in different data sources.
Different approaches to deal with record linkage:
  • Exact RL
  • Deterministic RL
  • Probabilistic RL (Fellegi and Sunter theory)
  • Bayesian RL
  • Machine Learning
  • Knowledge Representation
No particular technique has emerged as the best
solution for all cases (maybe because such a
solution does not exist).
4
Record Linkage Complexity
The record linkage techniques are a
multidisciplinary set of methods and practices:
DECISION MODEL CHOICE
  • Fellegi-Sunter
  • Deterministic
  • Bayesian
  • Knowledge based
  • Mixed
SEARCH SPACE REDUCTION
  • Sorted Neighbourhood Method
  • Blocking
  • Hierarchical Grouping
COMPARISON FUNCTION CHOICE
  • Exact
  • Edit distance
  • Smith-Waterman
  • Q-grams
  • Jaro string comparator
  • Soundex code
  • TF-IDF
PRE-PROCESSING
  • Conversion of upper/lower cases
  • Replacement of null strings
  • Standardization
  • Parsing
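Two of the comparison functions listed above, edit distance and q-grams, can be sketched in a few lines of Python. This is an illustration only, not RELAIS code: the function names, the default q = 2 with "#" padding, and the Dice-style normalisation are assumptions made for this example.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of q-grams of s, padded so first and last characters count."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice-style overlap of the two q-gram multisets, between 0 and 1."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 1.0
    shared = sum((ga & gb).values())
    return 2 * shared / total
```

For instance, `edit_distance("ROSSI", "ROSI")` is 1 (one deleted letter), while an equality comparison would simply report a disagreement for the same pair.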
5
The Record Linkage Phases
  • Record linkage should be decomposed into its
    constituent phases as much as possible
  • 1. Pre-processing of the input files
  • 2. Creation and reduction of the search space of
    link candidate pairs
  • 3. Choice of the matching variables
  • 4. Choice of the comparison function
  • 5. Choice of the decision model
  • 6. Selection of unique links
  • 7. Record linkage evaluation
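As an illustration of phase 1, the sketch below applies three typical pre-processing steps to a single field: case conversion, replacement of null strings, and a simple standardization (whitespace and accents). The `NULL_TOKENS` set and the `preprocess` function are hypothetical names chosen for this example; the right rules depend on the input files.

```python
import unicodedata

# Hypothetical list of strings to be treated as missing values.
NULL_TOKENS = {"", "NULL", "N/A", "NA", "-"}


def preprocess(value):
    """Return a standardised upper-case string, or None for null-like values."""
    if value is None:
        return None
    # Collapse runs of whitespace and convert to upper case.
    text = " ".join(str(value).split()).upper()
    if text in NULL_TOKENS:
        return None
    # Strip accents so that "GUIGÒ" and "GUIGO" compare as equal.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Running the same function over both input files before comparison keeps differences in case, spacing or accents from being counted as disagreements.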
6
The ESSnet ISAD: Integration of Surveys and
Administrative Data
The ESSnet and its focus
The aim of the project is to raise, across the
whole ESS, knowledge and understanding of the
statistical methodologies for the integration of
two (or more) data sources.
Partners
The ESSnet ISAD, co-financed by Eurostat, started
in December 2006 and ended in June 2008. The
project involved 5 countries:
  • ISTAT, Italy (scientific coordinator)
  • STAT, Austria
  • CZSO, Czech Republic
  • CBS, Netherlands
  • INE, Spain
7
RELAIS: The Idea
  • There is no single optimal solution for record
    linkage problems: for each phase the most
    appropriate technique should be chosen,
    depending on application and data requirements,
    not only on the practitioner's skill
  • An ad-hoc record linkage process (workflow) should
    be dynamically built
  • RELAIS (REcord Linkage At IStat)
    is a toolkit serving such a purpose
8
Record Linkage Workflows
[Figure: two record linkage workflows, RecLink WF Appl1 and RecLink WF Appl2, composed from the techniques available in each phase]
Preprocessing: UpperLowerCase, Normalization, Schema reconciliation
Search Space Reduction: Blocking, SNM
Comparison Function: Equality, Edit Distance, Jaro
Decision Model: Probabilistic, Empirical
9
RELAIS Features
  • Modular structure: each phase is planned as a
    module of the toolkit, with an explicit interface
    to the other modules
  • Top-down design: modules (phases) of the record
    linkage process can be omitted and/or iterated
  • Advantages
    - dynamic composition of record linkage processes
    - parallel development of various techniques is
      allowed
    - design for Web service encapsulation in order to
      permit remote invocation
10
RELAIS: An Open Source Project
  • Results produced by the scientific community in
    recent years can be gathered and made available
    - 175 000 papers mentioning record linkage
      (Google Scholar)
  • Techniques for each phase can be implemented and
    maintained very rapidly by relying on a community
    of developers
  • RELAIS Implementation Choices
    - Java
    - R statistical language
11
RELAIS: the First Release
RELAIS 1.0
SEARCH SPACE REDUCTION
  • Cross Product
  • Sorted Neighbourhood Method
  • Blocking
1:1 REDUCTION
  • Optimised Transportation Problem
COMPARISON FUNCTION CHOICE
  • Equality
DECISION MODEL CHOICE
  • Fellegi-Sunter
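To make the combination of an equality comparison with the Fellegi-Sunter decision model concrete, here is a minimal Python sketch. It is not the RELAIS implementation: the m- and u-probabilities in `PARAMS`, the variable names and the thresholds are hypothetical, and in practice the probabilities have to be estimated from the data.

```python
import math

# Hypothetical parameters per matching variable:
# m = P(values agree | pair is a match), u = P(values agree | pair is a non-match).
PARAMS = {
    "name": (0.95, 0.01),
    "surname": (0.97, 0.005),
    "year_of_birth": (0.98, 0.02),
}


def comparison_vector(rec_a, rec_b, variables):
    """Equality comparison: 1 if the two values agree, else 0.
    Note that two missing values (None) would count as agreeing here."""
    return {v: int(rec_a[v] == rec_b[v]) for v in variables}


def match_weight(gamma, params):
    """Sum of log2 likelihood ratios over the comparison vector."""
    weight = 0.0
    for var, agrees in gamma.items():
        m, u = params[var]
        ratio = m / u if agrees else (1 - m) / (1 - u)
        weight += math.log2(ratio)
    return weight


def classify(weight, lower, upper):
    """Three-way decision with two thresholds, as in the Fellegi-Sunter theory."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"
```

Pairs that agree on every variable receive a large positive weight, pairs that disagree everywhere a large negative one; the pairs between the two thresholds are those left for clerical review.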
13
RELAIS in the Italian and Spanish Experiences
  • Common ideas and needs about the software (no
    ad-hoc solutions)
  • Knowledge sharing and cooperation started in the
    ESSnet
  • Evaluation of the adaptability of RELAIS, so that
    it can also solve Spanish data integration problems
14
RELAIS in the Italian Tests
  • A Scenario: the Data
    - Individual data from the 2001 Italian Census and
      the Post Enumeration Survey (PES), about 180 000
      records each.
    - Capture-recapture model to estimate the Census
      Coverage Rate, which assumes no matching errors
      in linking Census and PES records.
  • Linkage was a very complex operation
    - deterministic and probabilistic approaches and
      clerical review
    - almost 15 matching variables
    - several working months.
  • Due to the accuracy of the matching procedures
    adopted, we know the true linkage status of all
    candidate pairs.
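The slide does not say which capture-recapture model was used. As an assumption, the sketch below uses the simplest one, the Lincoln-Petersen (dual-system) estimator, to show why matching errors matter: the number of matched records sits in the denominator of the population estimate. Function names and the figures in the usage note are invented for illustration.

```python
def dual_system_estimate(census_count, pes_count, matched_count):
    """Lincoln-Petersen estimate of the true population size:
    N_hat = n_census * n_pes / n_matched."""
    if matched_count <= 0:
        raise ValueError("at least one matched record is needed")
    return census_count * pes_count / matched_count


def census_coverage_rate(census_count, pes_count, matched_count):
    """Share of the estimated population that the census enumerated."""
    n_hat = dual_system_estimate(census_count, pes_count, matched_count)
    return census_count / n_hat
```

With made-up counts of 9 500 census records, 1 000 PES records and 950 matches, the estimated population is 10 000 and the coverage rate 0.95. If 50 true matches were missed, so that only 900 were found, the same formula would give a coverage rate of 0.90: missed links translate directly into an understated coverage rate.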
15
RELAIS in the Italian Tests
A focus on Rome
Size of PES and CEN files: about 8 000 units each
Cartesian product CEN x PES: more than 72 250 000 pairs
(expected link probability 0.0001)
1st Linkage Pass
Blocking variable: month of birth of the household head
Matching variables: name, surname, gender,
day-month-year of birth
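The effect of blocking on the search space can be checked with a little arithmetic and a short sketch. Note that 72 250 000 is exactly 8 500 x 8 500, so the code assumes files of about 8 500 records each; the equal-sized month blocks are an idealisation, and `blocked_pairs` is a hypothetical helper, not a RELAIS function.

```python
from collections import defaultdict

# Assumption: 72 250 000 = 8 500 x 8 500, i.e. about 8 500 records per file.
n_cen = n_pes = 8_500
cross_product_pairs = n_cen * n_pes  # 72_250_000

# Idealised case: 12 equally sized month-of-birth blocks in both files.
blocked_pairs_ideal = 12 * (n_cen / 12) * (n_pes / 12)  # about 6.02 million


def blocked_pairs(file_a, file_b, key):
    """Yield candidate pairs that share the same non-missing blocking key."""
    blocks = defaultdict(list)
    for rec in file_b:
        if rec.get(key) is not None:
            blocks[rec[key]].append(rec)
    for rec_a in file_a:
        k = rec_a.get(key)
        if k is not None:
            for rec_b in blocks.get(k, []):
                yield rec_a, rec_b
```

Under these assumptions blocking on month of birth cuts the candidate pairs by a factor of about 12. Records with a missing blocking key never enter any block, so they can only be linked in a later pass.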
16
RELAIS in the Italian Tests
Results of the 1st Linkage Step
  • Match Rate: 88%
  • False Match Rate: 0.5%
  • False Non-Match Rate: 12%
The software also provides results at the block level.
MATCH RATE TOO LOW IN A COVERAGE CONTEXT
17
RELAIS in the Italian Tests
2nd Linkage Pass
Residuals of the 1st step: about 1 500 units in each
file, mainly composed of records with a missing value
in the blocking variable of the 1st step
Expected link probability: 0.0003
Cartesian product again not recommended
Blocking procedure by means of the Sorted
Neighbourhood Method
Sorting variable: first letter of surname
Window size: 450 (frequency of the most common first
letter: 250)
Matching variables: name, surname, day-month-year of
birth
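A minimal sketch of the Sorted Neighbourhood Method for two files follows: the records are merged, sorted on a key, and only records that fall within the same sliding window are compared. The function name and record layout are assumptions for this example; RELAIS may implement the method differently.

```python
def sorted_neighbourhood_pairs(file_a, file_b, key, window):
    """Yield (a, b) pairs whose positions in the merged, sorted list
    are fewer than `window` places apart."""
    tagged = [("A", r) for r in file_a] + [("B", r) for r in file_b]
    tagged.sort(key=lambda t: key(t[1]))
    for i, (src_i, rec_i) in enumerate(tagged):
        # The window starting at position i holds `window` records in total.
        for src_j, rec_j in tagged[i + 1:i + window]:
            if src_i != src_j:  # only compare records from different files
                yield (rec_i, rec_j) if src_i == "A" else (rec_j, rec_i)
```

With the first letter of the surname as sorting key, `key` would be something like `lambda r: r["surname"][0]`. Many records then share the same key value, which is presumably why the window (450) was chosen larger than the frequency of the most common first letter (250).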
18
RELAIS in the Italian Tests
Results of the Overall Linkage Procedure (1st plus
2nd steps)
  • Match Rate: 98.5%
  • False Match Rate: 0.8%
  • False Non-Match Rate: 2.3%
  • Working Time: less than 2 hours
19
RELAIS in the Italian Tests
Rome PES Workflow (RELAIS 1.0)
[Figure: workflow diagram; the options shown for each phase are listed below]
Search Space Reduction: Blocking, SNM, Cross Product
Comparison Function: Edit Distance, Jaro-Winkler, Equality
Decision Model: Probabilistic
Linking Type: 1:1, Many:Many
Step 2: SNM, Equality, Probabilistic, 1:1
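The 1:1 linking type means that each record ends up linked to at most one record in the other file. The first-release slide names an optimised transportation problem for this reduction; the sketch below swaps in an exhaustive search over assignments, which gives the same optimum but is only practical for a handful of records. It assumes positive weights for all candidate links, and `one_to_one` is a hypothetical name.

```python
from itertools import permutations


def one_to_one(weights):
    """weights: dict {(a_id, b_id): positive weight} of candidate links.
    Returns the 1:1 subset of links with the largest total weight.
    Exhaustive search: for illustration on tiny inputs only."""
    a_ids = sorted({a for a, _ in weights})
    b_ids = sorted({b for _, b in weights})
    # Pad with None so that some a-records may stay unlinked.
    b_ids += [None] * max(0, len(a_ids) - len(b_ids))
    best_total, best_links = float("-inf"), []
    for perm in permutations(b_ids, len(a_ids)):
        # Keep only assignments that are actual candidate links.
        links = [(a, b) for a, b in zip(a_ids, perm) if (a, b) in weights]
        total = sum(weights[link] for link in links)
        if total > best_total:
            best_total, best_links = total, links
    return best_links
```

With candidate links `{("c1", "p1"): 12.0, ("c1", "p2"): 9.0, ("c2", "p1"): 11.0}`, picking the single heaviest link first would leave `c2` unlinked, whereas the optimal 1:1 solution links `c1` to `p2` and `c2` to `p1`.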
20
RELAIS in the Spanish Tests
  • A Scenario: the Data
    - Individual data from the Living Conditions Survey
      (LCS) and the Central Population Register (CPR)
    - 1st Main Objective: obtain ID numbers for the LCS
    - 2nd Main Objective: compare the RELAIS results
      with ad-hoc procedures
  • Linkage was a very complex operation
    - only name and geographical variables were
      available
    - large amount of data.
  • Blocking on geographic area variables
21
RELAIS in the Spanish Tests
  • Weaknesses of RELAIS 1.0
    - difficulties in managing a large number of blocks
    - difficulties in dealing with different
      probability estimates in each block
    - difficulties in writing the largest output files
  • Strengths of RELAIS 1.0
    - efficacy of the implemented probabilistic method
    - noticeable flexibility in modifying/adapting the
      implemented functionalities (reduction from M:N
      to 1:1)
22
Throughout RELAIS 2.0
  • A relational database architecture, in order to
    optimize performance when managing huge amounts
    of data throughout the whole record linkage
    process (input, intermediate phases and output).
  • Several distance functions for string and
    numerical comparisons (not only the equality
    one).
  • Exact and deterministic decision models, to be
    used either as alternatives to or in conjunction
    with the probabilistic model.
  • A data profiling phase to help the user in the
    critical phases of choosing the best blocking or
    matching variables.
  • One-shot execution to deal with a large number of
    blocks.
  • RELAIS 2.0 is now being tested and will be
    available from May 2009
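The slides do not describe which indicators the data profiling phase reports. As a purely hypothetical sketch, the function below computes three figures that would plausibly help in choosing a blocking variable: completeness, number of distinct values, and the number of candidate pairs that blocking on the variable would generate.

```python
from collections import Counter


def profile_variable(file_a, file_b, var):
    """Profile `var` as a candidate blocking variable (hypothetical indicators)."""
    known_a = Counter(r.get(var) for r in file_a if r.get(var) is not None)
    known_b = Counter(r.get(var) for r in file_b if r.get(var) is not None)
    total = len(file_a) + len(file_b)
    filled = sum(known_a.values()) + sum(known_b.values())
    return {
        # Share of records with a non-missing value.
        "completeness": filled / total if total else 0.0,
        # Number of different values, i.e. potential blocks.
        "distinct_values": len(set(known_a) | set(known_b)),
        # Candidate pairs generated if `var` were the blocking key.
        "blocked_pairs": sum(n * known_b[v] for v, n in known_a.items()),
    }
```

A good blocking variable, on this view, is nearly complete and splits the files into many small blocks; a variable with many missing values pushes records into a residual file, as happened in the first Rome linkage pass.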
23
Concluding Remarks
  • Profitable experiences in cooperation between
    NSIs.
  • Winning choice of the open-source philosophy and
    of moving beyond ad-hoc approaches.
  • Common nature of the problems and needs of NSIs in
    data integration projects.
  • New Challenge
    - Add to RELAIS methods for evaluating record
      linkage quality.
24
RELAIS Availability and Contacts
Relais 1.0 is available on the website www.istat.it
Relais 2.0 will be available in May 2009
RELAIS Contacts
  • Nicoletta Cibella, Statistician, E-mail:
    cibella_at_istat.it
  • Tiziana Tuoto, Statistician, E-mail:
    tuoto_at_istat.it