Title: Selecting Records, Maintaining Uniqueness, and Minimizing Duplication in an Immunization Registry
1Selecting Records, Maintaining Uniqueness, and
Minimizing Duplication in an Immunization Registry
- Robert Rosofsky and Jonathan Mosley
- Massachusetts Immunization Information System
- Massachusetts Department of Public Health
- 305 South Street, 5th Floor
- Jamaica Plain, MA 02130
- 617-983-6836, Fax 617-983-6926
- Robert.Rosofsky_at_state.ma.us
- April, 1999
2MIIS Data Sources
- Birth records, providers, local IISs, health networks
- Quality and completeness of data from each source vary greatly
3MIIS Identifier Fields
- Mandatory Fields
- First name
- Last name
- Date of birth
- Sex
- Preferred Fields
- Middle name
- Mother's maiden name
- Mother's date of birth
- Birth order
- Birth facility
4MIIS Data Processing Challenges
- Aggregate data from multiple sources into a single history
- Prevent duplicate records in the central database
- Process large volumes of data on a daily basis with minimal manual intervention
5Matching Example 1
Field                   Record 1    Record 2
First Name              Jasmine     Jasmine
Middle Name             D           Danyelle
Last Name               Hendrix     Hendrix
Date of Birth           01/01/95    01/01/95
Sex                     F           F
Mother's DOB            10/10/50    10/10/50
Birth Order (0), Mother's Maiden Name (Anderson), and Birthplace Code (1234) are present in only one of the two records.
6Matching Example 2
Field                   Record 1    Record 2
First Name              Jasmine     Jazmynn
Middle Name             Danyelle    D
Last Name               Hendrix     Hendricks
Date of Birth           01/01/95    01/01/95
Sex                     F           F
Mother's Maiden Name    Anderson    Abdnerson
Mother's DOB            10/10/50    10/10/50
Birth Order (0) and Birthplace Code (1234) are present in only one of the two records.
7Basic Approach
- Prevention! Avoid duplicates in the database
- The database is searched before new records are inserted
- Compare received records to records already in the database
- Match? Merge the records
- No match? Insert a new record into the database
8Assumptions and Concepts
- Records can be electronically linked despite differences between them
- The more information that two records have in common, the greater the likelihood the two records match
- Degree of similarity between records can be expressed numerically
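9Using a Matching Score: Thresholds
[Figure: score scale running from 0.0 to 1.0, divided by a Unique Threshold and a Duplicate Threshold]
- Score above the Duplicate Threshold (up to 1.0): records are the same
- Score between the Unique Threshold and the Duplicate Threshold: require manual resolution
- Score below the Unique Threshold (down to 0.0): records are unique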
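The three zones on this slide can be sketched as a small decision function. This is only an illustration: the slides do not give the actual threshold values or say how a score exactly on a threshold is handled, so the defaults `0.60` and `0.90`, the inclusive comparisons, and the function name `classify` are all assumptions.

```python
def classify(score, unique_threshold=0.60, duplicate_threshold=0.90):
    """Map a matching score in [0.0, 1.0] to one of three outcomes.

    Threshold values are hypothetical placeholders, not MIIS settings.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0.0 and 1.0")
    if score >= duplicate_threshold:
        return "same"      # treat as the same person: merge
    if score <= unique_threshold:
        return "unique"    # treat as a new person: insert
    return "manual"        # in between: route to manual resolution
```

Keeping the two thresholds as parameters fits the later point that the procedures are parameter driven and can be tuned per data source.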
10Fundamental Principles
- A person querying or supplying data to the database would provide all the identifying information that s/he knew about an individual
- The computer makes a best guess as to which database record most closely matches the received record
11Computing a Matching Score: Simple Proportion
Field                   Record 1    Record 2    Match
First Name              Jasmine     Jazmynn     no
Middle Name             Danielle    D           no
Last Name               Hendrix     Hendricks   no
Date of Birth           01/01/95    01/01/95    yes
Sex                     F           F           yes
Birth Order             0           0           yes
Mother's Maiden Name    Anderson    Abdnerson   no
Mother's DOB            10/10/50    10/10/50    yes
SCORE = 4 matching fields / 8 fields compared = 0.50
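As a minimal sketch of the simple proportion, the example records on this slide can be compared field by field, counting only exact matches. The dictionary keys and the function name `simple_proportion` are illustrative assumptions, not MIIS field names.

```python
# Values copied from the slide; key names are hypothetical.
record_1 = {
    "first": "Jasmine", "middle": "Danielle", "last": "Hendrix",
    "dob": "01/01/95", "sex": "F", "birth_order": "0",
    "mothers_maiden": "Anderson", "mothers_dob": "10/10/50",
}
record_2 = {
    "first": "Jazmynn", "middle": "D", "last": "Hendricks",
    "dob": "01/01/95", "sex": "F", "birth_order": "0",
    "mothers_maiden": "Abdnerson", "mothers_dob": "10/10/50",
}

def simple_proportion(a, b):
    """Fraction of shared fields whose values are exactly equal."""
    fields = [name for name in a if name in b]
    matches = sum(1 for name in fields if a[name] == b[name])
    return matches / len(fields)

simple_score = simple_proportion(record_1, record_2)
```

Four of the eight fields agree exactly (date of birth, sex, birth order, mother's date of birth), so the all-or-nothing score is 0.5 even though the three name fields are clearly close.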
12Computing a Matching Score: Weighted Proportions
- Not all data elements are equally informative.
- Fields with extensive variation are generally the most informative.
- It is desirable to give these informative fields more weight when computing a score.
13Matching Score: A Weighted Proportion
Field                   Record 1    Record 2    Weight   Match
First                   Jasmine     Jazmynn     1.2      no
Middle                  Danielle    D           0.6      no
Last                    Hendrix     Hendricks   1.5      no
Birth Date              01/01/95    01/01/95    1.7      yes
Sex                     F           F           0.3      yes
Birth Order             0           0           0.2      yes
Mother's Maiden         Anderson    Abdnerson   1.0      no
Mother's DOB            10/10/50    10/10/50    1.5      yes
WEIGHTED SCORE = 3.7 (sum of matching weights) / 8.0 (sum of all weights) = 0.46
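The weighted proportion can be sketched the same way, using the weights listed on this slide. The tuple layout and the name `weighted_proportion` are assumptions made for illustration; only exact matches earn their weight here, since partial scoring of names is introduced later.

```python
# (field, record 1 value, record 2 value, weight), copied from the slide
FIELDS = [
    ("first", "Jasmine", "Jazmynn", 1.2),
    ("middle", "Danielle", "D", 0.6),
    ("last", "Hendrix", "Hendricks", 1.5),
    ("birth_date", "01/01/95", "01/01/95", 1.7),
    ("sex", "F", "F", 0.3),
    ("birth_order", "0", "0", 0.2),
    ("mothers_maiden", "Anderson", "Abdnerson", 1.0),
    ("mothers_dob", "10/10/50", "10/10/50", 1.5),
]

def weighted_proportion(fields):
    """Sum of weights of exactly matching fields over sum of all weights."""
    total = sum(weight for _, _, _, weight in fields)
    matched = sum(weight for _, a, b, weight in fields if a == b)
    return matched / total

weighted_score = weighted_proportion(FIELDS)
```

The matching fields carry weights 1.7 + 0.3 + 0.2 + 1.5 = 3.7 out of a total of 8.0, giving roughly 0.46.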
14Matching Score: Log Transformations
- Only those fields present in both records are compared
- Score is adjusted to reflect the number of fields used in its calculation
- Adjustment is made with a log transformation, using the number of fields used in the comparison as the base
15Log-Transformations
                    Base (n)
Score    3       4       5       6       7       8
0.50     0.37    0.50    0.57    0.61    0.64    0.67
0.60     0.54    0.63    0.68    0.71    0.74    0.75
0.70     0.68    0.74    0.78    0.80    0.82    0.83
0.80     0.80    0.84    0.86    0.88    0.89    0.89
0.90     0.90    0.92    0.93    0.94    0.95    0.95
1.00     1.00    1.00    1.00    1.00    1.00    1.00
Transformed score $= 1 + \log_n(\text{Score}) = \log_n(n \times \text{Score})$
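The transformation on this slide, 1 + log base n of the score, can be checked against the table with a few lines of code. The function name `log_transform` is an assumption; the formula itself is the one stated on the slide.

```python
import math

def log_transform(score, n):
    """Return 1 + log_n(score), where n is the number of fields compared.

    The score must be greater than 0; a score of 1.0 is left unchanged.
    """
    if score <= 0:
        raise ValueError("score must be positive")
    return 1 + math.log(score, n)

# Rebuild the row for an untransformed score of 0.50 with bases 3 through 8.
row = [round(log_transform(0.50, n), 2) for n in range(3, 9)]
```

For a fixed raw score, a larger n pulls the transformed score upward, so agreement across many compared fields is penalized less than the same proportion computed from only a few.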
16Scoring Equations
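Simple proportion*:
$$\text{Score} = \frac{\text{number of matching fields}}{n}$$
Weighted proportion*:
$$\text{Score} = \frac{\sum_{i=1}^{n} w_i m_i}{\sum_{i=1}^{n} w_i}$$
Log-transformed weighted proportion*:
$$\text{Score} = 1 + \log_n\!\left(\frac{\sum_{i=1}^{n} w_i m_i}{\sum_{i=1}^{n} w_i}\right)$$
where $n$ = number of nonblank comparisons, $w_i$ = weight of field $i$, and $m_i$ = 1 if field $i$ matches and 0 otherwise.
* Ignores partial scoring of fields.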
17Assigning a Field Score: Names
- Due to the redundancy inherent in many names, it is undesirable to score them in an all-or-nothing manner.
- Example: joxathan
- The MIIS uses two methods to assign a partial score to name fields that are not identical:
- Approximate Match Method
- NYSIIS scoring
18Name Strings: Approximate Match Method
- Names differ because of random differences (e.g. typos, ignorance of true spelling)
- Count the minimum number of character insertions, deletions and changes required to transform one name string into the other
19Assigning a Field Score: Names (Cont'd)
I - D - C = insertions - deletions - changes
Name 1        Name 2         I - D - C    Score
Jonathan      Jonathan       0 - 0 - 0    1.0
Hendrix       Hendericks     3 - 0 - 1    0.78
Rowsofskie    Rosofsky       0 - 2 - 1    0.85
Smith         Smith-Jones    6 - 0 - 0    0.67
McCarthur     Mac Arthur     1 - 1 - 0    0.90
John          Joan           0 - 0 - 1    0.79
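The insertion, deletion and change counts in this table can be approximated with a standard edit-distance (Levenshtein) computation that also tracks which operations were used. This is a sketch under assumptions: the slides do not say how the MIIS breaks ties, handles case or spaces, or converts the counts into a score, so the function `edit_counts` below reports counts only and may not reproduce every row (for example, McCarthur / Mac Arthur).

```python
def edit_counts(source, target):
    """Return (insertions, deletions, changes) for one minimal edit script
    that transforms source into target. Comparison is case-sensitive."""
    # prev[j] = (total, ins, dels, chg) for source[:i-1] -> target[:j]
    prev = [(j, j, 0, 0) for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        curr = [(i, 0, i, 0)]  # delete the first i characters of source
        for j, t_ch in enumerate(target, start=1):
            total, ins, dels, chg = prev[j - 1]
            if s_ch == t_ch:
                best = (total, ins, dels, chg)          # characters agree
            else:
                best = (total + 1, ins, dels, chg + 1)  # change s_ch to t_ch
            total, ins, dels, chg = curr[j - 1]
            best = min(best, (total + 1, ins + 1, dels, chg))  # insert t_ch
            total, ins, dels, chg = prev[j]
            best = min(best, (total + 1, ins, dels + 1, chg))  # delete s_ch
            curr.append(best)
        prev = curr
    return prev[-1][1:]
```

For instance, `edit_counts("Hendrix", "Hendericks")` yields three insertions, no deletions and one change, matching the second row of the table.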
20Name Strings: NYSIIS Method
- NYSIIS assumes that name fields differ because of common misspellings particular to the English language.
- A code is generated for a name by assigning specified characters to each character or group of characters in a name.
- These codes are then compared.
21Example of NYSIIS Coding
Name          NYSIIS Code*
Hendrix       handrac
Hinndricks    handrac
Henderix      handarac
Rosofsky      rasafsc
Rowsofskie    rasafsc
Knight        nat
Nite          nat
* Uses a modified version of the original NYSIIS coding.
22Names and NYSIIS (Cont'd)
- A NYSIIS score is computed by multiplying the Approximate Match score for the two coded names by the ratio of the sum of the lengths of the NYSIIS codes to the sum of the lengths of the original names.
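The NYSIIS score described here is a product of two quantities, which can be sketched as follows. The function name `nysiis_score` and its signature are assumptions; the codes are taken from the coding examples in this presentation, and the Approximate Match score of the coded names is passed in rather than computed, since the slides do not give its formula.

```python
def nysiis_score(name_1, name_2, code_1, code_2, coded_match_score):
    """Approximate Match score of the codes, scaled by
    (total length of the codes) / (total length of the names)."""
    length_ratio = (len(code_1) + len(code_2)) / (len(name_1) + len(name_2))
    return coded_match_score * length_ratio

# Hendrix and Hinndricks both code to "handrac", so the coded names are
# identical and their Approximate Match score is 1.0.
example = nysiis_score("Hendrix", "Hinndricks", "handrac", "handrac", 1.0)
```

Here the codes total 14 characters and the names 17, so the score is 14/17, about 0.82: a phonetic match counts for less when the coding discarded more of the original spelling.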
23Date Scoring
- Dates are processed as strings of digits (YYYYMMDD)
- A score is computed identically to the Approximate Match Method used for name strings
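A brief sketch of the date handling: each date is rendered as an eight-digit YYYYMMDD string and the two strings are then compared character by character, as for names. The helper names `as_digits` and `edit_distance` are assumptions, and the example dates are invented to show a transposed-digit error; the conversion from edit count to score is not given in the slides and is left out.

```python
from datetime import date

def as_digits(d):
    """Render a date as an eight-character YYYYMMDD digit string."""
    return d.strftime("%Y%m%d")

def edit_distance(a, b):
    """Minimum number of insertions, deletions and changes turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                   # delete ch_a
                            curr[j - 1] + 1,               # insert ch_b
                            prev[j - 1] + (ch_a != ch_b)))  # change or keep
        prev = curr
    return prev[-1]

# A year typed as 1979 instead of 1997 differs in two digit positions.
typo_distance = edit_distance(as_digits(date(1997, 2, 5)),
                              as_digits(date(1979, 2, 5)))
```

Treating dates as digit strings lets a single keying error count as a near match instead of a complete mismatch.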
24MIIS Scoring Example 1
Field                   Record 1    Record 2
First Name              Jasmine     Jazmynn
Middle Name             Danielle    D
Last Name               Hendrix     Hendricks
Date of Birth           01/01/95    01/01/95
Sex                     F           F
Birth Order             0           0
Mother's Maiden         Anderson    Abdnerson
Mother's DOB            10/10/50    10/10/50
MIIS Score: 0.80
25MIIS Scoring Example 2
Field                   Record 1      Record 2
First Name              Cindy         Cindy
Middle Name             Elizabeth     Elizabeth
Last Name               Castaneda     Castaneda
Date of Birth           2/5/97        2/5/97
Sex                     F             F
Birth Order             0             0
Mother's Maiden         Dentremont    Cathy Dentrement
Mother's DOB            2/5/79        <blank>
MIIS Score: 0.98
26MIIS Scoring Example 3: Limited/Uninformative Data
Field                   Record 1      Record 2
First Name              Michaela      Michelle
Middle Name             Kay           Kelly
Last Name               Wronkowski    Wroblewski
Date of Birth           1/5/97        1/5/97
Sex                     F             F
Birth Order             0             0
Mother's Maiden         Jones         Stockdale
Mother's DOB            5/8/63        11/27/68
27MIIS Scoring Example 3 (Cont'd)
Fields Compared                                                  Score
First, last, DOB                                                 0.88
First, last, DOB, sex, birth order                               0.95
First, last, DOB, sex, birth order, Mom's DOB                    0.79
First, last, DOB, sex, birth order, MDOB, m. maiden              0.67
First, last, DOB, sex, birth ord., MDOB, m. maiden, middle       0.60
First, last, DOB, MDOB, mother's maiden                          0.55
First, last, DOB, MDOB, mother's maiden, middle                  0.47
28MIIS Example 4: Twins!!!
Field                   Record 1       Record 2
First Name              Bladamir       Gladamir
Middle Name             <blank>        <blank>
Last Name               Von Nostrum    Von Nostrum
Date of Birth           5/4/88         5/4/88
Sex                     M              M
Birth Order             1              2
Mother's Maiden         Thomas         Thomas
Mother's DOB            9/17/60        9/17/60
MIIS Score: 0.99
29Database Candidate Records
- A set of candidate records must be selected for comparison that:
- Is likely to contain the record being compared
- Contains as few additional records as possible
- All records are examined that have:
- Same date of birth
- Same first character of the NYSIIS code for the last name
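The selection rule above amounts to grouping records under a key built from the date of birth and the first character of the last name's NYSIIS code. The sketch below assumes each record already stores its code in a hypothetical `last_nysiis` field (the codes used are the ones shown in the NYSIIS coding examples); the field names, sample dates and function names are illustrative, not the MIIS schema.

```python
from collections import defaultdict

def block_key(record):
    """Key shared by all records that should be compared with each other."""
    return (record["dob"], record["last_nysiis"][0])

def build_index(records):
    """Group database records by block key for fast candidate lookup."""
    index = defaultdict(list)
    for record in records:
        index[block_key(record)].append(record)
    return index

def candidates(index, incoming):
    """Return the database records sharing the incoming record's key."""
    return index.get(block_key(incoming), [])

database = [
    {"last": "Hendrix", "last_nysiis": "handrac", "dob": "19950101"},
    {"last": "Rosofsky", "last_nysiis": "rasafsc", "dob": "19950101"},
    {"last": "Knight", "last_nysiis": "nat", "dob": "19970105"},
]
index = build_index(database)
incoming = {"last": "Hinndricks", "last_nysiis": "handrac", "dob": "19950101"}
found = candidates(index, incoming)
```

Only the Hendrix record shares both the date of birth and the leading code character with the incoming record, so it is the only one passed on for scoring. An incoming record with a wrong date of birth would find no candidates at all.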
30Advantages of Selecting Candidate Records
- Search strategy is apt to find a matching record
- Uses data that is contained in each record in the database
- Enhances performance as not all records are examined
31Disadvantages of Selecting Candidate Records
- Can miss a true match if either:
- the date of birth is incorrect, or
- the first letter(s) of the last name is not a phonetic variation of the first letter(s) of the true last name
- Apt to return a large set of candidate records
- Does not use alternate search strategies employing other database fields
32Advantages of MIIS Matching Procedures
- Allows automation of record linking and deduplication.
- Candidate records can be prioritized according to the likelihood that they match a given set of data elements.
- Parameter driven and can be modified to accommodate the idiosyncrasies of each data source.
33Disadvantages of MIIS Matching Procedures
- Can make false matches when data is limited
- Requires extensive computer processing resources
34Lessons Learned
- Procedures and decisions must be data-driven.
- Deduplication will always involve manual resolution.
- Procedures will need continual evaluation and monitoring.
- Always err on the conservative side.