Andrew Borthwick, PhD - PowerPoint PPT Presentation

1
The NY Citywide Immunization Registry's MEDD De-Duplication Project
Andrew Borthwick, PhD; Vikki Papadouka, PhD, MPH; Deborah Walker, PhD
ChoiceMaker Technologies, Inc.: Andrew.Borthwick_at_choicemaker.com
New York City Department of Health: vpapadou_at_dohlan.cn.ci.nyc.ny.us, dwalker_at_dohlan.cn.ci.nyc.ny.us
Adapted from a presentation at the 34th National Immunization Conference, Washington, DC, July 7, 2000
2
New York Citywide Immunization Registry: The MEDD De-duplication Project
The NYC CIR
  • The New York Citywide Immunization Registry (CIR) was
    mandated in January 1997
  • All health-care providers are required to submit
    immunizations
  • Goals of the system
      • Doctors look up kids' immunization statuses to
        determine which shots to give
      • Notify parents when their children are due for an
        appointment
      • Identify citywide immunization trends
  • Similar registries are being built at the state
    and local level around the country

3
NYC CIR Background
  • About 122,000 children are born in NYC every year
  • Each month the CIR receives
      • 50,000-100,000 patient records and
      • 80,000-200,000 immunization records
      • From more than 1,100 institutions and private providers
  • Given this volume, hand-matching each new record
    before it enters the CIR is unrealistic

4
NYC CIR Background
  • Contains 1.8 million records
  • Very high duplication rate, estimated at 3
    records for every 2 children, because of very strict
    criteria for automatic merging
  • During April-September 1998, CIR staff reviewed
    and manually de-duplicated about 260,000 record
    pairs, spending 1,700 hours

5
MEDD: What it is
  • A system for deciding when two records represent
    the same child
  • Fast and accurate
  • Replicates the human decision-making process

6
MEDD's Decision-Making Process
  • For every record pair, MEDD computes a
    probability between 0% and 100% that the pair
    should be merged
  • High probabilities → merge
  • Low probabilities → don't merge
  • Intermediate probabilities (close to 50%)
    indicate "don't know" and require human review
  • Thresholds dividing the merge / don't know / don't
    merge cases are set by the user
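
The three-way decision described above can be sketched in a few lines. This is only an illustration: the function name and the default threshold values are assumptions, since the slides say only that the thresholds are set by the user.

```python
def classify_pair(p_merge, lower=0.05, upper=0.98):
    """Map a merge probability (0.0-1.0) onto the three outcomes.

    `lower` and `upper` are hypothetical user-chosen thresholds,
    not values prescribed by MEDD.
    """
    if not 0.0 <= p_merge <= 1.0:
        raise ValueError("p_merge must be between 0 and 1")
    if p_merge >= upper:
        return "merge"
    if p_merge <= lower:
        return "no-merge"
    # Everything between the two thresholds is a "don't know" case.
    return "human review"
```

Moving `upper` down or `lower` up sends fewer pairs to human review at the cost of more automatic errors.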

7
Maximum Entropy Modeling
  • MEDD uses Maximum Entropy Modeling
  • A new statistical decision-making technique
  • Learns the human judgment process by training from
    examples
  • Has been used in sentence parsing, computer
    vision, financial modeling, and proper-name
    identification
  • Has achieved state-of-the-art results on these
    problems

8
Maximum Entropy Modeling: Features
  • Maximum Entropy uses "features"
  • Feature: a function which looks at specific
    fields in the pair of records to make a "merge"
    or "don't merge" prediction
  • MEDD has many different features, each of which
    is assigned a weight during training

9
Sample MEDD Features
  • Mother's Birthday
      • Match of Mom's B'day predicts Merge
      • Mismatch of Mom's B'day predicts No-Merge
      • Neither feature fires if Mom's B'day wasn't
        filled in on both records
      • We have no evidence in this case
  • Many other features
      • Child's birthday
      • Child's first and last name
      • Medicaid Number
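
A minimal sketch of the mother's-birthday feature pair described above. The record layout (a dict with a `mother_dob` key) and the feature names are assumptions made for illustration; the slides do not show MEDD's actual data structures.

```python
from typing import Optional


def mothers_dob_feature(rec1: dict, rec2: dict) -> Optional[str]:
    """Return the name of the feature that fires, or None if neither fires."""
    a = rec1.get("mother_dob")
    b = rec2.get("mother_dob")
    # No evidence unless the field is filled in on both records.
    if not a or not b:
        return None
    if a == b:
        return "mother_dob_match"      # predicts merge
    return "mother_dob_mismatch"       # predicts no-merge
```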

10
Training the System
Record pairs hand-marked with merge/no-merge decisions + A set of features
  → Maximum Entropy Parameter Estimator
  → A weight for each feature
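
To make the training step concrete, here is a toy sketch of a parameter estimator. It is not ChoiceMaker's estimator: the slides do not say which algorithm was used, so plain stochastic gradient ascent on the log-likelihood stands in for it here. The feature names, learning rate, and regularization constant are all hypothetical.

```python
import math


def train_maxent(examples, epochs=200, lr=0.1, l2=0.01):
    """Estimate one multiplicative weight per feature.

    Each example is (merge_feats, nomerge_feats, is_merge): the sets of
    features that fired predicting merge and no-merge, plus the hand label.
    """
    lambdas = {}  # log-weights, implicitly 0.0 (weight 1.0) until updated
    for _ in range(epochs):
        for merge_feats, nomerge_feats, is_merge in examples:
            s_merge = sum(lambdas.get(f, 0.0) for f in merge_feats)
            s_no = sum(lambdas.get(f, 0.0) for f in nomerge_feats)
            # Same quantity as Merge / (Merge + NoMerge) with weights exp(lambda).
            p = 1.0 / (1.0 + math.exp(s_no - s_merge))
            err = (1.0 if is_merge else 0.0) - p
            for f in merge_feats:
                lam = lambdas.get(f, 0.0)
                lambdas[f] = lam + lr * (err - l2 * lam)
            for f in nomerge_feats:
                lam = lambdas.get(f, 0.0)
                lambdas[f] = lam + lr * (-err - l2 * lam)
    return {f: math.exp(lam) for f, lam in lambdas.items()}


# Four made-up hand-marked pairs, for illustration only.
toy_examples = [
    ({"last_match", "dob_match"}, set(), True),
    ({"last_match"}, {"dob_mismatch"}, False),
    ({"last_match", "dob_match"}, {"phone_mismatch"}, True),
    ({"last_match"}, {"dob_mismatch", "phone_mismatch"}, False),
]
toy_weights = train_maxent(toy_examples)
```

Features that reliably agree with the human decision end up with weights above 1; a feature that carries no information stays near 1 and barely changes the product.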
11

Probability Computation
For a pair of records, MEDD computes the
probability that the pair should be merged as
  P(merge) = Merge / (Merge + NoMerge)
where
  Merge = product of weights of all features
          predicting merge for the pair
  NoMerge = product of weights of all features
            predicting no merge for the pair
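
The computation can be checked with a short sketch. The function name is an assumption; the weights are the ones listed for the Smith / Emily-Emely record pair in the high-probability example of this presentation, and the normalization Merge / (Merge + NoMerge) is the one implied by that example's totals.

```python
import math


def merge_probability(merge_weights, nomerge_weights):
    """Probability of merge from the weights of the features that fired."""
    merge = math.prod(merge_weights)      # product over merge-predicting features
    nomerge = math.prod(nomerge_weights)  # product over no-merge-predicting features
    return merge / (merge + nomerge)


# Weights from the high-probability example (Smith, Emily vs. Emely).
smith_merge = [1.153, 4.708, 1.138, 4.342, 1.103, 3.013, 6.587]
smith_nomerge = [1.350, 2.130]
p_smith = merge_probability(smith_merge, smith_nomerge)
```

The two products come to roughly 587.2 and 2.9, which matches the totals and the 99.5% confidence reported for that pair.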
12
High Probability. Human Decision: Merge
Field Name         Record 1      Record 2      Feature   Weight  Prediction
Last name          Smith         Smith         Match     1.153   Merge
First name         Emily         Emely         No-match  1.350   No-merge
                                               Soundex   4.708   Merge
DOB                04/28/97      04/28/97      Match     1.138   Merge
Multiple birth     N             N
Mom's Maiden Name  CRUZ (one record only)
Mother's DOB       12/04/76 (one record only)
Street             4528 3rd Ave  4528 3rd Ave  Match     4.342   Merge
City               Bronx         Bronx         Match     1.103   Merge
State              NY            NY
Zip                10462         10462         Match     3.013   Merge
Phone              718-123-4567  718-123-6789  No-match  2.130   No-merge
Med Rec Number     11856437503   11856437503   Match     6.587   Merge
Merge total: 587.2
No-merge total: 2.9
MEDD predicts Merge with 99.5% confidence
13
Low Probability. Human Decision: No-Merge
Field Name         Record 1      Record 2      Feature   Weight  Prediction
Last name          Lopez         Lopez         Match     1.153   Merge
First name         Girl          Susan
DOB                1/11/97       1/2/97        No-match  28.949  No-merge
Multiple birth     N             N
Mom's Maiden Name  Lopez (one record only)
Mother's DOB
Street             987 Cornelia  456 Park      No-match  2.937   No-merge
City               Brooklyn      Brooklyn      Match     1.103   Merge
State              NY            NY
Zip                11211         11211         Match     3.013   Merge
Phone              718-123-4567  718-234-5678  No-match  2.130   No-merge
Med Rec Number     1001002       567435
Merge total: 3.8
No-merge total: 181.1
MEDD predicts No-merge with 97.9% confidence
14
Intermediate Probability. Human Decision: Merge
Field Name         Record 1      Record 2      Feature   Weight  Prediction
Last name          Hernandez     Hernandez     Match     1.153   Merge
First name         Boy           David
DOB                2/14/97       2/14/97       Match     1.138   Merge
Multiple birth     N             N
Mom's Maiden Name  Hernandez (one record only)
Mother's DOB       11/4/78 (one record only)
Street             142 4th Ave   142 4th Ave   Match     4.342   Merge
City               Bronx         Bronx         Match     1.103   Merge
State              NY            NY
Zip                11051         11052         No-match  2.551   No-merge
Phone              718-524-4879  718-524-4878  No-match  2.130   No-merge
Med Rec Number     1001002       567435
Merge total: 6.3
No-merge total: 5.4
MEDD predicts Merge with 53.9% confidence (Human review)
15
Sophisticated MEDD Features: Name Frequency
  • Name Frequency
      • "Rodriguez" is 9 times more common than "Walker"
        in NYC
      • Fewer than 3 kids per year are born with the names
        "Borthwick" and "Papadouka"
  • Hence we build features categorizing names as
    "very common", "somewhat common", "very rare",
    etc.
  • Given that we have a name match, the fact that
    the names are very common is a feature predicting
    "don't merge"
  • A match between rare names is a feature
    predicting "merge"
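
One way such name-frequency features could be built is sketched below. The bucket cut-offs, the feature-name format, and the toy counts are all assumptions; only the 9-to-1 Rodriguez/Walker ratio comes from the slide.

```python
def name_frequency_bucket(name, counts, total,
                          common_share=0.01, rare_share=0.0001):
    """Categorize a name by its share of all records (cut-offs are hypothetical)."""
    share = counts.get(name.upper(), 0) / total
    if share >= common_share:
        return "very common"
    if share <= rare_share:
        return "very rare"
    return "somewhat common"


def last_name_match_feature(name1, name2, counts, total):
    """Fire a frequency-specific feature only when the two names match."""
    if name1.upper() != name2.upper():
        return None
    bucket = name_frequency_bucket(name1, counts, total)
    return "last_name_match_" + bucket.replace(" ", "_")


# Made-up counts per 100,000 records, for illustration only.
toy_counts = {"RODRIGUEZ": 1800, "WALKER": 200, "BORTHWICK": 2}
toy_total = 100_000
```

Each bucket then gets its own trained weight, so a match on a very common name can count against merging while a match on a very rare name counts strongly for it.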

16
Sophisticated MEDD Features: Partial Name Match
  • Soundex: a phonetic representation of names
      • Connor, Conor, Conner → CNR
      • When the Soundex representation of two names
        matches, a feature fires predicting merge
  • Edit Distance: features firing based on two
    names having an edit distance of 1
      • Borthwich ↔ Borthwick ↔ Bortwick
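
Both partial-match tests can be sketched with textbook algorithms. The Soundex below is the standard American variant, which yields a letter plus three digits (for example C560) rather than the abbreviated "CNR" shown on the slide; whether MEDD used this exact variant is not stated. The edit distance is plain Levenshtein distance.

```python
def soundex(name):
    """Standard American Soundex code, e.g. 'Robert' -> 'R163'."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        # H and W do not separate two letters that share a code; vowels do.
        if ch not in "HW":
            prev = code
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]
```

A Soundex feature would fire when `soundex(n1) == soundex(n2)`, and an edit-distance feature when `edit_distance(n1, n2) == 1`.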

17
Special Situation Features
  • Every database has its quirks
  • HMO XYZ always sends its data to the CIR with Day
    of Birth = 1
      • Birthday = July 1, 1998, not July 15, 1998
  • We have a special feature
      • If Provider = HMO XYZ AND Day of Birth = 1 AND
        dates differ only on day of birth, THEN predict
        merge
  • We plan to allow users to define these types of
    features themselves
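
The rule above translates almost directly into code. This sketch assumes each record is a dict with a `provider` string and a `dob` stored as a `datetime.date`; those field names are hypothetical, and "HMO XYZ" is the placeholder name used on the slide.

```python
from datetime import date


def hmo_day_one_feature(rec1, rec2, provider="HMO XYZ"):
    """Predict merge when a provider's placeholder day of birth explains the mismatch."""
    for placeholder, other in ((rec1, rec2), (rec2, rec1)):
        d1, d2 = placeholder["dob"], other["dob"]
        if (placeholder.get("provider") == provider
                and d1.day == 1
                and d1 != d2
                and (d1.year, d1.month) == (d2.year, d2.month)):
            return "merge"
    return None  # the feature does not fire
```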

18
Test Procedure
  • MEDD was tested on about 3,000 pairs under NYC DOH
    supervision
  • Pairs were carefully hand-scored by NYC DOH as
    Merge/Don't Merge
  • ChoiceMaker never saw the test data

19
MEDD Evaluation Results
Requested Accuracy                          % of Records Needing Human Review
1% False Positive, 1% False Negative        1.4%
0.5% False Positive, 0.5% False Negative    2.6%
0.3% False Positive, 0.3% False Negative    3.2%
Even with double-checking, the human error rate is no
better than 0.3%
20
Summary: What MEDD Offers
  • Can be trained on just 3,000 record pairs
  • Judges nearly 1,000 record pairs per second
  • Achieves very high accuracy by finding the
    optimal weighting of the different clues
    (features) indicating merge/don't merge
  • Says "merge", "don't merge", or "I don't know"
  • Can be rigorously tested
  • Registry management can make informed judgments
    regarding the effort vs. accuracy trade-off

21
The 5 Stages of the De-duplication Process
  • Blocking: identify a list of possible duplicates
    (SmartSearch)
  • Decision-Making: identify a definitive list of
    duplicate records (MEDD)
  • Human Review of
      • Records marked as "don't know" by MEDD
      • Records held by special filters (twins, scanty
        records, etc.)
  • Linkage: link records that belong to the same
    child together (if A=B and B=C then A=C)
  • Update the CIR
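
The linkage stage (if A=B and B=C then A=C) amounts to grouping records into connected components. A minimal union-find sketch follows; the function name and the use of plain record IDs are assumptions, and the slides do not describe how the CIR actually implements this step.

```python
def link_records(merged_pairs):
    """Group record IDs so that every chain of merged pairs lands in one group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in merged_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for rec in list(parent):
        groups.setdefault(find(rec), set()).add(rec)
    return list(groups.values())
```

Given the pairs (A, B) and (B, C), records A, B, and C end up in the same group even though A and C were never compared directly.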

22
Project Avalanche
  • Project Avalanche: a project by which we
    systematically de-duplicate the whole CIR by
    comparing every record to every record meeting
    certain criteria
  • Uses our querying tool SmartSearch and our
    de-duplication tool MEDD
  • Project Avalanche I: February-April 2000
  • Project Avalanche II: May-July 2000

23
Project Avalanche I
  • Used strict blocking criteria for finding
    possible duplicates to be passed on to MEDD, such
    as
      • Exact match on DOB + Medical Record or
      • Exact match on Medicaid number or
      • First name + gender + DOB + last name = maiden name (and
        vice versa) or
      • Last name + First name + DOB
  • Used 98% as the cut-off for automatic merging
  • Hand-reviewed records produced by the filters
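
Blocking of this kind can be sketched as grouping records by key and only pairing records that share at least one key. The field names below are hypothetical, only three of the listed criteria are shown (the maiden-name criterion is omitted), and this is not SmartSearch's actual implementation.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(rec):
    """Build the keys under which a record can meet possible duplicates."""
    keys = []
    if rec.get("dob") and rec.get("med_rec"):
        keys.append(("dob+medrec", rec["dob"], rec["med_rec"]))
    if rec.get("medicaid"):
        keys.append(("medicaid", rec["medicaid"]))
    if rec.get("last") and rec.get("first") and rec.get("dob"):
        keys.append(("name+dob", rec["last"].upper(),
                     rec["first"].upper(), rec["dob"]))
    return keys


def candidate_pairs(records):
    """Return index pairs (i, j), i < j, that share at least one blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        for key in blocking_keys(rec):
            blocks[key].append(idx)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs
```

Only the pairs returned here would be passed on for scoring, which avoids comparing every record with every other record.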

24
Project Avalanche I Results
[Results chart not preserved in the transcript; its figures were labeled "Estimated"]
25
Project Avalanche II
  • In April 2000 we loaded 4 months' worth of data
    that were held due to Y2K problems
  • Used more liberal blocking criteria
      • Medical Record Number
      • month and year of DOB or
      • day and year of DOB or
      • day and month of DOB or
      • first name
  • Used 90% as the cut-off for automatic merging
  • Currently hand-reviewing records produced by the
    filters

26
Project Avalanche II Results
[Results chart not preserved in the transcript; its figures were labeled "Estimated"]
27
Project Avalanche: Discussion
  • Using a very conservative cut-off for automatic
    merging, we reduced the duplicates by about 27.5%
    each time, more than 30% including human review
  • As a result of Project Avalanche, 81% of records
    now have immunizations vs. 58% six months ago
  • Since MEDD is not yet implemented on the front
    end of the CIR, you don't see the total number of
    duplicates decreasing over time in these early
    runs

28
Future of MEDD at the CIR
  • As part of the Lead and CIR integration, MEDD will
    be inserted on the front end, thus reducing the
    number of duplicates being created
  • Improving MEDD's performance will enable us to
    automatically merge more duplicates with the same
    error rate
  • Will continue with Project Avalanche until we
    bring the duplication rate down to an acceptable
    level

29
Summary: ChoiceMaker Status
  • Currently have two employees
      • Andrew Borthwick, Ph.D.
      • Prof. Arthur Goldberg
  • Have several major contracts with New York City
    Dept. of Health
  • Good prospects of finding similar work with other
    state and municipal health departments

30
Summary: De-duplication Marketplace
  • Immunization Registries have very difficult
    duplicate record problems
  • Many others have similar problems
      • Medical researchers (correlating birth
        certificate and maternal death records)
      • Banks, phone companies (correlating clients from
        different lines of business)
      • Direct marketers (merging mailing lists)
31
Summary: ChoiceMaker's Plans
  • Do further research to decrease the amount of
    consulting time needed to deploy MEDD
  • Seeking first-round investors to fund expansion
    of R&D and marketing
  • Have an opening for someone with an M.S. in C.S.
    or similar qualifications, starting 10/1/2000, and
    for a C.S. Ph.D., starting 11/1/2000