Open Source Deidentification Software - The SPIN/VSL Experience


1
Open Source Deidentification Software - The SPIN/VSL Experience
  • Bruce Beckwith, MD
  • Beth Israel Deaconess Medical Center
  • Harvard Medical School

2
The Tissue Challenge
3
The Privacy Challenge
  • Protection of research subjects
  • Institutional Review Board
  • HIPAA
  • Common Rule

4
The SPIN/VSL Solution
[Diagram labels: User, Query Tool, HTTPS; BWH Node, HMS Node, CHMC Node, MGH Node, BIDMC Node]
Distributed databases containing de-identified clinical data
5
Shared Pathology Informatics Network (SPIN)
  • NCI initiative
  • 5-year demonstration project
  • 2 consortia
    • Harvard/UCLA
    • Indiana/Pittsburgh
  • Built functioning network
  • Proof-of-concept tissue studies ongoing

6
Specific Challenges
  • Integrate heterogeneous data sources
  • Allow local control of information
  • Respect patient privacy
  • Comply with federal regulations
  • Obtain cooperation of various institutions
  • Coordinate the IRBs of the different HIPAA
    covered entities
  • Sign up tissue repositories

7
Harvard VSL Network
  • BIDMC 318,883
  • BWH 428,226
  • MGH 100,777
  • CHMC 23,205
  • Total > 850,000 cases
  • Live June 2005

11
Information Pipeline
  • Extract pathology reports from LIS
  • Convert from the local format into the SPIN XML
    format
  • Remove identifying information
  • Automatically code important medical concepts
  • Load into local node database
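
The five steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the function names, the pipe-delimited LIS export, the XML element names, and the concept code are all hypothetical and are not the SPIN XML schema or the actual node software. The coding step is a stub keyword lookup.

```python
import re
import xml.etree.ElementTree as ET

def extract_from_lis(raw_export):
    """Step 1: split a (hypothetical) pipe-delimited LIS export into fields."""
    accession, name, text = raw_export.split("|", 2)
    return {"accession": accession, "name": name, "text": text}

def to_xml(record):
    """Step 2: wrap the fields in XML. Element names are made up for this sketch."""
    report = ET.Element("report")
    header = ET.SubElement(report, "header")
    ET.SubElement(header, "accession").text = record["accession"]
    ET.SubElement(header, "name").text = record["name"]
    ET.SubElement(report, "text").text = record["text"]
    return report

def remove_identifiers(report):
    """Step 3: blank out header values found in the text, then drop the header."""
    body = report.find("text")
    header = report.find("header")
    for field in header:
        body.text = re.sub(re.escape(field.text), "[removed]", body.text,
                           flags=re.IGNORECASE)
    report.remove(header)
    return report

def code_concepts(report, vocabulary):
    """Step 4: stub autocoder that tags any vocabulary term found in the text."""
    text = report.find("text").text.lower()
    for term, code in vocabulary.items():
        if term in text:
            ET.SubElement(report, "concept", code=code).text = term
    return report

def load_into_node(report, node_db):
    """Step 5: store the serialized report in a stand-in for the node database."""
    node_db.append(ET.tostring(report, encoding="unicode"))

node_db = []  # a plain list stands in for the local node database
raw = "S05-1234|Jane Doe|Specimen S05-1234 from Jane Doe shows adenocarcinoma."
vocabulary = {"adenocarcinoma": "CODE-001"}  # placeholder code, not a real vocabulary

record = extract_from_lis(raw)
report = code_concepts(remove_identifiers(to_xml(record)), vocabulary)
load_into_node(report, node_db)
print(node_db[0])
```

Keeping each step as its own function mirrors the slide's ordering: identifiers are removed before anything is loaded into the node.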

12
Local view of institution
[Diagram labels: Pathology, Clinical, MPI, Institutional Systems, UPDATE, Node Tools, SPIN Node, Network, Institutional Firewall, Internal Threshold]
13
Deidentification
  • Pathology reports always contain identifiers
  • Header information is trivial to remove since it
    resides in well defined fields
  • Identifying information embedded in text of
    pathology reports is difficult to completely
    remove

14
HIPAA and Deidentification
  • 18 categories of information defined
  • If all of this information is removed, the
    remaining data are no longer considered Protected
    Health Information (PHI)
  • Certain non-identifying information may be left
    in
    • Ages (< 90 years)
    • Locations (state, country)

15
HIPAA Identifiers
  • Names
  • All geographic subdivisions smaller than a
    state
  • All elements of dates smaller than a year
  • Ages over 89
  • Phone/fax numbers
  • E-mail addresses
  • Social Security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Any other account numbers
  • Certificate/license numbers
  • Vehicle identifiers
  • Device identification numbers
  • Web URLs
  • Internet Protocol (IP) addresses
  • Biometric identifiers (fingerprints, voice prints,
    retina scans, etc.)
  • Full-face photographs or comparable images
  • Any other unique number, characteristic, or code
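
Two of the categories listed above are rules about granularity rather than outright removal: only date elements smaller than a year are identifiers, and only ages over 89 are. A minimal sketch of those two rules follows. The function names and the "90+" label are illustrative assumptions, and this is not taken from the HMS Scrubber.

```python
from datetime import date

def generalize_age(age_years):
    """Ages of 89 or less may stay as-is; older ages collapse into one category."""
    return str(age_years) if age_years <= 89 else "90+"

def generalize_date(d):
    """Keep only the year, since all date elements smaller than a year are identifiers."""
    return str(d.year)

print(generalize_age(67))
print(generalize_age(93))
print(generalize_date(date(2005, 6, 14)))
```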

16
Pre-existing Options
  • Pittsburgh De-ID
    • Proprietary
  • Berman concept match scrubbing
    • Alters text
  • Other tools had not been tested on pathology
    reports

17
HMS Scrubber
  • An open source software tool for removing direct
    identifiers from text of pathology reports
  • Modular design which is easy to modify
  • Multiple development cycles
  • Final testing on 1800 cases (600 each from BIDMC,
    MGH and BWH)

18
Scrubber Design
  • Input in SPIN/VSL XML format
  • Remove identifiers specified in the header (name,
    MRN, accession number, etc.)

19
Example XML
20
Scrubber Design
  • Input in SPIN/VSL XML format
  • Remove identifiers specified in the header (name,
    MRN, accession number, etc.)
  • Search for information based on predictable
    patterns
    • Dr. Xxxx
    • Mrs. Yyyy
    • nn/nn/nnnn
    • Dates, accession numbers
  • Use a list of prohibited words or phrases
    • Names, locations, etc.
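
The three passes described above (header values, predictable patterns, prohibited words) might look roughly like the following. This is a sketch under assumptions, not the HMS Scrubber's code: the function name, the placeholder text, the sample report, and both regular expressions are invented for illustration.

```python
import re

PLACEHOLDER = "[removed]"

# Illustrative patterns only; the tool's own expressions are not shown in the slides.
PATTERNS = [
    re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+"),  # Dr. Xxxx, Mrs. Yyyy
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),            # nn/nn/nnnn
]

def scrub(text, header_values, prohibited_words):
    # Pass 1: remove every value the header already marks as an identifier.
    for value in header_values:
        text = re.sub(re.escape(value), PLACEHOLDER, text, flags=re.IGNORECASE)
    # Pass 2: remove text that matches a predictable pattern.
    for pattern in PATTERNS:
        text = pattern.sub(PLACEHOLDER, text)
    # Pass 3: remove whole words that appear on the prohibited list.
    def replace_if_prohibited(match):
        word = match.group(0)
        return PLACEHOLDER if word.lower() in prohibited_words else word
    return re.sub(r"[A-Za-z]+", replace_if_prohibited, text)

report = ("Patient Jane Doe (MRN 1234567) seen by Dr. Smith on 03/14/2005 "
          "in Boston. Colon biopsy: adenocarcinoma.")
scrubbed = scrub(report, header_values=["Jane Doe", "1234567"],
                 prohibited_words={"boston"})
print(scrubbed)
```

Running the header pass first means the most reliable information (values known to be identifiers for this report) is used before the more error-prone pattern and word-list passes.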

21
Regular Expressions
  • About 50 currently
  • Predictable patterns to locate potential
    identifiers
  • Tweaked based on training sets
  • May not work well on other datasets
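
To illustrate why patterns tuned on one training set may not carry over to another, here is a small sketch. Both patterns and both sample sentences are hypothetical. A date expression tuned to the nn/nn/nnnn style finds nothing in a report that writes dates out, until a second pattern is added.

```python
import re

# Hypothetical pattern, tuned on a training set whose dates look like nn/nn/nnnn.
NUMERIC_DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# A second hypothetical pattern, added after seeing a different reporting style.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
WRITTEN_DATE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")

def find_dates(text, patterns):
    """Return every substring matched by any of the given patterns."""
    return [m for p in patterns for m in p.findall(text)]

in_house = "Received 03/14/2005. Reported 03/16/2005."
consult = "Outside slides received 14 March 2005."

print(find_dates(in_house, [NUMERIC_DATE]))
print(find_dates(consult, [NUMERIC_DATE]))
print(find_dates(consult, [NUMERIC_DATE, WRITTEN_DATE]))
```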

22
Names Lists
  • Census names
    • http://www.census.gov/genealogy/names/names_files.html
  • Census gazetteer
    • http://www.census.gov/geo/www/gazetteer/gazette.html
  • Optional sources
    • Institution-specific lists
    • Other
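
A names list can be applied as a simple set lookup. The sketch below assumes a file with one name as the first token on each line; the three sample names, the sample sentence, and the optional allow list are all invented for illustration and are not features documented for the HMS Scrubber. It also shows one plausible way such lists cause overscrubbing: surnames that are also ordinary descriptive words.

```python
import io
import re

def load_names(lines):
    """Build a lowercase lookup set from the first token of each non-empty line."""
    return {line.split()[0].lower() for line in lines if line.strip()}

def scrub_names(text, names, allow=frozenset()):
    """Replace any word found in the names set unless it is on the allow list."""
    def swap(match):
        word = match.group(0).lower()
        if word in names and word not in allow:
            return "[removed]"
        return match.group(0)
    return re.sub(r"[A-Za-z]+", swap, text)

sample_file = io.StringIO("SMITH\nBROWN\nWHITE\n")  # stand-in for a names file
names = load_names(sample_file)
text = "Smith specimen: firm white nodule with brown discoloration."

print(scrub_names(text, names))
print(scrub_names(text, names, allow={"white", "brown"}))
```

The allow list trades one error for another: it keeps the descriptive words but would also let a patient named Brown or White through, so any such list would need validation against real reports.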

23
Scrubber Performance
                                      Dept. A  Dept. B  Dept. C    Total
Reports                                   600      600      600     1800
Reports with any identifier               415      239      600     1254
Unique identifiers                       1079      338     2082     3499
Unique identifiers per report             1.8      0.6      3.5      1.9
Source: BMC Med Inform Decis Mak 2006, 6:12
24
Distribution of Identifiers
25
Scrubber Performance
                                      Dept. A  Dept. B  Dept. C    Total
Reports                                   600      600      600     1800
Reports with any identifier               415      239      600     1254
Unique identifiers                       1079      338     2082     3499
Unique identifiers per report             1.8      0.6      3.5      1.9
Unique identifiers removed               1057      320     2062     3439
Unique identifiers remaining, total        22       18       20       60
Unique HIPAA identifiers remaining         11        1        7       19
% of unique identifiers removed          98.0     94.7     99.0     98.3
Source: BMC Med Inform Decis Mak 2006, 6:12
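
The percentage row in the table above follows directly from the counts. The short check below recomputes it from the slide's own numbers; the variable and function names are ours.

```python
# Counts copied from the table above.
identifiers = {"Dept. A": 1079, "Dept. B": 338, "Dept. C": 2082}
removed = {"Dept. A": 1057, "Dept. B": 320, "Dept. C": 2062}

def percent_removed(removed_count, total_count):
    """Share of unique identifiers removed, as a percentage rounded to one decimal."""
    return round(100 * removed_count / total_count, 1)

by_dept = {d: percent_removed(removed[d], identifiers[d]) for d in identifiers}
overall = percent_removed(sum(removed.values()), sum(identifiers.values()))

for dept, pct in by_dept.items():
    print(dept, pct)
print("Total", overall)
```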
26
Identifier                       Identifier Type  In-house Cases  Consult Cases  Total
Accession number                 HIPAA                         0             10     10
Patient name, misspelled         HIPAA                         5              2      7
Patient name, correctly spelled  HIPAA                         0              0      0
Medical record number            HIPAA                         1              0      1
Date                             HIPAA                         1              0      1
HIPAA subtotal                                                 7             12     19
Institution address, partial     Non-HIPAA                     0             17     17
Age < 90                         Non-HIPAA                    16              0     16
Health care organization name    Non-HIPAA                     0              6      6
Doctor name                      Non-HIPAA                     1              1      2
Non-HIPAA subtotal                                            17             24     41
Grand total HIPAA and Non-HIPAA                               24             36     60
Source: BMC Med Inform Decis Mak 2006, 6:12
27
Overscrubbing
                                                    Dept. A  Dept. B  Dept. C    Total
Unique identifiers removed                             1057      320     2062     3439
Unique overscrubs                                      1126      961     2584     4671
Unique overscrubs per report                            1.9      1.6      4.3      2.6
% of unique phrases removed that were identifiers      48.4     25.0     44.4     42.4
Source: BMC Med Inform Decis Mak 2006, 6:12
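
The last row above appears to be identifiers removed divided by everything removed (identifiers plus overscrubs). That formula is inferred from the numbers rather than stated on the slide, but it reproduces all four percentages:

```python
# Counts copied from the table above.
identifiers_removed = {"Dept. A": 1057, "Dept. B": 320, "Dept. C": 2062}
overscrubs = {"Dept. A": 1126, "Dept. B": 961, "Dept. C": 2584}

def percent_true_identifiers(removed_count, overscrub_count):
    """Share of removed phrases that really were identifiers, rounded to one decimal."""
    return round(100 * removed_count / (removed_count + overscrub_count), 1)

by_dept = {d: percent_true_identifiers(identifiers_removed[d], overscrubs[d])
           for d in identifiers_removed}
overall = percent_true_identifiers(sum(identifiers_removed.values()),
                                   sum(overscrubs.values()))

for dept, pct in by_dept.items():
    print(dept, pct)
print("Total", overall)
```

Read this way, fewer than half of the phrases removed were true identifiers, which is the cost paid for the high removal rate.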
28
Version 1 Scrubber
  • Written in Java
  • Requires input in SPIN XML format
  • Requires JDOM and MySQL
  • Relatively hard to implement

29
Version 2
  • Unrestricted input format
  • Increased modularity and flexibility
  • Only requires Java
  • Faster
  • However, not yet tested in production

30
Another Open Source Scrubber
  • Indiana University (Gunther Schadow)
  • Used for loading their SPIN node
  • http://aurora.regenstrief.org/schadow/text/
  • Would like to combine code in future

31
Scrubber Summary
  • > 99% of HIPAA identifiers removed
  • Performance varied by institution
  • Style differences important
  • Consult cases the most problematic
  • Need to continually validate to catch changes in
    style
  • This scrubber may be easily modified to handle
    other types of reports

32
SPIN/VSL Accomplishments
  • Designed open source peer-to-peer network for
    medical data sharing
  • Defined standard XML schema for representing
    pathology information
  • Created software which allows for safe use of
    information from pathology reports
  • Built national demonstration network
  • Fully approved functional network launched at
    Harvard in 2005

33
Acknowledgements
  • VSL Core Directors
    • Isaac Kohane (CH)
    • Chris Fletcher (BWH)
  • VSL Team
    • Connie Gee (DF)
    • Frank Kuo (BWH)
    • Ulysses Balis (MGH)
    • Antonio Perez-Atayde (CH)
    • Andrew McMurry (CH)
    • Raji Mahaadevan (HMS)
    • Elizabeth Sands (MGH)
  • SPIN
    • MGH Lab of Computer Science
      • Henry Chueh
      • Roger Berkowitz
      • Ana Holzbach
    • Indiana
      • Clem McDonald
      • Gunther Schadow
    • Univ. of Pittsburgh
      • Michael Becich
      • Rebecca Crowley
    • UCLA
      • Jonathan Braun
      • Tom Drake
  • And Many Others!

34
Websites
  • Shared Pathology Informatics Network
    • http://spin.nci.nih.gov
  • HMS Scrubber
    • Version 1 (for SPIN-format XML only)
      • http://spin.nci.nih.gov/SPIN/content
    • Version 2 (unrestricted input format)
      • https://sourceforge.net/projects/spin-chirps