Title: Linking Registry Data: Technical and Legal Considerations
1Linking Registry Data: Technical and Legal
Considerations
- Sara Rosenbaum, JD
- George Washington University
- Alan F. Karr, PhD
- National Institute of Statistical Sciences
2Authors and Reviewers
- Authors
- - Stephen E. Fienberg (lead), Carnegie Mellon University
- - Sara Rosenbaum (lead), George Washington University
- - Susan Adams (lead), Dartmouth College
- - Alan F. Karr, National Institute of Statistical Sciences
- - Bradley Malin, Vanderbilt University
- - Deven McGraw, The Center for Democracy and Technology
- - Maya A. Bernstein, Office of the Assistant Secretary for Planning and Evaluation, DHHS
- - Melissa M. Goldstein, George Washington University
- - Joy Pritts, Georgetown University
- Reviewers
- - Julia Lane, National Science Foundation
- - Eric Peterson, Duke University Medical Center
- - Victoria Prescott, McBroom Consulting, LLC
- - Gerald Riley, Centers for Medicare & Medicaid Services
- - Marcy Wilder, Hogan & Hartson
3Purpose
- Increasingly, statistical methods are used to
link data from multiple de-identified sources.
- What is the risk of identifying patients by
combining data from multiple registries?
- What legal and ethical requirements must
researchers meet to ensure patient privacy and
confidentiality?
4Paper Overview
- INTRODUCTION
- TECHNICAL ASPECTS OF DATA LINKAGE PROJECTS
- - Linking records for research and improving public health
- - What do Privacy, Disclosure, and Confidentiality mean?
- - Linking records and probabilistic matching
- - Procedural issues in linking datasets
- LEGAL ASPECTS OF DATA LINKAGE PROJECTS
- - Risks of identification
- - The HIPAA Privacy Rule
5Paper Overview (cont.)
- D. RISK MITIGATION FOR DATA LINKAGE PROJECTS
- - Methodology for mitigating the risk of re-identification
- - Security practices, standards, and technologies
- SUMMARY
- SCENARIOS
- I. Linking Clinical Registry Data with Insurance Claims Files
- II. Planning for Data Linkage Projects
6High-Level View: Linking records for research
and improving public health
- The scientific value of a registry increases with
the number of cases and the extent of the health
information included.
- There is an ethical obligation to protect patient
interests when collecting, sharing, and studying
person-specific biomedical information.
- Thus, a tension exists between the broad goals of
registries and regulations protecting
individually identifiable information.
- A large body of federal law applies to health
information privacy.
7Key Terms
- Privacy: protection of people against unauthorized
uses of personally identifiable information (PII),
specifically protected health information (PHI)
- Disclosure: attribution of information to the
source of the data
- Confidentiality: protection accorded to
statistical data
8Technical Aspects: Privacy, Disclosure and
Confidentiality
- Privacy
- As used in the HIPAA Privacy Rule, the term
applies to protected health information (PHI).
9Technical Aspects: Privacy, Disclosure and
Confidentiality (cont.)
- Disclosure
- Technical: the attribution of information to the
source of the data.
- Identity disclosure occurs when the data source
becomes known from the data release itself.
- Attribute disclosure occurs when the released
data make it possible to infer the
characteristics of an individual data source more
accurately than would have otherwise been
possible.
- Inferential disclosure relates to the probability
of identifying a particular attribute of a data
source.
- HIPAA: the release, transfer, provision of,
access to, or divulging in any other manner of
information outside of the entity holding the
information.
10Technical Aspects: Privacy, Disclosure and
Confidentiality (cont.)
- Confidentiality
- A quality or condition of protection accorded to
statistical information as an obligation not to
permit the transfer of that information to an
unauthorized party.
- A different notion of confidentiality relates to
the ethical, legal, and professional obligation
of those who receive information in the context
of a clinical relationship to respect the privacy
interests of their patients.
11Technical Aspects: Linking records and
probabilistic matching
- Techniques for record linkage
- Unique identifiers
- AI-like rule
- Probabilistic approaches
- The probabilistic approach is built on five key
components:
- 1. Define features that describe similarity
between records.
- 2. Place feature vectors into three classes: matches
(M), non-matches (U), and possible matches (P).
- 3. Perform record-pair classification by calculating
the ratio P(Y | M) / P(Y | U) for each
pair, where Y is the feature vector for the pair
and P(Y | M) and P(Y | U) are the probabilities
of observing that feature vector for a matched
and a non-matched pair, respectively.
- 4. Where no known duplicate and/or non-duplicate
record pairs are available, estimate conditional
probabilities by using observed frequencies in
the records to be linked.
- 5. Blocking, or partitioning the databases based
on some variable present in both databases, improves
efficiency.
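
The components above can be illustrated with a minimal sketch in the style of likelihood-ratio record linkage. Everything in it is an assumption made for illustration: the toy records, the field names, the agreement probabilities in `M_PROB` and `U_PROB`, the thresholds, and the simplifying assumption that features agree independently of one another. It is not taken from the white paper.

```python
import math
from itertools import product

# Hypothetical toy records: (record id, surname, birth year, zip code)
registry = [
    ("r1", "smith", 1950, "27701"),
    ("r2", "jones", 1962, "03755"),
]
claims = [
    ("c1", "smith", 1950, "27701"),
    ("c2", "smyth", 1950, "27701"),
    ("c3", "brown", 1950, "03755"),
    ("c4", "jones", 1971, "03755"),
]

# Assumed conditional agreement probabilities (illustrative values only):
# M_PROB[f] = P(feature f agrees | pair is a match)
# U_PROB[f] = P(feature f agrees | pair is a non-match)
M_PROB = {"surname": 0.95, "zip_code": 0.90}
U_PROB = {"surname": 0.01, "zip_code": 0.05}


def feature_vector(a, b):
    """Component 1: features that describe similarity between two records."""
    return {"surname": a[1] == b[1], "zip_code": a[3] == b[3]}


def log_likelihood_ratio(y):
    """Component 3: log of P(Y | M) / P(Y | U), assuming independent features."""
    total = 0.0
    for name, agrees in y.items():
        m, u = M_PROB[name], U_PROB[name]
        total += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return total


def classify(score, upper=3.0, lower=-3.0):
    """Component 2: assign the pair to match, non-match, or possible match."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible"


def link(left, right):
    """Component 5: compare only pairs in the same block (same birth year)."""
    results = {}
    for a, b in product(left, right):
        if a[2] != b[2]:
            continue  # different block, pair is never compared
        score = log_likelihood_ratio(feature_vector(a, b))
        results[(a[0], b[0])] = classify(score)
    return results


decisions = link(registry, claims)
```

With these assumed values, `r1`/`c1` agree on both features and are classified as a match, `r1`/`c2` agree only on zip code and fall into the possible-match class, and `r1`/`c3` disagree on both and are classified as a non-match. Record `r2` shares a birth year with no claims record, so blocking removes all of its candidate pairs. In practice the agreement probabilities would be estimated (component 4) rather than fixed by hand.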
12Technical Aspects: Procedural issues in linking
data sets
- Neither data nor link can be defined
unambiguously, and the relationship between
datasets can vary.
- Linking horizontally partitioned datasets carries
little risk of re-identification, because in most
cases there is no more information about a record
in the combined dataset than was present in the
individual datasets.
- For vertically partitioned datasets, it is
necessary to link individual subjects' records
that are contained in two or more datasets. This
process is risky because the combined dataset
contains more information about each subject than
either of the components.
- Preferred approach: methods based on cryptography
(complex and may involve a third party)
- More common approach: remove identifiers and
carry out statistical disclosure limitation prior
to linkage (may introduce errors into the linked
dataset that alter results of statistical
analyses)
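
As one simple illustration of a cryptography-based approach to linking vertically partitioned data, the sketch below replaces direct identifiers with a keyed hash (HMAC-SHA256) before the datasets are brought together, so the linking party sees only pseudonyms. The field names, the shared secret, and the choice of surname plus birth date as the linkage key are all hypothetical. Real privacy-preserving linkage protocols are considerably more involved; this sketch handles exact matches only and does not remove the re-identification risk carried by the combined attributes.

```python
import hashlib
import hmac


def pseudonym(secret_key: bytes, surname: str, birth_date: str) -> str:
    """Derive a keyed pseudonym from normalized identifiers."""
    normalized = f"{surname.strip().lower()}|{birth_date.strip()}"
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare(records, secret_key):
    """Each data holder strips identifiers and keeps only pseudonym -> payload."""
    return {
        pseudonym(secret_key, r["surname"], r["birth_date"]): r["payload"]
        for r in records
    }


def link(prepared_a, prepared_b):
    """The linking party joins on pseudonyms without seeing names or dates."""
    shared = prepared_a.keys() & prepared_b.keys()
    return [{**prepared_a[k], **prepared_b[k]} for k in sorted(shared)]


# Hypothetical data held by two different organizations
registry = [
    {"surname": "Smith ", "birth_date": "1950-03-01", "payload": {"stage": "II"}},
    {"surname": "Jones", "birth_date": "1962-07-15", "payload": {"stage": "I"}},
]
claims = [
    {"surname": "smith", "birth_date": "1950-03-01", "payload": {"cost": 1200}},
]

key = b"hypothetical-shared-secret"  # agreed between data holders, withheld from the linker
linked = link(prepare(registry, key), prepare(claims, key))
```

Here `linked` contains a single combined record with both `stage` and `cost`, because only the Smith record appears in both sources. Using a keyed hash rather than a plain hash matters: without the key, a party holding a list of candidate names and birth dates could recompute the hashes and reverse the pseudonyms.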
13Technical Aspects: Procedural issues in linking
data sets (cont.)
- Many linkage techniques depend on the presence of
attributes in both databases that are unique to
individuals but do not lead to re-identification.
- Linkage can reduce data quality.
- No matter how linkage is performed, other issues
should be addressed:
- comparable attributes should be expressed in the
same units of measure
- conflicting values of attributes for each
individual common to both databases should be
reconciled
- records that appear in only one database should
be managed (most commonly they are dropped)
- the effect of linkage on data quality should be
considered
- Some risks from data linkage cannot be removed.
Strong consideration should be given to forms of
data protection such as licensing and restricted
access.
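
The post-linkage issues listed above can be made concrete with a small sketch. The patient ids, the weight fields, the tolerance, and the rule of averaging near-identical values are hypothetical choices made for illustration; the source does not prescribe any particular reconciliation rule.

```python
LB_PER_KG = 2.20462


def harmonize_and_merge(registry, claims, tolerance_kg=1.0):
    """Convert units, flag conflicting values, and drop unmatched records."""
    merged, conflicts, dropped = {}, [], []
    for pid in sorted(registry.keys() | claims.keys()):
        # Records present in only one database are dropped (and counted).
        if pid not in registry or pid not in claims:
            dropped.append(pid)
            continue
        # Express both weights in the same unit before comparing them.
        reg_kg = registry[pid]["weight_kg"]
        claim_kg = claims[pid]["weight_lb"] / LB_PER_KG
        if abs(reg_kg - claim_kg) > tolerance_kg:
            # Conflict: leave the value empty and flag it for manual review.
            conflicts.append(pid)
            weight = None
        else:
            # Assumed rule: average values that agree within the tolerance.
            weight = round((reg_kg + claim_kg) / 2, 1)
        merged[pid] = {"weight_kg": weight}
    return merged, conflicts, dropped


# Hypothetical linked inputs keyed by a shared patient id
registry = {
    "p1": {"weight_kg": 70.0},
    "p2": {"weight_kg": 82.0},
    "p3": {"weight_kg": 60.0},
}
claims = {
    "p1": {"weight_lb": 154.0},
    "p2": {"weight_lb": 150.0},
    "p4": {"weight_lb": 200.0},
}

merged, conflicts, dropped = harmonize_and_merge(registry, claims)
```

Returning the conflict and dropped lists alongside the merged data gives a rough measure of how much linkage has affected data quality, since a large share of dropped or conflicting records may bias later analyses.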
14Risk Mitigation: Methodology for mitigating the
risk of re-identification
- Basic methodology for statistical disclosure
limitation
- Disclosure limiting masks are transformations
of the data where there is a specific functional
relationship between masked values and original
data.
- Can be categorized as suppressions (e.g., cell
suppression), re-codings (e.g., collapsing rows
or columns, or swapping), or samplings (e.g.,
releasing subsets).
- The Risk-Utility tradeoff
- Risk of disclosure is balanced with the utility
of the released data.
- Privacy-preserving data mining methodologies
- Cryptographic approaches to privacy protection
- Differential privacy focuses on algorithmic
aspects of the problem with an emphasis on
automation and scalability of a process for
conferring anonymity
- Limits the information a data user might learn
beyond that known before exposure to the released
statistics
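
Three of the ideas on this slide can be sketched briefly: recoding (collapsing exact ages into bands), cell suppression (hiding small counts), and a noise-adding release in the spirit of differential privacy (the Laplace mechanism for a single count). The records, the band width, the suppression threshold, and the privacy parameter `epsilon` are hypothetical. The suppression shown is primary suppression only; it does not add the complementary suppressions needed when row or column totals are also published.

```python
import random
from collections import Counter


def recode_age(age, width=10):
    """Recoding: collapse an exact age into a band such as '50-59'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppressed_counts(records, threshold=3):
    """Cell suppression: replace any cell count below the threshold with None."""
    counts = Counter((recode_age(r["age"]), r["diagnosis"]) for r in records)
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


def laplace_count(true_count, epsilon, rng):
    """Laplace mechanism for a count query, whose sensitivity is 1.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    scale = 1.0 / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Hypothetical registry extract
records = [
    {"age": 51, "diagnosis": "A"},
    {"age": 54, "diagnosis": "A"},
    {"age": 58, "diagnosis": "A"},
    {"age": 52, "diagnosis": "B"},
    {"age": 63, "diagnosis": "A"},
    {"age": 67, "diagnosis": "A"},
]

table = suppressed_counts(records)
noisy_total = laplace_count(len(records), epsilon=1.0, rng=random.Random(0))
```

The risk-utility tradeoff is visible in both knobs: a higher suppression threshold or a smaller `epsilon` lowers disclosure risk but leaves the analyst with fewer or noisier cells.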
15Risk Mitigation: Security practices, standards
and technologies
- Philosophies regarding the preservation of
confidentiality associated with individual-level
data
- Restricted or limited information, with
restrictions on the amount or format of the data
released
- Restricted or limited access, with restrictions
on the access to the information itself.
- Accountability
- Ensure that researchers are accountable for the
use of datasets (e.g., best practices, unique
logins, user authentication, audit trails)
- Registries as data enclaves
- Research data centers where users can access
and use data in a regulated environment
- Layered restricted access to databases
- A form of layered restrictions that combines two
approaches with differing levels of access at
different levels of detail in the data
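
Layered restricted access combined with an audit trail might look like the following sketch. The tier names, the fields visible at each tier, and the log format are all hypothetical, and authentication of the user is assumed to have happened elsewhere.

```python
from datetime import datetime, timezone

AUDIT_LOG = []  # in practice this would be an append-only, protected store

# Hypothetical access tiers: more restricted tiers see more detailed fields
TIER_FIELDS = {
    "public": {"age_band", "region"},
    "licensed": {"age_band", "region", "diagnosis"},
    "enclave": {"age_band", "region", "diagnosis", "admission_date"},
}


def fetch(user_id, tier, record):
    """Return only the fields allowed at this tier and log who asked."""
    if tier not in TIER_FIELDS:
        raise PermissionError(f"unknown access tier: {tier}")
    AUDIT_LOG.append(
        {
            "user": user_id,
            "tier": tier,
            "at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return {k: v for k, v in record.items() if k in TIER_FIELDS[tier]}


# Hypothetical record and request
record = {
    "age_band": "50-59",
    "region": "NE",
    "diagnosis": "A",
    "admission_date": "2009-04-12",
}
public_view = fetch("analyst-17", "public", record)
```

Here `public_view` contains only `age_band` and `region`, while the same call at the `enclave` tier would return every field. Each request, whatever the tier, leaves an entry tied to a unique user id, which is what makes researchers accountable for their use of the data.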
16Legal Considerations
- Critical starting point: nature of the research
undertaking
- Health care operations?
- HIPAA Privacy and Security Rules
- Health care quality-related activities
- Public health practice?
- Research within the meaning of the Common Rule?
- creation of generalizable knowledge
- Some combination of the three?
17Do HIPAA Privacy and Security Rules Apply? And if
so, What are the Issues?
- Are the data PHI, and is the source a covered
entity? If so, then the HIPAA privacy and security
standards apply.
- Is the data source a covered entity?
- (ARRA expands coverage to include business associates)
- De-identification and re-identification of data
- Data use agreements for limited data sets
- Security obligations for ePHI
18Do the Data and Data Source Raise Other Legal
Obligations?
- Do patients and the custodial institutions from
whom the data are secured have other legal rights
and interests that create legal obligations?
- E.g., more stringent state privacy laws
- Were confidentiality expectations created?
- Institutional privacy expectations
- Special federal or state standards applicable to
substance abuse or mental illness information
19Summary
- This white paper describes technical and legal
considerations for researchers interested in
creating data linkage projects involving registry
data, and presents typical linkage methods. It
also discusses both the re-identification
hazards created by data linkage projects and
the statistical methods used to minimize the
risk of re-identification.
- Some limitations of this discussion are the
exclusion of
- considerations about linking data from public and
private sectors, where different ethical and
legal restrictions may apply, and
- detailed information about the risks involved
with identifying the health care providers that
collect and provide data.
- Dataset linkage entails two risks: loss of
reliable, confidential data management, and
identification or re-identification of
individuals and institutions. Recognized and
developing statistical methods and secure
computation may limit these risks and allow
registries linked to other datasets to deliver
their potential public health benefits.