1
Global Disclosure Risk for Microdata with
Continuous Attributes
  • Traian Marius Truta
  • Northern Kentucky University

2
HIPAA Privacy Rule
  • The Health Insurance Portability and
    Accountability Act (1996)
  • The Privacy Rule protects the privacy of
    individually identifiable health information by
    establishing conditions for its use and
    disclosure
  • Privacy Rule effective date: 14 April 2003
  • Defines 18 identifiers that must be removed in
    order to de-identify the data

4
The Identifiers in the Privacy Rule
  • Names
  • Telephone numbers
  • Fax numbers
  • E-mail addresses
  • Social Security numbers
  • Medical record numbers, prescription numbers
  • Health plan beneficiary numbers
  • Account numbers
  • Certificate/license numbers
  • VIN and serial numbers, license plate numbers
  • Device identifiers, serial numbers
  • Web URLs
  • IP addresses
  • Biometric identifiers (fingerprints)
  • Full-face photo images
  • Unique identifying numbers, characteristics, or codes
  • Geographic info (including city, state, and zip)
  • Elements of dates

5
De-identification Process
  • Safe Harbor: remove all 18 defined identifiers
    and have no knowledge that the remaining
    information can identify the individual
  • Statistical de-identification: a statistician
    certifies that there is a very small risk that
    the information could be used to identify the
    individual

9
Disclosure Control Problem
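Diagram (reconstructed from its labels):
  • Individuals submit data, which the data owner
    collects
  • The data owner applies a masking process and
    releases the masked data
  • The masking process must protect the
    confidentiality of individuals (measures of
    disclosure risk) and preserve data utility
    (measures of information loss)
  • The researcher receives the masked data and uses
    it for statistical analysis
  • The intruder receives the masked data and uses it
    together with external data to disclose
    confidential information
  • "This Presentation" marks the part of the diagram
    covered by this talk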
10
General Framework for Microdata
  • I: Identifier Attributes (Name, SSN, etc.)
  • K: Key Attributes (Zip Code, Age, Race, etc.)
  • S: Confidential Attributes (Income, Diagnosis,
    etc.)

11
Disclosure Control Techniques
  • Different disclosure control techniques are
    applied to the following initial microdata

12
Remove Identifiers
  • Identifiers such as Name, SSN, etc. are removed

13
Sampling
  • Sampling is the disclosure control method in
    which only a subset of records is released
  • If n is the number of records in the initial
    microdata and t the number of records released,
    we call sf = t / n the sampling factor
  • Simple random sampling is the technique used most
    often. Each individual is chosen entirely by
    chance, and each member of the population has an
    equal chance of being included in the sample
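
The sampling step can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name, the rounding of t, and the record ids are assumptions made for the example.

```python
import random

def simple_random_sample(records, sf, seed=None):
    """Release a simple random sample with sampling factor sf = t / n."""
    if not 0 < sf <= 1:
        raise ValueError("sampling factor must be in (0, 1]")
    n = len(records)
    t = round(sf * n)              # number of records to release (assumed rounding)
    rng = random.Random(seed)
    # sample() draws without replacement; every record is equally likely.
    return rng.sample(records, t)

initial = list(range(1, 11))       # n = 10 hypothetical record ids
released = simple_random_sample(initial, sf=0.3, seed=0)
print(len(released) / len(initial))  # realised sampling factor t / n
```

Fixing the seed only makes the sketch reproducible; a real release would not publish it.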

14
Microaggregation
  • Order records from the initial microdata by an
    attribute, create groups of consecutive values,
    replace those values by the group average
  • Microaggregation for attribute Income and
    minimum size 3
  • The total sum for all Income values remains the
    same.
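
A minimal sketch of univariate microaggregation, under stated assumptions: groups are cut at a fixed size k from the sorted values and a short tail is merged into the last group so that every group has at least k records. The slide does not say how groups are formed, so this grouping rule and the Income values are hypothetical.

```python
def microaggregate(values, k):
    """Sort values, group consecutive ones (at least k per group),
    and replace each value by its group mean."""
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Cut the sorted order into groups of k; a short tail joins the last group.
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    masked = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean       # original record order is kept
    return masked

income = [30, 90, 45, 60, 75, 50, 120]      # hypothetical Income values
masked_income = microaggregate(income, k=3)
print(sum(income), sum(masked_income))      # totals agree up to rounding
```

Because each group mean times the group size equals the group sum, the total of the attribute is preserved, which is the property stated above.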

15
Global Disclosure Risk Measures
  • Assumptions:
  • The intruder does not know any confidential
    information
  • The intruder knows all the key and identifier
    values for the population
  • Objectives:
  • DR measures for specific DC methods (Remove
    Identifiers, Sampling, Microaggregation, etc.)
  • DR measures for any combination of DC methods
  • Proposed measures:
  • $DR_{min} \le DR_W \le DR_{max}$

16
Notations for IM and IMM
  • $n$: the number of entities in the population.
  • $F$: the number of clusters with the same values
    for key attributes.
  • $A_k$: the set of elements from the k-th cluster,
    for all $k$, $1 \le k \le F$.
  • $F_i = |\{A_k : |A_k| = i,\ k = 1, \ldots, F\}|$,
    for all $i$, $1 \le i \le n$. $F_i$ is the number of
    clusters of length $i$.
  • $n_i = |\{x \in A_k : |A_k| = i,\ k = 1, \ldots, F\}|$,
    for all $i$, $1 \le i \le n$. $n_i$ is the number
    of records in clusters of length $i$.
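
The notation above can be made concrete with a short sketch. The function name and the (Age, Sex) key values are hypothetical; only the definitions of F, F_i and n_i come from the slide.

```python
from collections import Counter

def cluster_size_profile(key_values):
    """Return (F, F_i, n_i) for a list of key-attribute tuples."""
    cluster_sizes = Counter(key_values)        # |A_k| for each cluster
    F = len(cluster_sizes)                     # number of clusters
    F_i = Counter(cluster_sizes.values())      # size i -> number of clusters
    n_i = {i: i * count for i, count in F_i.items()}  # size i -> number of records
    return F, dict(F_i), n_i

# Ten hypothetical records described by the key attributes (Age, Sex).
keys = [(25, "M"), (25, "M"), (30, "F"), (25, "M"), (30, "F"),
        (41, "F"), (52, "M"), (60, "F"), (30, "F"), (41, "F")]
F, F_i, n_i = cluster_size_profile(keys)
print(F, F_i, n_i)
```

Here two clusters have size 3, one has size 2 and two are unique, so n_i always equals i times F_i.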

17
Disclosure Risk Measures for Remove Identifiers
Method
  • 1, 2, 4
  • 3, 5, 9
  • 6, 10
  • 7
  • 8

18
Disclosure Risk Measures for Remove Identifiers
Method
  • $DR_{min}$ - percentage of unique records
  • $DR_{max}$ - considers probabilistic linkage
  • $DR_W$ - weights defined by data owner
  • $w = (w_1, w_2, \ldots, w_n)$: disclosure risk weight
    vector. Properties:
    a) $w_i \in \mathbb{R}$ for all $i = 1, \ldots, n$
    b) $w_i \ge w_j$ for all $i \le j$, $i, j = 1, \ldots, n$
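
The formulas for the three measures did not survive in this transcript. The sketch below therefore rests on assumed definitions that match the descriptions (share of unique records, probabilistic linkage, owner-defined weights): DRmin = F_1 / n, DRmax = (1/n) Σ F_k, and DRW = (1/(w_1 · n)) Σ w_k · F_k. Treat these as assumptions rather than as the author's exact definitions. The key values are hypothetical.

```python
from collections import Counter

def disclosure_risk(key_values, weights):
    """Return (DRmin, DRW, DRmax) under the assumed definitions above."""
    if weights[0] <= 0:
        raise ValueError("w_1 must be positive")
    n = len(key_values)
    # F_i[i] = number of clusters of size i; missing sizes count as 0.
    F_i = Counter(Counter(key_values).values())
    dr_min = F_i[1] / n                         # only unique records count
    dr_max = sum(F_i.values()) / n              # each cluster contributes 1
    # Sizes beyond len(weights) are treated as having weight 0.
    dr_w = sum(w * F_i[i] for i, w in enumerate(weights, start=1)) / (weights[0] * n)
    return dr_min, dr_w, dr_max

keys = [(25, "M"), (25, "M"), (30, "F"), (25, "M"), (30, "F"),
        (41, "F"), (52, "M"), (60, "F"), (30, "F"), (41, "F")]
dr_min, dr_w, dr_max = disclosure_risk(keys, weights=(5, 5, 0))
print(dr_min, dr_w, dr_max)
```

With a non-increasing, non-negative weight vector, each ratio w_k / w_1 lies between 0 and 1, so under these assumed formulas DRW always falls between DRmin and DRmax.
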
19
Disclosure Risk Measures for Remove Identifiers
Method
  • w1 = (5, 5, 0, 0, ..., 0)
  • w2 = (4, 3, 3, 0, ..., 0)

20
Disclosure Risk Measures for RI Method with
Continuous Attribute
  • What if the intruder has only approximations of
    income?
  • w1 = (5, 5, 0, 0, ..., 0)
  • w2 = (4, 3, 3, 0, ..., 0)

21
Disclosure Risk Measures for RI Method with
Continuous Attribute
  • We consider vicinity sets!
  • w1 = (5, 5, 0, 0, ..., 0)
  • w2 = (4, 3, 3, 0, ..., 0)
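
The transcript does not define vicinity sets. One plausible reading, assumed here for illustration only, is that the vicinity set of a record holds every record with the same categorical key values whose continuous value lies within a tolerance of its own. The function name, the tolerance and the records are hypothetical.

```python
def vicinity_sizes(records, tolerance):
    """records: list of (categorical_key, continuous_value) pairs.
    Returns, per record, the size of its assumed vicinity set."""
    sizes = []
    for key, value in records:
        size = sum(
            1
            for other_key, other_value in records
            if other_key == key and abs(other_value - value) <= tolerance
        )
        sizes.append(size)         # a record always belongs to its own vicinity
    return sizes

# Hypothetical (Sex, Income) records.
records = [("M", 41000), ("M", 41500), ("M", 58000), ("F", 41200)]
exact = vicinity_sizes(records, tolerance=0)       # exact matching on Income
approx = vicinity_sizes(records, tolerance=1000)   # intruder knows Income +/- 1000
print(exact, approx)
```

Under this reading, two records that are unique on their exact Income stop being unique once the intruder only knows an approximation.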

22
Notations for Masked Microdata
  • $f$: the number of clusters with the same values
    for key attributes in the masked microdata MM.
  • We cluster all records from MM based on their key
    values. $B_k$: the set of elements from the k-th
    cluster, for all $k$, $1 \le k \le f$.
  • $f_i = |\{B_k : |B_k| = i,\ k = 1, \ldots, f\}|$,
    for all $i$, $1 \le i \le n$. $f_i$ is the number of
    clusters of length $i$.
  • $t_i = |\{x \in B_k : |B_k| = i,\ k = 1, \ldots, f\}|$,
    for all $i$, $1 \le i \le n$. $t_i$ is the number
    of records in clusters of length $i$.
  • $C$: the classification matrix. For all
    $i, j = 1, \ldots, n$,
    $c_{ij} = |\{x : x \in B_k \text{ and } x \in A_p,\ |B_k| = i,\ |A_p| = j,\ k = 1, \ldots, f,\ p = 1, \ldots, F\}|$.
    Each element $c_{ij}$ of $C$ is the number of
    records that appear in clusters of size $i$ in
    the masked microdata and appeared in clusters of
    size $j$ in the initial microdata.

23
Algorithm for Creating Classification Matrix
  • Initialize each element of C with 0.
  • For each element s from the masked microdata MM do:
  • Count the number of occurrences of the key values
    of s in the masked microdata MM. Let i be this
    number.
  • Count the number of occurrences of the key values
    of s in the initial microdata IM. Let j be this
    number.
  • Increment cij by 1.
  • End for.
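
The algorithm above can be sketched directly. The sketch stores C sparsely as a mapping from (i, j) to a count instead of an n × n array, and it assumes masking leaves key values unchanged (as sampling does), so every masked key also occurs in IM. Both choices, and the key values used, are assumptions of this example.

```python
from collections import Counter

def classification_matrix(initial_keys, masked_keys):
    """Return C as a Counter mapping (i, j) -> number of masked records whose
    key values occur i times in MM and j times in IM."""
    im_counts = Counter(initial_keys)    # occurrences of each key in IM
    mm_counts = Counter(masked_keys)     # occurrences of each key in MM
    C = Counter()                        # every entry starts at 0
    for key in masked_keys:              # for each element s of MM
        i = mm_counts[key]
        j = im_counts[key]               # 0 if the key does not occur in IM
        C[(i, j)] += 1
    return C

im = ["a", "a", "a", "b", "b", "c"]      # hypothetical key values in IM
mm = ["a", "a", "b", "c"]                # a sample released as MM
C = classification_matrix(im, mm)
print(dict(C))
```

Precomputing the two counters keeps the loop linear in the size of MM, instead of recounting occurrences for every record.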

24
Disclosure Risk Measures for Microaggregation
Method
  • What if the data is continuous?

25
Disclosure Risk Measures for Microaggregation
Method
Initial Microdata
26
Disclosure Risk Measures for Microaggregation
Method
  • Univariate microaggregation for attribute Age
    with group sizes 2, 4 and 8

Masked Microdata 1
Masked Microdata 2
Masked Microdata 3
27
Disclosure Risk Measures for Microaggregation
Method
28
Disclosure Risk Measures for Microaggregation
Method
Example: disclosure risk values, NO VICINITY!
29
Disclosure Risk Measures for Microaggregation
Method
Example: disclosure risk values, WITH VICINITY!
30
General Disclosure Risk Measures
  • $icf_k$: inversion-change factor for attribute $k$
  • $p$: number of key attributes
  • $v$: binary vector associated with the key attributes

31
Experimental Data
  • Simulated medical record billing data:
  • Age, Sex, Zip and Amount_Billed
  • Three initial microdata:
  • n = 1,000 (called IM1000)
  • n = 5,000 (IM5000)
  • n = 25,000 (IM25000)
  • Key attributes:
  • KA1 = {Age, Sex, Zip}
  • KA2 = {Age, Sex}

32
Results for Sampling and Microaggregation
  • Sampling, followed by microaggregation for Age
    when IM5000 and KA1 are used.

33
Results for Sampling and Microaggregation
  • Sampling and microaggregation for Age when IM5000
    and KA1 are used.

34
Conclusions
  • The data owner may customize the disclosure risk
    measure to better reflect the characteristics of
    the microdata. Privacy requirements may help the
    data owner define the disclosure risk weight
    matrix.
  • Masking key attributes with small vicinity sets
    is important

35
Future Work
  • Our experiments focused on healthcare microdata;
    experiments on other types of data, such as
    financial data, are needed.
  • Study disclosure control for microdata under
    the assumption that the initial microdata is
    frequently updated (Dynamic Disclosure Control)

36
Some Papers
  • Details about DR Measures:
  • Disclosure Risk Measures for Sampling Disclosure
    Control Method, to appear in the Proceedings of
    ACM Symposium on Applied Computing (SAC2004),
    special track on Computer Applications in Health
    Care (COMPAHEC2004), Nicosia, Cyprus
  • Disclosure Risk Measures for Microdata,
    Proceedings of the International Conference on
    Scientific and Statistical Database Management
    (SSDBM2003), Cambridge, MA, pp. 15-22, 2003
  • Information Loss Measures:
  • Privacy and Confidentiality Management for the
    Microaggregation Disclosure Control Method,
    Proceedings of the Workshop on Privacy and
    Electronic Society (WPES2003), in conjunction
    with 10th ACM CCS, Washington DC, pp. 21-30,
    2003
  • Automatic Masked Microdata Generator:
  • Automatic Generation of Masked Microdata, to
    appear in the Acta Universitatis Apulensis, Alba
    Iulia, Romania

37
Acknowledgements
  • Dr. Farshad Fotouhi
  • Dr. Daniel Barth-Jones

38
Questions?