Title: Grigorios Loukides, PhD (grigorios.loukides_at_vanderbilt.edu)
1Anonymization of Biomedical Data: Protection Principles and Techniques
- Grigorios Loukides, PhD (grigorios.loukides_at_vanderbilt.edu)
30 Mar. 2009, Vanderbilt University
2Lecture Organization
- Data Sharing Beliefs and Dogmas
- Privacy Principles
- Privacy Protection Methods
3What Can We Do?
- Some say, "You Can't Release Any Data"
[Figure labels: Data Holder, Recipient; Distortion, anonymity; Accuracy, quality, risk]
Slide from L. Sweeney
4What Can We Do?
- Others say, "Privacy is Dead, Get Over It"
[Figure labels: Data Holder, Recipient; Distortion, anonymity; Accuracy, quality, risk]
Slide from L. Sweeney
Others: see Larry Ellison (Oracle), Scott McNealy (Sun Micro.). Ex.: http://www.computerworld.com/securitytopics/security/story/0,10801,64729,00.html
5Example
Legal Perspective
Is this true?
Privacy is dead
Slide from B. Malin
- http://law.vanderbilt.edu/article-search/article-detail/index.aspx?nid=161
6What Can We Do?
- We say, "Share Data While Providing Guarantees of Anonymity"
Computational solutions
Holder
Recipient
Slide from L. Sweeney
7Hide Personal Identity
- Safe to release this data?
hospital discharge summary
So, care needs to be taken!
Voter List
8Central Dogma of Re-identification
Sensitive Attribute (SA)
Quasi-identifiers (QIDs)
Quasi-identifiers (QIDs)
hospital discharge summary
Voter List
Uniqueness
Uniqueness
Linkage model
It is estimated that 87% of US citizens are likely to be uniquely identified based on Date of Birth, Sex, and 5-digit Zip-code [Sweeney 00]
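The linkage model above can be illustrated with a small sketch. All records, names and field names below are invented for illustration; the only idea taken from the slide is that a released table and an identified list are joined on shared QIDs, and a subject is re-identified when the QID combination is unique.

```python
# Hypothetical data: a "de-identified" discharge table and a public voter list.
discharge = [
    {"dob": "1990-03-12", "sex": "M", "zip": "37203", "diagnosis": "asthma"},
    {"dob": "1985-07-01", "sex": "F", "zip": "37212", "diagnosis": "diabetes"},
    {"dob": "1985-07-01", "sex": "F", "zip": "37212", "diagnosis": "flu"},
]
voters = [
    {"name": "George", "dob": "1990-03-12", "sex": "M", "zip": "37203"},
    {"name": "Alice", "dob": "1985-07-01", "sex": "F", "zip": "37212"},
]
QIDS = ("dob", "sex", "zip")  # quasi-identifiers shared by both tables


def qid_key(record):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[q] for q in QIDS)


def link(released, identified):
    """Return {name: diagnosis} for people matching exactly one released record."""
    by_key = {}
    for record in released:
        by_key.setdefault(qid_key(record), []).append(record)
    hits = {}
    for person in identified:
        matches = by_key.get(qid_key(person), [])
        if len(matches) == 1:  # unique QID combination: linkage succeeds
            hits[person["name"]] = matches[0]["diagnosis"]
    return hits


reidentified = link(discharge, voters)
print(reidentified)
```

In this toy input George is linked because his QID combination is unique, while Alice is not, because two released records share her QIDs.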
9Lecture Organization
- Data Sharing Beliefs and Dogmas
- Privacy Principles
- Privacy Protection Methods
10 Types of Privacy Disclosure
Identity Disclosure
My neighbor George is 18 and went to hospital A,
hence
discharge summary of hospital A
Attribute Disclosure
11How to prevent Identity Disclosure
Uniqueness
Uniqueness
Linkage Model
Make QIDs Non-unique
Break linkage
Make Data Non-unique
Slide adapted from B. Malin
12How to prevent Identity Disclosure
Any (Age, Gender, Zip-code) combination
Uniqueness
Uniqueness
Linkage Model
Make QIDs Non-unique
Break linkage
Make Data Non-unique
Slide adapted from B. Malin
13The Bigger Picture
g: Re-identification Method
d': Knowledge Discovery from Protected Data
f: Anonymity Protection Method
Statistics / Patterns Learned
Protected Data
Original Data
d: Knowledge Discovery from Unprotected Data
- Minimize the inversion of f or discovery of g.
- Minimize the difference between d and d'.
Slide from B. Malin
14Principles against Identity Disclosure
- Main idea
- Map each tuple to at least k-1 others w.r.t. QIDs
- Linkage is weakened: the probability of success becomes at most 1/k
- We will discuss 2 privacy principles
- k-Map
- k-Anonymity
15k-Map [Sweeney 01]
- P: Population
- V: a view of P
- k-Map is satisfied when each tuple in V maps to at least k-1 other subjects in P w.r.t. QIDs
Population (P)
A view (V) that satisfies 2-map
16Formal definition of k-Map
- $T$: Set of tuples $\{t_1, \dots, t_n\}$
- $t_i = [a_1, \dots, a_m]$: values in QIDs $A_1, \dots, A_m$
- $S$: Subjects that the tuples correspond to
- $P$: Subjects from a larger population
- $f(t) = t'$ is a protection function
- $\omega(t)$: omniscient function that maps tuples back to their subjects
- $TRP$ is a relation between tuples and population
- k-map is satisfied when
- $|\{s : \langle t', s \rangle \in TRP\}| \geq k$ (each tuple in S matches to at least k tuples in P)
- $\omega(t) = \omega(t')$ (tuples $t$ and $t'$ refer to the same subject)
- ($\omega(t)$ guaranteed to be in S)
Slide adapted from B. Malin
17From k-Map to k-Anonymity
- k-Map assumes knowledge of the population P. Is this always a valid assumption?
- A hospital only has data for its own patients (a view V of P)
- Observation: each tuple t lies in P ∩ V. Thus, mapping t to k-1 other tuples in V w.r.t. QIDs guarantees that t is mapped to at least k-1 other tuples in P w.r.t. QIDs.
A view (V) that satisfies 2-map
Population (P)
18k-Anonymity [Sweeney 02]
- S: Subjects in dataset
- k-Anonymity is satisfied when every tuple in S maps to at least k-1 other subjects in S w.r.t. QIDs
Satisfies 2-anonymity (each tuple maps to another w.r.t. Age and Zip-code). Does it satisfy 2-map?
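A minimal sketch of a k-anonymity check, assuming a table is a list of dictionaries and that the QID names (`age`, `zip`) and all values are hypothetical. It groups tuples by their QID values and verifies that every group has at least k members; the largest linkage probability is then 1 divided by the smallest group size.

```python
from collections import Counter


def group_sizes(rows, qids):
    """Count how many tuples share each combination of QID values."""
    return Counter(tuple(row[q] for q in qids) for row in rows)


def is_k_anonymous(rows, qids, k):
    """True when every tuple shares its QID values with at least k-1 others."""
    return all(size >= k for size in group_sizes(rows, qids).values())


def max_link_probability(rows, qids):
    """Worst-case chance of linking a subject to a single tuple."""
    return 1 / min(group_sizes(rows, qids).values())


# Hypothetical table with already generalized QID values.
table = [
    {"age": "[18-25]", "zip": "372**", "disease": "flu"},
    {"age": "[18-25]", "zip": "372**", "disease": "asthma"},
    {"age": "[26-40]", "zip": "371**", "disease": "flu"},
    {"age": "[26-40]", "zip": "371**", "disease": "diabetes"},
]

print(is_k_anonymous(table, ("age", "zip"), 2))
print(is_k_anonymous(table, ("age", "zip"), 3))
print(max_link_probability(table, ("age", "zip")))
```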
19Relation between k-Map and k-Anonymity
- A k-anonymous table satisfies k-map. A table that satisfies k-map may not be k-anonymous.
- k-Map offers the same protection from identity disclosure as k-anonymity, but requires less data transformation
A view that satisfies 2-map but not 2-anonymity
Population (P)
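The difference between the two principles can be sketched in code. The population and view below are invented; the sketch assumes the data holder actually knows the population's QID values, which is exactly the assumption k-map relies on.

```python
from collections import Counter


def qid_counts(rows, qids):
    """Number of rows per combination of QID values."""
    return Counter(tuple(row[q] for q in qids) for row in rows)


def satisfies_k_map(view, population, qids, k):
    """Each tuple in the view must match at least k subjects in the population."""
    counts = qid_counts(population, qids)
    return all(counts[tuple(t[q] for q in qids)] >= k for t in view)


def is_k_anonymous(view, qids, k):
    """Each tuple in the view must match at least k tuples in the view itself."""
    return all(size >= k for size in qid_counts(view, qids).values())


QIDS = ("age", "zip")
# Hypothetical population: two subjects per QID combination.
population = [
    {"age": 18, "zip": "37203"},
    {"age": 18, "zip": "37203"},
    {"age": 30, "zip": "37212"},
    {"age": 30, "zip": "37212"},
]
# The view holds one subject from each combination.
view = [population[0], population[2]]

print(satisfies_k_map(view, population, QIDS, 2))
print(is_k_anonymous(view, QIDS, 2))
```

Here the view satisfies 2-map without any transformation, yet it is not 2-anonymous, because each of its tuples is unique within the view.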
20How to prevent Attribute Disclosure
- An attacker knows QIDs and attempts to guess disease information.
- k-Anonymity may not prevent attribute disclosure
My neighbor George is 18 and went to hospital A,
hence
G1
G2
21Principles against Attribute Disclosure
- Main idea
- Make tuples diverse w.r.t. SA
- Correlation between QIDs and SA should be weakened
- We will discuss 2 privacy principles
- l-diversity
- tuple-diversity
My neighbor George is 18 and went to hospital A,
hence
G1
G2
22l-Diversity [Machanavajjhala 07]
- A group containing tuples with the same values w.r.t. QIDs is l-diverse when it contains at least l "well represented" SA values.
- A table is l-diverse when all groups are l-diverse
A 2-diverse table
G1
G2
23l-Diversity [Machanavajjhala 07]
- What does "well represented" mean?
- distinct l-diversity: "well represented" means distinct
- (c,l)-diversity: Let $r_i$ denote the number of times the $i$-th most frequent SA value appears in a group that contains $m$ distinct SA values. Given a constant $c$, a group is (c,l)-diverse when $r_1 < c\,(r_l + r_{l+1} + \dots + r_m)$ (the most frequent SA value should not appear too often)
Table satisfies distinct 2-diversity. In G2, we have $r_1 = 2$, $r_2 = 1$, $r_3 = 1$ and $2 < 2(1+1)$, hence G2 is (2,2)-diverse. Is G1 (2,2)-diverse?
G1
G2
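Both notions of "well represented" can be checked mechanically. The sketch below assumes the recursive (c,l)-diversity condition of [Machanavajjhala 07], r_1 < c (r_l + ... + r_m), where r_i is the count of the i-th most frequent SA value in a group. The disease values are hypothetical; the first group only reproduces the counts 2, 1, 1 quoted for G2, and the second is an invented skewed group, not the slide's G1.

```python
from collections import Counter


def distinct_l_diverse(sa_values, l):
    """A group is distinct l-diverse when it holds at least l distinct SA values."""
    return len(set(sa_values)) >= l


def c_l_diverse(sa_values, c, l):
    """Recursive (c,l)-diversity: r_1 < c * (r_l + r_{l+1} + ... + r_m)."""
    r = sorted(Counter(sa_values).values(), reverse=True)  # r[0] is r_1
    if len(r) < l:
        return False  # fewer than l distinct values: the sum would be empty
    return r[0] < c * sum(r[l - 1:])


# Counts 2, 1, 1 as in the G2 example: 2 < 2 * (1 + 1).
group_211 = ["flu", "flu", "asthma", "diabetes"]
# Hypothetical skewed group with counts 3, 1: 3 < 2 * 1 does not hold.
group_31 = ["flu", "flu", "flu", "asthma"]

print(distinct_l_diverse(group_211, 2), c_l_diverse(group_211, c=2, l=2))
print(distinct_l_diverse(group_31, 2), c_l_diverse(group_31, c=2, l=2))
```

The skewed group shows why the distinct variant is weaker: it has two distinct values but one of them dominates, so it fails the (2,2) test.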
24From l-Diversity to tuple-Diversity
- l-diversity controls the probability of inferring an SA value given that we know QIDs.
- For numerical attributes it suffices to estimate an SA value
- Estimation may be accurate even when inference probability is small
G1
My neighbour George is 18 and went to hospital A,
hence his income 20K
G2
25tuple-Diversity [Loukides 08]
- Observation: SA values that fall into a narrow range may be easily estimated (less privacy).
- Given a set of values V w.r.t. an SA with domain D, and a function r() that returns the range of values in V, tuple-diversity (td) is defined in terms of r(V) and D
- Narrow range → lower td → less privacy
G1
G2
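The effect of a narrow range can be sketched numerically. The exact tuple-diversity formula is not recoverable from this slide, so the code assumes a normalised range, td(V) = r(V) / r(D) with r() = max - min; this formula, the function names and the income values are all assumptions made for illustration.

```python
def value_range(values):
    """r(): the range (max - min) of a collection of numerical values."""
    return max(values) - min(values)


def tuple_diversity(group_values, domain_values):
    """Assumed measure: range of the group's SA values relative to the domain's range."""
    return value_range(group_values) / value_range(domain_values)


# Hypothetical incomes (in thousands) for the whole table and for two groups.
domain = [10, 20, 21, 22, 60, 90]
narrow_group = [20, 21, 22]   # incomes cluster around 20K: easy to estimate
wide_group = [10, 60, 90]     # incomes spread over the whole domain

print(tuple_diversity(narrow_group, domain))
print(tuple_diversity(wide_group, domain))
```

Under this assumed measure the narrow group scores close to 0 even though it holds three distinct values, which is the case that distinct l-diversity alone would not flag.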
26Lecture Organization
- Data Sharing Beliefs and Dogmas
- Privacy Principles
- Privacy Protection Methods
27Privacy Protection Methods
What is a good anonymization?
How to derive a good anonymization?
28Value recoding
- Transform QID values so that each tuple has the same values as at least k-1 others w.r.t. QIDs.
- There are many ways to transform QID values, e.g. Zip-code = {53710, 53715} or Zip-code = "Madison, WI"
- Organize QID values into hierarchies that specify acceptable transformations
Hierarchy for Zip-code
Hierarchy for Gender
Graphs from LeFevre 05
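Value recoding along a hierarchy can be sketched as follows. The two hierarchies here are assumptions, since the slide's graphs are not available: Zip-code is generalized by masking trailing digits, and Gender has a single generalized value `Person`. The rows are invented.

```python
def generalize_zip(zip_code, level):
    """Mask the last `level` digits, e.g. level 1 turns 53715 into 5371*."""
    keep = len(zip_code) - level
    return zip_code[:keep] + "*" * level


def generalize_gender(gender, level):
    """Level 0 keeps the value; level 1 maps everything to 'Person'."""
    return gender if level == 0 else "Person"


def recode(rows, zip_level, gender_level):
    """Apply the same generalization level to every tuple (full-domain recoding)."""
    return [
        {
            **row,
            "zip": generalize_zip(row["zip"], zip_level),
            "gender": generalize_gender(row["gender"], gender_level),
        }
        for row in rows
    ]


# Hypothetical original tuples.
rows = [
    {"zip": "53715", "gender": "Male", "disease": "flu"},
    {"zip": "53710", "gender": "Male", "disease": "asthma"},
    {"zip": "53706", "gender": "Female", "disease": "flu"},
    {"zip": "53703", "gender": "Female", "disease": "diabetes"},
]

recoded = recode(rows, zip_level=1, gender_level=0)
for row in recoded:
    print(row)
```

With one masked digit the four tuples fall into two groups of two, so this toy table becomes 2-anonymous w.r.t. Zip-code and Gender without touching Gender.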
29Optimization Objective
d': Knowledge Discovery from Protected Data
Statistics / Patterns Learned
f: Anonymity Protection Method
Protected Data
Original Data
d: Knowledge Discovery from Unprotected Data
Minimize the difference between d and d'
Transform QID values as little as possible to make them k-anonymous
Slide adapted from B. Malin
30Search strategies
- Acceptable transformations for all QIDs may form a lattice
- A node is k-anonymous when applying the transformation it implies creates a k-anonymous table
- Lower nodes imply less transformation
- Search for the k-anonymous node with the lowest height
Graphs from LeFevre 05
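The search described above can be sketched as an exhaustive, height-by-height scan of the lattice. A node is a vector of generalization levels, one per QID, and its height is the sum of the levels. The hierarchies (digit masking for Zip-code, a single `Person` level for Gender) and the rows are hypothetical, and this brute-force scan is only practical for tiny lattices.

```python
from collections import Counter
from itertools import product

MAX_LEVEL = {"gender": 1, "zip": 2}  # assumed hierarchy heights


def generalize(row, levels):
    """Apply the node's generalization levels to one tuple's QIDs."""
    zip_code = row["zip"]
    keep = len(zip_code) - levels["zip"]
    return (
        row["gender"] if levels["gender"] == 0 else "Person",
        zip_code[:keep] + "*" * levels["zip"],
    )


def is_k_anonymous(rows, levels, k):
    counts = Counter(generalize(row, levels) for row in rows)
    return min(counts.values()) >= k


def lowest_k_anonymous_nodes(rows, k):
    """Scan the lattice by increasing height; return the first k-anonymous level."""
    qids = sorted(MAX_LEVEL)
    nodes = [
        dict(zip(qids, levels))
        for levels in product(*(range(MAX_LEVEL[q] + 1) for q in qids))
    ]
    for height in range(sum(MAX_LEVEL.values()) + 1):
        found = [
            n for n in nodes
            if sum(n.values()) == height and is_k_anonymous(rows, n, k)
        ]
        if found:
            return height, found
    return None, []


rows = [
    {"zip": "53715", "gender": "Male"},
    {"zip": "53710", "gender": "Male"},
    {"zip": "53706", "gender": "Female"},
    {"zip": "53703", "gender": "Female"},
]

print(lowest_k_anonymous_nodes(rows, 2))
print(lowest_k_anonymous_nodes(rows, 4))
```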
31k-Anonymity Hardness
- What is an optimal transformation?
- Notation
- a QID
- the value of a tuple w.r.t. that QID
- the height of the hierarchy for that QID (e.g. 2 for Gender's hierarchy)
- the level of the hierarchy at which the value lies after anonymization
- Cost of transforming a value (a low level implies less cost)
- Total cost of transforming all the values for all QIDs of a table
- Is it easy to minimize this cost?
Hierarchy for Gender
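One way to make the cost concrete is sketched below. The slide's formulas did not survive extraction, so the code assumes that the cost of a value is its level after anonymization divided by the height of its hierarchy, and that the total cost is the sum over all tuples and QIDs. The height of 2 for Gender follows the slide's example; the Zip-code height and the levels per tuple are hypothetical.

```python
HEIGHTS = {"gender": 2, "zip": 4}  # zip height is an assumption


def value_cost(level, height):
    """Assumed cost of one value: 0 when untouched, 1 when fully generalized."""
    return level / height


def total_cost(levels_per_tuple):
    """Sum the value costs over every tuple and every QID."""
    return sum(
        value_cost(level, HEIGHTS[qid])
        for levels in levels_per_tuple
        for qid, level in levels.items()
    )


# Hypothetical outcome: every tuple has Gender at level 1 and Zip-code at level 1.
anonymized_levels = [{"gender": 1, "zip": 1} for _ in range(4)]
untouched_levels = [{"gender": 0, "zip": 0} for _ in range(4)]

print(total_cost(anonymized_levels))
print(total_cost(untouched_levels))
```

Dividing by the height keeps QIDs with tall hierarchies from dominating the total, which is a design choice of this sketch rather than something stated on the slide.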
32k-Anonymity Hardness
- The problem of optimally k-anonymizing a table is NP-hard, even when attributes take only 3 distinct values [Aggarwal 05]
- This is shown by reducing EDGE PARTITION INTO TRIANGLES, a known NP-hard problem, to it
- Given a graph G = (V, E) with |E| = 3m for some integer m, can the edges of G be partitioned into m edge-disjoint triangles?
- We need heuristics to solve it
- Search some of the transformations to find a sub-optimal solution
33Search strategies
- Observations
- When a node Vi at height level i is k-anonymous, any node Vj that generalizes it (j > i) will be k-anonymous as well.
- If a node Vi is k-anonymous w.r.t. a set of QIDs Q, Vi must be k-anonymous w.r.t. any subset of Q
Graphs from LeFevre 05
34Incognito Algorithm [LeFevre 05]
- For nodes corresponding to increasingly larger sets of QIDs
- breadth-first search in the lattice
- keep k-anonymous nodes only
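The steps above can be sketched in a simplified form. This is not the implementation from [LeFevre 05]: it only illustrates the two pruning ideas, namely skipping a node when one of its projections onto a smaller QID set is not k-anonymous, and accepting a node without a table scan when a node directly below it is already k-anonymous. The hierarchies and rows are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

# Hypothetical hierarchies: (generalization function, maximum level).
GEN = {
    "zip": (lambda z, lvl: z[:len(z) - lvl] + "*" * lvl, 2),
    "gender": (lambda g, lvl: g if lvl == 0 else "Person", 1),
}


def is_k_anonymous(rows, node, k):
    """node maps each QID of the current subset to a generalization level."""
    counts = Counter(
        tuple(GEN[q][0](row[q], lvl) for q, lvl in node.items()) for row in rows
    )
    return min(counts.values()) >= k


def specializations(levels):
    """Nodes one step below `levels` in the lattice."""
    for i, lvl in enumerate(levels):
        if lvl > 0:
            yield levels[:i] + (lvl - 1,) + levels[i + 1:]


def incognito_sketch(rows, k):
    qids = tuple(sorted(GEN))
    anonymous = {}  # QID subset -> set of k-anonymous level tuples
    checks = 0      # number of actual table scans
    for size in range(1, len(qids) + 1):
        for subset in combinations(qids, size):
            lattice = product(*(range(GEN[q][1] + 1) for q in subset))
            good = set()
            for levels in sorted(lattice, key=sum):  # lowest height first
                node = dict(zip(subset, levels))
                # Prune: some projection onto a smaller QID set is not k-anonymous.
                if size > 1 and any(
                    tuple(node[q] for q in sub) not in anonymous[sub]
                    for sub in combinations(subset, size - 1)
                ):
                    continue
                # Infer: a node directly below is k-anonymous, so this one is too.
                if any(s in good for s in specializations(levels)):
                    good.add(levels)
                    continue
                checks += 1
                if is_k_anonymous(rows, node, k):
                    good.add(levels)
            anonymous[subset] = good
    return anonymous, checks


rows = [
    {"zip": "53715", "gender": "Male"},
    {"zip": "53710", "gender": "Male"},
    {"zip": "53706", "gender": "Female"},
    {"zip": "53703", "gender": "Female"},
]
result, checks = incognito_sketch(rows, 2)
print(result[("gender", "zip")], checks)
```

On this toy input only 4 of the 11 nodes across the three lattices require a scan of the table; the rest are either pruned or inferred.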
35Incognito can be used to prevent Attribute Disclosure
- For nodes corresponding to increasingly larger sets of QIDs
- breadth-first search in the lattice
- keep k-anonymous nodes that additionally satisfy l-diversity or tuple-diversity
36Experimental results
- Incognito that searches for
- k-anonymity
- k-anonymity and l-diversity
- k-anonymity and tuple-diversity
- has been applied to US census data
- Evaluation of
- the amount of data transformation
- the level of protection from attribute disclosure
37Experimental results
- Amount of data transformation
tuple-diversity
UM [Loukides 09] measures the uncertainty in finding a QID value: 0 = original QIDs are preserved, 1 = each QID value is replaced by ...
The height of the node in the lattice that corresponds to the resultant table [Machanavajjhala 07]
38Experimental results
- Level of protection from attribute disclosure
- Tables, groups, tuples (individuals) that do not satisfy distinct 2-diversity [Machanavajjhala 07]
- They are susceptible to attribute disclosure when only k-anonymity applies
39- So far, we have considered demographics as QIDs and sensitive information modeled as a simple SA (e.g. disease or income)
- Are there any alternatives?
40References
- [Sweeney 00] L. Sweeney. Uniqueness of Simple Demographics in the U.S. Population, LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000.
- [Sweeney 01] L. Sweeney, H. Abelson. Computational disclosure control: a primer on data privacy protection, 2001.
- [Sweeney 02] L. Sweeney. K-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.
- [Machanavajjhala 07] A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam. L-diversity: Privacy beyond k-anonymity. TKDD 1(1), 2007.
- [Loukides 08] G. Loukides, J. Shao. An Efficient Clustering Algorithm for k-Anonymisation. J. Comput. Sci. Technol. 23(2):188-202, 2008.
- [LeFevre 05] K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. SIGMOD Conference 2005: 49-60.
- [Loukides 09] G. Loukides, A. Tziatzios, J. Shao. Towards Preference-Constrained k-Anonymisation. To appear in the Proc. of the DASFAA International Workshop on Privacy-Preserving Data Analysis (PPDA), 2009.
Questions???