1
Anonymization of Biomedical Data: Protection Principles and Techniques

Grigorios Loukides, PhD (grigorios.loukides_at_vanderbilt.edu)

30 Mar. 2009, Vanderbilt University
2
Lecture Organization

Data Sharing Beliefs and Dogmas
Privacy Principles
Privacy Protection Methods

3
What Can We Do?

Some say, "You Can't Release Any Data"

Distortion, anonymity
Accuracy, quality, risk
Recipient
Data Holder
Slide from L. Sweeney
4
What Can We Do?

Others say, "Privacy is Dead, Get Over It"

Distortion, anonymity
Accuracy, quality, risk
Data Holder
Recipient
Slide from L. Sweeney
Others: see Larry Ellison (Oracle), Scott McNealy (Sun Micro.)
E.g. http://www.computerworld.com/securitytopics/security/story/0,10801,64729,00.html
5
Example
Legal Perspective
Is this true?
Privacy is dead
Slide from B. Malin

http://law.vanderbilt.edu/article-search/article-detail/index.aspx?nid=161

6
What Can We Do?

We say, "Share Data While Providing Guarantees of Anonymity"

Computational solutions
Holder
Recipient
Slide from L. Sweeney
7
Hide Personal Identity

Safe to release this data?

hospital discharge summary
So, care needs to be taken!
Voter List
8
Central Dogma of Re-identification
Sensitive Attribute (SA)
Quasi-identifiers (QIDs)
hospital discharge summary
Voter List
Uniqueness
Linkage model
An estimated 87% of US citizens are likely to be uniquely identified based on date of birth, sex, and 5-digit zip code [Sweeney 00]
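
The linkage model can be illustrated with a minimal Python sketch. All records, names, and field names below are invented for illustration; the function simply joins a de-identified discharge table with a voter list on the QIDs and reports voters whose QID combination matches exactly one discharge record.

```python
# Hypothetical data: every value here is made up for illustration.
discharge = [
    {"dob": "1990-03-12", "sex": "M", "zip": "37212", "diagnosis": "asthma"},
    {"dob": "1985-07-01", "sex": "F", "zip": "37203", "diagnosis": "diabetes"},
    {"dob": "1985-07-01", "sex": "F", "zip": "37203", "diagnosis": "flu"},
]
voters = [
    {"name": "George", "dob": "1990-03-12", "sex": "M", "zip": "37212"},
    {"name": "Anna", "dob": "1985-07-01", "sex": "F", "zip": "37203"},
]
QIDS = ("dob", "sex", "zip")


def link(discharge, voters, qids=QIDS):
    """Return {name: diagnosis} for voters whose QIDs match exactly one record."""
    by_qid = {}
    for rec in discharge:
        key = tuple(rec[q] for q in qids)
        by_qid.setdefault(key, []).append(rec["diagnosis"])

    reidentified = {}
    for voter in voters:
        matches = by_qid.get(tuple(voter[q] for q in qids), [])
        if len(matches) == 1:  # a unique match links a name to a sensitive value
            reidentified[voter["name"]] = matches[0]
    return reidentified


print(link(discharge, voters))  # {'George': 'asthma'}
```

George is re-identified because his QID combination is unique in the discharge data; Anna is not, because two discharge records share her QIDs.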
9
Lecture Organization

Data Sharing Beliefs and Dogmas
Privacy Principles
Privacy Protection Methods

10
Types of Privacy Disclosure
Identity Disclosure
My neighbor George is 18 and went to hospital A,
hence
discharge summary of hospital A
Attribute Disclosure
11
How to prevent Identity Disclosure
Uniqueness
Uniqueness
Linkage Model
Make QIDs Non-unique
Break linkage
Make Data Non-unique
Slide adapted from B. Malin
12
How to prevent Identity Disclosure
Any Age,Gender,Zip-code combination
Uniqueness
Uniqueness
Linkage Model
Make QIDs Non-unique
Break linkage
Make Data Non-unique
Slide adapted from B. Malin
13
The Bigger Picture
g: Re-identification Method
d': Knowledge Discovery from Protected Data
f: Anonymity Protection Method
Statistics / Patterns Learned
Protected Data
Original Data
d: Knowledge Discovery from Unprotected Data

Minimize the inversion of f or discovery of g.

Minimize the difference between d and d'.

Slide from B. Malin
14
Principles against Identity Disclosure

Main idea
Map each tuple to at least k-1 others w.r.t. QIDs
Linkage is weakened: the probability of a successful linkage becomes at most 1/k
We will discuss 2 privacy principles
k-Map
k-Anonymity

15
k-Map [Sweeney 01]

P: population
V: a view of P
k-Map is satisfied when each tuple in V maps to at least k-1 other subjects in P w.r.t. QIDs

Population (P)
A view (V) that satisfies 2-map
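
A minimal sketch of a k-map check, assuming (for simplicity) that the population table is available and that tuples match on exact QID values. The function name, field names, and records are hypothetical.

```python
from collections import Counter


def satisfies_k_map(view, population, qids, k):
    """True when every tuple in the view matches at least k subjects in the
    population w.r.t. the QIDs (the subject itself plus k-1 others)."""
    pop_counts = Counter(tuple(p[q] for q in qids) for p in population)
    return all(pop_counts[tuple(t[q] for q in qids)] >= k for t in view)


# Hypothetical population and two views of it.
population = [
    {"age": 18, "zip": "37212"},
    {"age": 18, "zip": "37212"},
    {"age": 25, "zip": "37203"},
    {"age": 25, "zip": "37203"},
    {"age": 40, "zip": "37215"},
]
view_a = [population[0], population[2]]  # each tuple has a twin in P
view_b = [population[4]]                 # unique in P

print(satisfies_k_map(view_a, population, ("age", "zip"), 2))  # True
print(satisfies_k_map(view_b, population, ("age", "zip"), 2))  # False
```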
16
Formal definition of k-Map

T: set of tuples {t1, ..., tn}
ti = <a1, ..., am>, with values in QIDs A1, ..., Am
S: subjects to whom the tuples correspond
P: subjects from a larger population
f(t) = t' is a protection function
An omniscient function maps tuples back to their subjects
TRP is a relation between tuples and population
k-map is satisfied when
$|\{s : \langle t, s \rangle \in TRP\}| \ge k$ (each tuple in S matches at least k subjects in P)
the omniscient function returns the same subject for t and t' (tuples t and t' refer to the same subject)
the subject returned for t is guaranteed to be in S

Slide adapted from B. Malin
17
From k-Map to k-Anonymity

k-Map assumes knowledge of the population P. Is this always a valid assumption?
A hospital only has data for its own patients (a view V of P)
Observation: V is a subset of P, so each tuple t in V also lies in P. Thus, mapping t to k-1 other tuples in V w.r.t. QIDs guarantees that t is mapped to at least k-1 other tuples in P w.r.t. QIDs.

A view (V) that satisfies 2-map
Population (P)
18
k-Anonymity [Sweeney 02]

S: subjects in dataset
k-Anonymity is satisfied when every tuple in S maps to at least k-1 other subjects in S w.r.t. QIDs

Satisfies 2-anonymity (each tuple maps to another w.r.t. Age and Zip-code). Does it satisfy 2-map?
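
A minimal sketch of a k-anonymity check: group the released tuples by their QID values and require every group to hold at least k tuples. The table and field names are hypothetical.

```python
from collections import Counter


def is_k_anonymous(table, qids, k):
    """True when every QID combination in the table occurs at least k times."""
    counts = Counter(tuple(t[q] for q in qids) for t in table)
    return all(c >= k for c in counts.values())


# Hypothetical released table with generalized Age and Zip-code values.
table = [
    {"age": "[18-25]", "zip": "372**", "disease": "flu"},
    {"age": "[18-25]", "zip": "372**", "disease": "asthma"},
    {"age": "[26-40]", "zip": "372**", "disease": "flu"},
    {"age": "[26-40]", "zip": "372**", "disease": "diabetes"},
]

print(is_k_anonymous(table, ("age", "zip"), 2))  # True
print(is_k_anonymous(table, ("age", "zip"), 3))  # False
```

Unlike the k-map check, this test needs only the released table, which is why it is usable when the population is unknown.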
19
Relation between k-Map and k-Anonymity

A k-anonymous table satisfies k-map. A table that satisfies k-map may not be k-anonymous.
k-Map offers the same protection from identity disclosure as k-anonymity, but may require less data transformation

A view that satisfies 2-map but not 2-anonymity
Population (P)
20
How to prevent Attribute Disclosure

An attacker knows QIDs and attempts to guess
disease information.
k-Anonymity may not prevent attribute disclosure

My neighbor George is 18 and went to hospital A,
hence
G1
G2
21
Principles against Attribute Disclosure

Main idea
Make tuples diverse w.r.t. SA
Correlation between QIDs and SA should be weakened
We will discuss 2 privacy principles
l-diversity
tuple-diversity

My neighbor George is 18 and went to hospital A,
hence
G1
G2
22
l-Diversity [Machanavajjhala 07]

A group containing tuples with the same values w.r.t. QIDs is l-diverse when it contains at least l "well represented" SA values.
A table is l-diverse when all groups are l-diverse

A 2-diverse table
G1
G2
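
A minimal sketch of the simplest reading of this definition, distinct l-diversity, where "well represented" just means distinct. The table, groups, and field names are hypothetical.

```python
from collections import defaultdict


def is_distinct_l_diverse(table, qids, sa, l):
    """True when every QID group contains at least l distinct SA values."""
    groups = defaultdict(set)
    for t in table:
        groups[tuple(t[q] for q in qids)].add(t[sa])
    return all(len(values) >= l for values in groups.values())


# Hypothetical 2-anonymous table with two QID groups.
table = [
    {"age": "[18-25]", "zip": "372**", "disease": "flu"},     # group 1
    {"age": "[18-25]", "zip": "372**", "disease": "asthma"},  # group 1
    {"age": "[26-40]", "zip": "372**", "disease": "flu"},     # group 2
    {"age": "[26-40]", "zip": "372**", "disease": "flu"},     # group 2
]

print(is_distinct_l_diverse(table, ("age", "zip"), "disease", 2))  # False
```

The second group is 2-anonymous but holds a single disease value, so anyone known to be in it has that disease.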
23
l-Diversity [Machanavajjhala 07]

What does "well represented" mean?
Distinct l-diversity: "well represented" means distinct
(c,l)-diversity: let $r_i$ denote the number of times the i-th most frequent SA value appears in a group with m distinct SA values. Given a constant c, a group is (c,l)-diverse when
$r_1 < c\,(r_l + r_{l+1} + \dots + r_m)$
(the most frequent SA value should not appear too often)

The table satisfies distinct 2-diversity. In G2, we have $r_1 = 2$, $r_2 = 1$, $r_3 = 1$ and $2 < 2(1+1)$, hence G2 is (2,2)-diverse. Is G1 (2,2)-diverse?
G1
G2
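
A minimal sketch of the (c,l)-diversity test for a single group, following the condition that the most frequent SA value must appear fewer than c times the combined count of the l-th, (l+1)-th, ..., m-th most frequent values. The disease values are hypothetical; only the counts 2, 1, 1 come from the G2 example.

```python
from collections import Counter


def is_c_l_diverse(sa_values, c, l):
    """Check r_1 < c * (r_l + ... + r_m) for the SA values of one group."""
    r = sorted(Counter(sa_values).values(), reverse=True)  # r[0] is r_1
    if len(r) < l:
        return False  # fewer than l distinct values: the right-hand sum is empty
    return r[0] < c * sum(r[l - 1:])


# A group with SA counts 2, 1, 1, as in the G2 example.
g2 = ["flu", "flu", "asthma", "diabetes"]
print(is_c_l_diverse(g2, c=2, l=2))  # True, since 2 < 2 * (1 + 1)

# A hypothetical skewed group with counts 3, 1.
skewed = ["flu", "flu", "flu", "asthma"]
print(is_c_l_diverse(skewed, c=2, l=2))  # False, since 3 >= 2 * 1
```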
24
From l-Diversity to tuple-Diversity

l-diversity controls the probability of inferring an SA value given that we know the QIDs.
For numerical attributes, it suffices for an attacker to estimate an SA value
Estimation may be accurate even when the inference probability is small

G1
My neighbor George is 18 and went to hospital A, hence his income is about 20K
G2
25
tuple-Diversity [Loukides 08]

Observation: SA values that fall into a narrow range may be easily estimated (less privacy).
Tuple-diversity (td) is defined over a set of values V w.r.t. an SA with domain D, using a function r() that returns the range of values in V
Narrow range → lower td → lower privacy

G1
G2
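
A minimal sketch of a range-based measure for a numerical SA. The formula itself is an assumption: it takes td to be the range of the group's values divided by the range of the domain, which is one reading consistent with the quantities named above (V, D, and r()). The income figures are hypothetical.

```python
def value_range(values):
    """r(): the spread between the largest and smallest value."""
    return max(values) - min(values)


def tuple_diversity(group_values, domain_values):
    """Assumed definition: r(V) / r(D). Requires a domain with nonzero range."""
    return value_range(group_values) / value_range(domain_values)


# Hypothetical incomes (in thousands) with a domain from 10K to 100K.
domain = [10, 100]
narrow_group = [19, 20, 21]  # easy to estimate: income is close to 20K
wide_group = [15, 60, 95]    # hard to estimate

print(tuple_diversity(narrow_group, domain) < tuple_diversity(wide_group, domain))  # True
```

Both groups are distinct 3-diverse, yet the narrow one still lets an attacker estimate the income closely, which is the gap this measure is meant to expose.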
26
Lecture Organization

Data Sharing Beliefs and Dogmas
Privacy Principles
Privacy Protection Methods

27
Privacy Protection Methods
What is a good anonymization?
How to derive a good anonymization?
28
Value recoding

Transform QID values so that each tuple has the same values as at least k-1 others w.r.t. QIDs.
There are many ways to transform QID values, e.g. Zip-code = [53710, 53715] or Zip-code = Madison, WI
Organize QID values into hierarchies that specify acceptable transformations

Hierarchy for Zip-code
Hierarchy for Gender
Graphs from [LeFevre 05]
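
A minimal sketch of hierarchy-based recoding. The hierarchies are assumptions made for illustration: zip codes are generalized by masking trailing digits (53710, 5371*, 537**), and gender has a single generalized value, "Person". Function and field names are hypothetical.

```python
def generalize_zip(zip_code, level):
    """Mask the last `level` digits of a zip code (level 0 keeps it intact)."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level outside the hierarchy")
    return zip_code[:len(zip_code) - level] + "*" * level


def generalize_gender(gender, level):
    """Level 0 keeps the value; level 1 replaces it with the root 'Person'."""
    return gender if level == 0 else "Person"


def recode(table, zip_level, gender_level):
    """Apply one generalization level per QID to every tuple."""
    return [
        {"zip": generalize_zip(t["zip"], zip_level),
         "gender": generalize_gender(t["gender"], gender_level)}
        for t in table
    ]


table = [{"zip": "53710", "gender": "Male"}, {"zip": "53715", "gender": "Male"}]
print(recode(table, zip_level=1, gender_level=0))
# [{'zip': '5371*', 'gender': 'Male'}, {'zip': '5371*', 'gender': 'Male'}]
```

After recoding zip codes to level 1, the two tuples become indistinguishable w.r.t. the QIDs.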
29
Optimization Objective
d': Knowledge Discovery from Protected Data
Statistics / Patterns Learned
f: Anonymity Protection Method
Protected Data
Original Data
d: Knowledge Discovery from Unprotected Data
Minimize the difference between d and d'
Transform QID values as little as possible to make the table k-anonymous
Slide adapted from B. Malin
30
Search strategies

Acceptable transformations for all QIDs may form a lattice
A node is k-anonymous when applying the transformation it implies creates a k-anonymous table
Lower nodes imply less transformation
Search for the k-anonymous node with the lowest height

Graphs from [LeFevre 05]
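
A minimal sketch of an exhaustive search over a small lattice. Each node is a pair (zip level, gender level), and its height is the sum of the two levels. The hierarchies (masking zip digits, "Person" as the gender root), the table, and all names are assumptions for illustration.

```python
from collections import Counter
from itertools import product

MAX_LEVEL = {"zip": 2, "gender": 1}  # assumed hierarchy heights


def generalize(t, node):
    """Recode one tuple according to the node's (zip level, gender level)."""
    z, g = node
    zip_code = t["zip"][:len(t["zip"]) - z] + "*" * z
    gender = t["gender"] if g == 0 else "Person"
    return (zip_code, gender)


def is_k_anonymous(table, node, k):
    counts = Counter(generalize(t, node) for t in table)
    return all(c >= k for c in counts.values())


def lowest_k_anonymous_nodes(table, k):
    """Visit nodes by increasing height; return all k-anonymous nodes at the
    lowest height where at least one exists."""
    nodes = sorted(
        product(range(MAX_LEVEL["zip"] + 1), range(MAX_LEVEL["gender"] + 1)),
        key=sum,
    )
    best_height, found = None, []
    for node in nodes:
        if best_height is not None and sum(node) > best_height:
            break
        if is_k_anonymous(table, node, k):
            best_height = sum(node)
            found.append(node)
    return found


table = [
    {"zip": "53710", "gender": "Male"},
    {"zip": "53715", "gender": "Male"},
    {"zip": "53703", "gender": "Female"},
    {"zip": "53706", "gender": "Female"},
]
print(lowest_k_anonymous_nodes(table, 2))  # [(1, 0)]
```

Here generalizing zip codes by one level is enough for 2-anonymity, while generalizing gender alone is not.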
31
k-Anonymity Hardness

What is an optimal transformation?
Notation
- a QID
- the value of a tuple w.r.t. that QID
- the height of the hierarchy for that QID (e.g. 2 for Gender's hierarchy)
- the level of the hierarchy for that QID at which the value lies after anonymization
Cost of transforming a value (a lower level implies a lower cost)
Total cost of transforming all the values for all QIDs of a table
Is it easy to minimize this cost?

Hierarchy for Gender
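
A minimal sketch of such a cost measure. The formula is an assumption: the cost of one value is taken to be its level after anonymization divided by the height of its QID's hierarchy, and the total cost is the sum over all values and QIDs. The heights and levels below are hypothetical.

```python
def value_cost(level, height):
    """Assumed per-value cost: 0 when untouched, 1 when fully generalized."""
    return level / height


def total_cost(levels_per_tuple, heights):
    """Sum the per-value costs over every tuple and every QID."""
    return sum(
        value_cost(levels[qid], heights[qid])
        for levels in levels_per_tuple
        for qid in heights
    )


# Hypothetical hierarchy heights and the levels chosen for two tuples.
heights = {"zip": 2, "gender": 1}
levels_per_tuple = [
    {"zip": 1, "gender": 0},
    {"zip": 1, "gender": 0},
]
print(total_cost(levels_per_tuple, heights))  # 1.0
```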
32
k-Anonymity Hardness

The problem of optimally k-anonymizing a table is NP-hard, even when each attribute takes only 3 values [Aggarwal 05]
This is shown by a reduction from EDGE PARTITION INTO TRIANGLES, a known NP-hard problem:
Given a graph G = (V, E) with |E| = 3m for some integer m, can the edges of G be partitioned into m edge-disjoint triangles?
We need heuristics to solve it
Search some of the transformations to find a sub-optimal solution

33
Search strategies

Observations
When a node Vi at height i is k-anonymous, any node Vj at height j > i that generalizes Vi will be k-anonymous as well.
If a node Vi is k-anonymous w.r.t. a set of QIDs Q, Vi must be k-anonymous w.r.t. any subset of Q

Graphs from [LeFevre 05]
34
Incognito Algorithm [LeFevre 05]

For nodes corresponding to increasingly larger
sets of QIDs
breadth-first search in the lattice
keep k-anonymous nodes only
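
A simplified two-QID sketch in the spirit of these steps, not the implementation from [LeFevre 05]. It first checks each QID on its own, then builds two-QID candidates only from levels that survived, relying on the property that a node that is k-anonymous w.r.t. a set of QIDs is k-anonymous w.r.t. any subset. The hierarchies, table, and names are assumptions, and the shortcut of marking generalizations of a k-anonymous node without re-checking is omitted.

```python
from collections import Counter
from itertools import product

HEIGHTS = {"zip": 2, "gender": 1}  # assumed hierarchy heights


def generalize_value(attr, value, level):
    if attr == "zip":
        return value[:len(value) - level] + "*" * level
    return value if level == 0 else "Person"


def is_k_anonymous(table, node, k):
    """node maps each QID under consideration to a generalization level."""
    counts = Counter(
        tuple(generalize_value(a, t[a], lvl) for a, lvl in sorted(node.items()))
        for t in table
    )
    return all(c >= k for c in counts.values())


def incognito_sketch(table, k):
    # Pass 1: single-QID lattices; keep only k-anonymous levels.
    ok = {
        attr: [lvl for lvl in range(height + 1)
               if is_k_anonymous(table, {attr: lvl}, k)]
        for attr, height in HEIGHTS.items()
    }
    # Pass 2: two-QID candidates built only from surviving levels,
    # visited by increasing height.
    candidates = sorted(product(ok["zip"], ok["gender"]), key=sum)
    return [(z, g) for z, g in candidates
            if is_k_anonymous(table, {"zip": z, "gender": g}, k)]


table = [
    {"zip": "53710", "gender": "Male"},
    {"zip": "53715", "gender": "Male"},
    {"zip": "53703", "gender": "Female"},
    {"zip": "53706", "gender": "Female"},
]
print(incognito_sketch(table, 2))  # [(1, 0), (1, 1), (2, 0), (2, 1)]
```

Nodes with zip level 0 are never tested in the second pass, because zip codes alone already fail 2-anonymity at that level.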

35
Incognito can be used to prevent Attribute
Disclosure

For nodes corresponding to increasingly larger
sets of QIDs
breadth-first search in the lattice
keep k-anonymous nodes that additionally satisfy
l-diversity or tuple-diversity

36
Experimental results

Incognito that searches for
k-anonymity
k-anonymity and l-diversity
k-anonymity and tuple-diversity
has been applied to US census data
Evaluation of
the amount of data transformation
the level of protection from attribute disclosure

37
Experimental results

Amount of data transformation

tuple-diversity
UM measures the uncertainty in finding a QID value (0: original QIDs are preserved; 1: each QID value is replaced by ...) [Loukides 09]
The height of the node in the lattice that corresponds to the resultant table [Machanavajjhala 07]
38
Experimental results

Level of protection from attribute disclosure

Tables, groups, tuples (individuals) that do not satisfy distinct 2-diversity [Machanavajjhala 07]
They are susceptible to attribute disclosure when only k-anonymity applies

So far, we have considered demographics as QIDs
and sensitive information modeled as a simple SA
(e.g. disease or income)
Are there any alternatives?

40
References

[Sweeney 00] L. Sweeney. Uniqueness of Simple Demographics in the U.S. Population. LIDAP-WP4, Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000.
[Sweeney 01] L. Sweeney, H. Abelson. Computational disclosure control: a primer on data privacy protection. 2001.
[Sweeney 02] L. Sweeney. k-Anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557-570, 2002.
[Machanavajjhala 07] A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam. l-Diversity: Privacy beyond k-anonymity. TKDD 1(1), 2007.
[Loukides 08] G. Loukides, J. Shao. An Efficient Clustering Algorithm for k-Anonymisation. J. Comput. Sci. Technol. 23(2):188-202, 2008.
[LeFevre 05] K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. SIGMOD Conference 2005: 49-60.
[Loukides 09] G. Loukides, A. Tziatzios, J. Shao. Towards Preference-Constrained k-Anonymisation. To appear in Proc. of the DASFAA International Workshop on Privacy-Preserving Data Analysis (PPDA), 2009.