Grigorios Loukides, PhD (grigorios.loukides_at_vanderbilt.edu)

1
Anonymization of Biomedical Data: Protection
Principles and Techniques
  • Grigorios Loukides, PhD
    (grigorios.loukides_at_vanderbilt.edu)

30 Mar. 2009, Vanderbilt University
2
Lecture Organization
  • Data Sharing Beliefs and Dogmas
  • Privacy Principles
  • Privacy Protection Methods

3
What Can We Do?
  • Some say, "You Can't Release Any Data"

Distortion, anonymity
Accuracy, quality, risk
Recipient
Data Holder
Slide from L. Sweeney
4
What Can We Do?
  • Others say, "Privacy is Dead, Get Over It"

Distortion, anonymity
Accuracy, quality, risk
Data Holder
Recipient
Slide from L. Sweeney
Others: see Larry Ellison (Oracle), Scott
McNealy (Sun Micro.). Ex.:
http://www.computerworld.com/securitytopics/security/story/0,10801,64729,00.html
5
Example
Legal Perspective
Is this true?
Privacy is dead
Slide from B. Malin
  • http://law.vanderbilt.edu/article-search/article-detail/index.aspx?nid=161

6
What Can We Do?
  • We say, "Share Data While Providing Guarantees of
    Anonymity"

Computational solutions
Holder
Recipient
Slide from L. Sweeney
7
Hide Personal Identity
  • Safe to release this data?

hospital discharge summary
So, care needs to be taken!
Voter List
8
Central Dogma of Re-identification
Sensitive Attribute (SA)
Quasi-identifiers (QIDs)
hospital discharge summary
Voter List
Uniqueness
Linkage model
Estimated that 87% of US citizens are likely to
be identified based on Date of Birth, Sex, and
5-digit zip-code [Sweeney 00]
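The linkage model on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used in [Sweeney 00]: the records, the field names (`dob`, `sex`, `zip`, `diagnosis`) and the helper `link` are all hypothetical. It joins a de-identified discharge table with a voter list on the QIDs and reports the subjects whose QID combination is unique in the voter list.

```python
from collections import defaultdict

QIDS = ("dob", "sex", "zip")  # assumed quasi-identifiers

def link(discharge, voters, qids=QIDS):
    """Return {name: diagnosis} for discharge rows whose QID values
    match exactly one voter (a unique, hence re-identifying, match)."""
    by_qid = defaultdict(list)
    for v in voters:
        by_qid[tuple(v[q] for q in qids)].append(v["name"])
    reidentified = {}
    for row in discharge:
        names = by_qid.get(tuple(row[q] for q in qids), [])
        if len(names) == 1:  # uniqueness enables the linkage
            reidentified[names[0]] = row["diagnosis"]
    return reidentified

# Hypothetical toy records, invented for illustration only.
discharge = [
    {"dob": "1990-03-30", "sex": "M", "zip": "37212", "diagnosis": "asthma"},
    {"dob": "1985-07-01", "sex": "F", "zip": "37203", "diagnosis": "flu"},
]
voters = [
    {"name": "George", "dob": "1990-03-30", "sex": "M", "zip": "37212"},
    {"name": "Ann", "dob": "1985-07-01", "sex": "F", "zip": "37203"},
    {"name": "Beth", "dob": "1985-07-01", "sex": "F", "zip": "37203"},
]

print(link(discharge, voters))  # {'George': 'asthma'}
```

Ann and Beth share the same QID values, so the second discharge row cannot be attributed to either of them with certainty; George is unique and is re-identified.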
9
Lecture Organization
  • Data Sharing Beliefs and Dogmas
  • Privacy Principles
  • Privacy Protection Methods

10
Types of Privacy Disclosure
Identity Disclosure
My neighbor George is 18 and went to hospital A,
hence
discharge summary of hospital A
Attribute Disclosure
11
How to prevent Identity Disclosure
Uniqueness
Linkage Model
Make QIDs Non-unique
Break linkage
Make Data Non-unique
Slide adapted from B. Malin
12
How to prevent Identity Disclosure
Any Age,Gender,Zip-code combination
Uniqueness
Linkage Model
Make QIDs Non-unique
Break linkage
Make Data Non-unique
Slide adapted from B. Malin
13
The Bigger Picture
g: Re-identification Method
d': Knowledge Discovery from Protected Data
f: Anonymity Protection Method
Statistics / Patterns Learned
Protected Data
Original Data
d: Knowledge Discovery from Unprotected Data
  • Minimize the inversion of f or discovery of g.
  • Minimize the difference between d and d'.

Slide from B. Malin
14
Principles against Identity Disclosure
  • Main idea
  • Map each tuple to at least k-1 others w.r.t. QIDs
  • Linkage is weakened: the probability of success
    becomes at most 1/k
  • We will discuss 2 privacy principles
  • k-Map
  • k-Anonymity

15
k-Map [Sweeney 01]
  • P: Population
  • V: a view of P
  • k-Map is satisfied when each tuple in V
    maps to at least k-1 other subjects in P
    w.r.t. QIDs
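The definition above can be checked mechanically when the population is known. The following is a minimal sketch under that assumption; the function name `satisfies_k_map`, the attribute names and the toy records are hypothetical. Because V is a view of P, "at least k-1 other subjects" is tested as "at least k matching subjects in P".

```python
from collections import Counter

def satisfies_k_map(view, population, qids, k):
    """True when every tuple in `view` matches at least k subjects
    of `population` on the given quasi-identifiers."""
    pop_counts = Counter(tuple(p[q] for q in qids) for p in population)
    return all(pop_counts[tuple(t[q] for q in qids)] >= k for t in view)

# Hypothetical population and view, invented for illustration only.
population = [
    {"age": 18, "zip": "37212"},
    {"age": 18, "zip": "37212"},
    {"age": 25, "zip": "37203"},
    {"age": 25, "zip": "37203"},
]
view = [population[0], population[2]]

print(satisfies_k_map(view, population, ("age", "zip"), 2))  # True
print(satisfies_k_map(view, population, ("age", "zip"), 3))  # False
```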

Population (P)
A view (V) that satisfies 2-map
16
Formal definition of k-Map
  • $T$: set of tuples $\{t_1, \dots, t_n\}$
  • $t_i = (a_1, \dots, a_m)$: values in QIDs $A_1, \dots, A_m$
  • $S$: the subjects to which the tuples correspond
  • $P$: subjects from a larger population
  • $f(t) = t'$ is a protection function
  • $\omega(t)$: omniscient function that maps tuples back
    to their subjects
  • TRP is a relation between tuples and population
  • k-map is satisfied when
  • $|\{s : \langle t, s \rangle \in \mathrm{TRP}\}| \ge k$ (each tuple
    matches at least k subjects in P)
  • $\omega(t) = \omega(t')$ (tuples $t$ and
    $t'$ refer to the same subject)
  • $\omega(t) \in S$ ($\omega(t)$ is guaranteed to be in $S$)

Slide adapted from B. Malin
17
From k-Map to k-Anonymity
  • k-Map assumes knowledge of the population P. Is
    this always a valid assumption?
  • A hospital only has data for its own patients
    (a view V of P)
  • Observation: each tuple t lies in $P \cap V$. Thus,
    mapping t to k-1 other tuples in V w.r.t. QIDs
    guarantees that t is mapped to at least k-1 other
    tuples in P w.r.t. QIDs.

A view (V) that satisfies 2-map
Population (P)
18
k-Anonymity [Sweeney 02]
  • S: Subjects in the dataset
  • k-Anonymity is satisfied when every tuple in S
    maps to at least k-1 other subjects in S
    w.r.t. QIDs

Satisfies 2-anonymity (each tuple maps to
another w.r.t. Age and Zip-code). Does it satisfy
2-map?
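A k-anonymity check needs only the released table itself. The sketch below groups tuples by their QID values and requires every group to have at least k members; the function name `is_k_anonymous` and the toy table are hypothetical and are not the table shown on the slide.

```python
from collections import Counter

def is_k_anonymous(table, qids, k):
    """True when every combination of QID values that occurs in
    `table` is shared by at least k tuples."""
    counts = Counter(tuple(t[q] for q in qids) for t in table)
    return all(c >= k for c in counts.values())

# Hypothetical released table, invented for illustration only.
table = [
    {"age": 18, "zip": "37212", "disease": "flu"},
    {"age": 18, "zip": "37212", "disease": "asthma"},
    {"age": 25, "zip": "37203", "disease": "flu"},
    {"age": 25, "zip": "37203", "disease": "cancer"},
]

print(is_k_anonymous(table, ("age", "zip"), 2))  # True
print(is_k_anonymous(table, ("age", "zip"), 3))  # False
```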
19
Relation between k-Map and k-Anonymity
  • A k-anonymous table satisfies k-map. A table that
    satisfies k-map may not be k-anonymous.
  • k-Map offers same protection from identity
    disclosure as k-anonymity, but requires less data
    transformation

A view that satisfies 2-map but not 2-anonymity
Population (P)
20
How to prevent Attribute Disclosure
  • An attacker knows QIDs and attempts to guess
    disease information.
  • k-Anonymity may not prevent attribute disclosure

My neighbor George is 18 and went to hospital A,
hence
G1
G2
21
Principles against Attribute Disclosure
  • Main idea
  • Make tuples diverse w.r.t. SA
  • Correlation between QIDs and SA should be
    weakened
  • We will discuss 2 privacy principles
  • l-diversity
  • tuple-diversity

My neighbor George is 18 and went to hospital A,
hence
G1
G2
22
l-Diversity [Machanavajjhala 07]
  • A group containing tuples with the same values
    w.r.t. QIDs is l-diverse when it contains at
    least l "well represented" SA values.
  • A table is l-diverse when all groups are
    l-diverse

A 2-diverse table
G1
G2
23
l-Diversity [Machanavajjhala 07]
  • What does "well represented" mean?
  • distinct l-diversity: "well represented" means
    distinct
  • (c,l)-diversity: Let $r_i$ denote the number of
    times the $i$-th most frequent SA value appears
    in a group with $m$ distinct SA values. Given a
    constant $c$, a group is (c,l)-diverse when
    $r_1 < c\,(r_l + r_{l+1} + \dots + r_m)$
  • (the most frequent SA value should not
    appear too often)

Table satisfies distinct 2-diversity. In G2, we
have $r_1 = 2$, $r_2 = 1$, $r_3 = 1$ and $2 < 2(1+1)$, hence G2 is
(2,2)-diverse. Is G1 (2,2)-diverse?
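Both readings of "well represented" can be tested per group. The sketch below assumes the recursive (c,l)-diversity condition of [Machanavajjhala 07], $r_1 < c\,(r_l + \dots + r_m)$; the function names and the SA values are hypothetical. The first group is built to have the counts quoted for G2 (2, 1, 1); the second group is an invented counter-example, not the slide's G1.

```python
from collections import Counter

def sorted_counts(sa_values):
    """Frequencies of the SA values in a group, most frequent first."""
    return sorted(Counter(sa_values).values(), reverse=True)

def is_distinct_l_diverse(sa_values, l):
    """Distinct l-diversity: at least l different SA values."""
    return len(set(sa_values)) >= l

def is_c_l_diverse(sa_values, c, l):
    """Recursive (c,l)-diversity: r_1 < c * (r_l + ... + r_m)."""
    r = sorted_counts(sa_values)
    if len(r) < l:
        return False
    return r[0] < c * sum(r[l - 1:])

group_a = ["flu", "flu", "cancer", "asthma"]  # counts 2, 1, 1 as for G2
group_b = ["flu", "flu", "flu", "cancer"]     # counts 3, 1 (invented)

print(is_c_l_diverse(group_a, c=2, l=2))  # True:  2 < 2 * (1 + 1)
print(is_c_l_diverse(group_b, c=2, l=2))  # False: 3 < 2 * 1 does not hold
print(is_distinct_l_diverse(group_b, 2))  # True
```

The last two lines show that a group can satisfy distinct 2-diversity while failing (2,2)-diversity, because one SA value dominates it.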
G1
G2
24
From l-Diversity to tuple-Diversity
  • l-diversity controls the probability of inferring
    an SA value given that the QIDs are known.
  • For numerical attributes it suffices to estimate
    an SA value
  • Estimation may be accurate even when inference
    probability is small
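The last point can be made concrete with a small numeric sketch. The incomes below are hypothetical, and the two helper functions are illustrative only: one computes the best chance of guessing an exact SA value in a group, the other the midpoint of the group's range together with the worst-case error of that estimate.

```python
from collections import Counter

def max_inference_probability(values):
    """Probability of guessing the exact SA value by picking the
    most frequent value in the group."""
    counts = Counter(values)
    return max(counts.values()) / len(values)

def midpoint_estimate(values):
    """Midpoint of the value range and the worst-case error made
    when that midpoint is used as an estimate for any group member."""
    lo, hi = min(values), max(values)
    return (lo + hi) / 2, (hi - lo) / 2

# Hypothetical incomes of one group, invented for illustration only.
incomes = [19000, 20000, 21000, 20500]

print(max_inference_probability(incomes))  # 0.25
print(midpoint_estimate(incomes))          # (20000.0, 1000.0)
```

All four incomes are distinct, so the exact value is guessed with probability 0.25 only, yet every income in the group lies within 1000 of the estimate.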

G1
My neighbor George is 18 and went to hospital A,
hence his income is around 20K
G2
25
tuple-Diversity [Loukides 08]
  • Observation: SA values that fall into a narrow
    range may be easily estimated (less privacy).
  • Given a set of values V w.r.t. an SA with domain
    D, and a function r() that returns the range of
    values in V, tuple-diversity is defined as
  • Narrow range → lower td → lower privacy

G1
G2
26
Lecture Organization
  • Data Sharing Beliefs and Dogmas
  • Privacy Principles
  • Privacy Protection Methods

27
Privacy Protection Methods
What is a good anonymization?
How to derive a good anonymization?
28
Value recoding
  • Transform QID values so that each tuple has the same
    values as at least k-1 others w.r.t. QIDs.
  • There are many ways to transform QID values,
    e.g. Zip-code = {53710, 53715} or
    Zip-code = Madison, WI
  • Organize QID values into hierarchies that specify
    acceptable transformations
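Value recoding along a hierarchy can be sketched as follows. The hierarchy used here is an assumption: each level masks one more trailing digit of the zip code (53715 → 5371* → 537**), which may differ from the hierarchy drawn on the slide. The function names and the zip codes in the toy table are hypothetical.

```python
from collections import Counter

def generalize_zip(zip_code, level):
    """Recode a zip code by masking its last `level` digits with '*'."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level outside the hierarchy")
    return zip_code[:len(zip_code) - level] + "*" * level

def smallest_group(zips, level):
    """Size of the smallest group after recoding all zip codes."""
    counts = Counter(generalize_zip(z, level) for z in zips)
    return min(counts.values())

zips = ["53715", "53710", "53706", "53703"]  # invented values

print(generalize_zip("53715", 1))  # 5371*
print(smallest_group(zips, 0))     # 1: every zip code is unique
print(smallest_group(zips, 1))     # 2: the table is 2-anonymous w.r.t. zip
print(smallest_group(zips, 2))     # 4
```

Each additional level makes the groups larger at the price of less precise values, which is the trade-off the hierarchy is meant to control.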

Hierarchy for Zip-code
Hierarchy for Gender
Graphs from [LeFevre 05]
29
Optimization Objective
d': Knowledge Discovery from Protected Data
Statistics / Patterns Learned
f: Anonymity Protection Method
Protected Data
Original Data
d: Knowledge Discovery from Unprotected Data
Minimize the difference between d and d'
Transform QID values as little as possible to make
the table k-anonymous
Slide adapted from B. Malin
30
Search strategies
  • Acceptable transformations for all QIDs may form
    a lattice
  • A node is k-anonymous when applying the
    transformation it implies creates a k-anonymous
    table
  • Lower nodes imply less transformation
  • Search for the k-anonymous node with the lowest
    height

Graphs from [LeFevre 05]
31
k-Anonymity Hardness
  • What is an optimal transformation?
  • Notation
  • a QID
  • the value of a tuple w.r.t. that QID
  • the height of the hierarchy for that QID
    (e.g. 2 for Gender's hierarchy)
  • the level of that hierarchy at which the value
    lies after anonymization
  • Cost of transforming a value (a lower
    level implies less cost)
  • Total cost of transforming all the values for all
    QIDs of a table
  • Is it easy to minimize this cost?

Hierarchy for Gender
32
k-Anonymity Hardness
  • The problem of optimally k-anonymizing a table is
    NP-hard, even when attribute values come from an
    alphabet of only 3 values [Aggarwal 05]
  • This is because EDGE PARTITION INTO TRIANGLES, a
    known NP-hard problem, can be reduced to it
  • Given a graph G = (V, E) with |E| = 3m for some
    integer m, can the edges of G be partitioned into
    m edge-disjoint triangles?
  • We need heuristics to solve it
  • Search some of the transformations to find a
    sub-optimal solution

33
Search strategies
  • Observations
  • When a node Vi at height level i is k-anonymous,
    any node Vj with j > i that generalizes Vi will be
    k-anonymous as well.
  • If a node Vi is k-anonymous w.r.t. a set of QIDs
    Q, Vi must be k-anonymous w.r.t. any subset of Q

Graphs from [LeFevre 05]
34
Incognito Algorithm [LeFevre 05]
  • For nodes corresponding to increasingly larger
    sets of QIDs
  • breadth-first search in the lattice
  • keep k-anonymous nodes only
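The lattice search can be illustrated with a simplified sketch. This is not Incognito itself: it runs a single breadth-first search over the full lattice, level by level, and stops at the lowest height that contains a k-anonymous node, without iterating over increasingly larger QID subsets or pruning with the subset property. The hierarchies, the table and all names below are hypothetical.

```python
from collections import Counter

# Assumed hierarchies: for each QID, one recoding function per level.
HIERARCHIES = {
    "zip": [lambda z: z, lambda z: z[:4] + "*", lambda z: z[:3] + "**"],
    "sex": [lambda s: s, lambda s: "Person"],
}

def recode(table, node, qids):
    """Apply the generalization levels in `node` to every tuple."""
    return [
        tuple(HIERARCHIES[q][lvl](row[q]) for q, lvl in zip(qids, node))
        for row in table
    ]

def is_k_anonymous(groups, k):
    return all(c >= k for c in Counter(groups).values())

def lowest_k_anonymous_nodes(table, qids, k):
    """Return the k-anonymous lattice nodes of minimum height
    (height = sum of levels), or [] when no node qualifies."""
    max_levels = [len(HIERARCHIES[q]) - 1 for q in qids]
    start = tuple(0 for _ in qids)
    frontier, seen = [start], {start}
    while frontier:  # every node in `frontier` has the same height
        hits = [n for n in frontier
                if is_k_anonymous(recode(table, n, qids), k)]
        if hits:
            return sorted(hits)
        next_frontier = []
        for node in frontier:
            for i, lvl in enumerate(node):
                if lvl < max_levels[i]:
                    parent = node[:i] + (lvl + 1,) + node[i + 1:]
                    if parent not in seen:
                        seen.add(parent)
                        next_frontier.append(parent)
        frontier = next_frontier
    return []

QIDS = ("zip", "sex")
TABLE = [  # invented records
    {"zip": "53715", "sex": "M"},
    {"zip": "53710", "sex": "M"},
    {"zip": "53703", "sex": "F"},
    {"zip": "53706", "sex": "F"},
]

print(lowest_k_anonymous_nodes(TABLE, QIDS, 2))  # [(1, 0)]
print(lowest_k_anonymous_nodes(TABLE, QIDS, 4))  # [(2, 1)]
```

To also guard against attribute disclosure, the `is_k_anonymous` test could be combined with a diversity test on the SA values of each group before a node is accepted.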

35
Incognito can be used to prevent Attribute Disclosure
  • For nodes corresponding to increasingly larger
    sets of QIDs
  • breadth-first search in the lattice
  • keep k-anonymous nodes that additionally satisfy
    l-diversity or tuple-diversity

36
Experimental results
  • Incognito that searches for
  • k-anonymity
  • k-anonymity and l-diversity
  • k-anonymity and tuple-diversity
  • has been applied to US census data
  • Evaluation of
  • the amount of data transformation
  • the level of protection from attribute disclosure

37
Experimental results
  • Amount of data transformation

tuple-diversity
UM measures the uncertainty in finding a QID
value: 0 = original QIDs are preserved, 1 = each QID
value is replaced by ... [Loukides 09]
The height of the node in the lattice that
corresponds to the resultant table [Machanavajjhala 07]
38
Experimental results
  • Level of protection from attribute disclosure
  • Tables, groups, tuples (individuals) that do
    not satisfy distinct 2-diversity
    [Machanavajjhala 07]
  • They are susceptible to attribute disclosure
    when only k-anonymity applies

39
  • So far, we have considered demographics as QIDs
    and sensitive information modeled as a simple SA
    (e.g. disease or income)
  • Are there any alternatives?

40
References
  • [Sweeney 00] L. Sweeney: Uniqueness of Simple
    Demographics in the U.S. Population, LIDAP-WP4.
    Carnegie Mellon University, Laboratory for
    International Data Privacy, Pittsburgh, PA,
    2000.
  • [Sweeney 01] L. Sweeney, H. Abelson:
    Computational disclosure control: a primer on
    data privacy protection, 2001.
  • [Sweeney 02] L. Sweeney: K-anonymity: a model for
    protecting privacy. International Journal of
    Uncertainty, Fuzziness and Knowledge-based Systems,
    10(5): 557-570 (2002)
  • [Machanavajjhala 07] A. Machanavajjhala, D.
    Kifer, J. Gehrke, M. Venkitasubramaniam:
    L-diversity: Privacy beyond k-anonymity. TKDD
    1(1) (2007)
  • [Loukides 08] G. Loukides, J. Shao: An Efficient
    Clustering Algorithm for k-Anonymisation. J.
    Comput. Sci. Technol. 23(2): 188-202 (2008)
  • [LeFevre 05] K. LeFevre, D. J. DeWitt, R.
    Ramakrishnan: Incognito: Efficient Full-Domain
    K-Anonymity. SIGMOD Conference 2005: 49-60
  • [Loukides 09] G. Loukides, A. Tziatzios, J. Shao:
    Towards Preference-Constrained
    k-Anonymisation. To appear in the Proc. of the
    DASFAA International Workshop on
    Privacy-Preserving Data Analysis (PPDA), 2009

Questions?