Protecting Respondents' Identities in Microdata Release

1
Protecting Respondents' Identities in Microdata
Release

Sushil Jajodia

2
References

P. Samarati, "Protecting respondents' identities
in microdata release," IEEE Transactions on Knowledge
and Data Engineering, vol. 13, no. 6, 2001, pp.
1010-1027.
See also papers by Latanya Sweeney, CMU

3
Outline

The problem
Related work
Generalizing data
Suppressing data
Obtaining k-minimal generalization
Conclusion

4
Re-identifying Anonymous Data by Linking Attack

Anonymous medical data

Publicly available voter list

Sue Carlson has AIDS!

(DOB, Sex, Zip) is thus called a quasi-identifier
Assumption: the quasi-identifier is pre-identified
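
The linking attack on this slide can be sketched as a join on the quasi-identifier. This is only an illustration: the records, the second voter, and the helper name `link` are all invented here, since the slide's actual tables did not survive in the transcript.

```python
# Hypothetical records: every value below is invented for illustration.
medical = [  # "anonymous" release: names removed, quasi-identifier kept
    {"dob": "1961-09-15", "sex": "F", "zip": "22030", "diagnosis": "AIDS"},
    {"dob": "1958-03-02", "sex": "M", "zip": "22041", "diagnosis": "flu"},
]
voters = [  # public list: names plus the same quasi-identifier attributes
    {"name": "Sue Carlson", "dob": "1961-09-15", "sex": "F", "zip": "22030"},
    {"name": "Bob Smith", "dob": "1970-01-20", "sex": "M", "zip": "22044"},
]
QI = ("dob", "sex", "zip")


def link(released, public, qi=QI):
    """Return (name, diagnosis) for released rows that match exactly one public row."""
    index = {}
    for person in public:
        key = tuple(person[a] for a in qi)
        index.setdefault(key, []).append(person["name"])
    hits = []
    for row in released:
        names = index.get(tuple(row[a] for a in qi), [])
        if len(names) == 1:  # a unique match re-identifies the respondent
            hits.append((names[0], row["diagnosis"]))
    return hits


print(link(medical, voters))
```

A released row is only exposed when its quasi-identifier values match a single person in the public list, which is the situation k-anonymity is meant to rule out.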

5
K-anonymity

Anonymous medical data

Publicly available voter list

Who has AIDS?

k = 2
The quasi-identifier is identified as
(DOB, Sex, Zip)
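
A minimal sketch of a k-anonymity check, assuming the requirement is read as "every combination of quasi-identifier values in the released table occurs at least k times". The function name `is_k_anonymous` and the sample rows are hypothetical.

```python
from collections import Counter

QI = ("dob", "sex", "zip")

# Hypothetical released rows with already-generalized DOB and Zip values.
released = [
    {"dob": "1961", "sex": "F", "zip": "2203", "diagnosis": "AIDS"},
    {"dob": "1961", "sex": "F", "zip": "2203", "diagnosis": "flu"},
    {"dob": "1958", "sex": "M", "zip": "2204", "diagnosis": "flu"},
    {"dob": "1958", "sex": "M", "zip": "2204", "diagnosis": "asthma"},
]


def is_k_anonymous(rows, qi, k):
    """True if each quasi-identifier combination appears in at least k rows."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())


print(is_k_anonymous(released, QI, 2))
print(is_k_anonymous(released, QI, 3))
```

With these rows each combination appears twice, so the check passes for k = 2 and fails for k = 3.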

6
Related Work

The release of macrodata (i.e., tabular data containing aggregates) without privacy breaches is better studied than that of microdata (i.e., specific tuples)
Many existing approaches to microdata release are based on perturbation (adding noise), which sacrifices truthfulness
Others lack a formal framework
This work gives a formal foundation for the problem (by formalizing k-anonymity, etc.), provides two methods to achieve the goal (generalization and suppression), and presents algorithms for computing the desired result

7
Generalizing Data: Generalization Hierarchies

Domain generalization hierarchy (totally ordered):
Z0 = {22031, 22030, 22041, 22044} → Z1 = {2203, 2204} → Z2 = {220}
R0 = {asian, black, white} → R1 = {person}

Value generalization hierarchy:
22031, 22030 → 2203
22041, 22044 → 2204
2203, 2204 → 220
asian, black, white → person
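
The value generalization hierarchies on this slide can be encoded as child-to-parent maps. The sketch below uses the slide's values; the names `ZIP_PARENT`, `RACE_PARENT`, and `generalize` are assumptions made for illustration.

```python
# Child -> parent maps taken from the slide's value generalization hierarchies.
ZIP_PARENT = {
    "22031": "2203", "22030": "2203",
    "22041": "2204", "22044": "2204",
    "2203": "220", "2204": "220",
}
RACE_PARENT = {"asian": "person", "black": "person", "white": "person"}


def generalize(value, parent, steps):
    """Walk `steps` levels up the hierarchy; the top element maps to itself."""
    for _ in range(steps):
        value = parent.get(value, value)
    return value


print(generalize("22031", ZIP_PARENT, 1))   # Z0 -> Z1
print(generalize("22031", ZIP_PARENT, 2))   # Z0 -> Z2
print(generalize("asian", RACE_PARENT, 1))  # R0 -> R1
```

Because each domain hierarchy is totally ordered, a single integer (the number of steps) is enough to name the target domain for an attribute.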

8
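Generalizing Data: Table Generalization

Generalization hierarchy over the domain pairs (partially ordered), from most to least general:
<R1,Z2>
<R1,Z1>  <R0,Z2>
<R1,Z0>  <R0,Z1>
<R0,Z0>

[Figure: example generalized tables GT[1,0] and GT[0,1]]

Table generalization:
The number of tuples stays the same
Each attribute's domain is either generalized or left unchanged
Each value is replaced by its generalization or left unchanged (a bijection exists between the tuples of the original and the generalized table)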
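
A sketch of table generalization driven by a distance vector [race steps, zip steps]. The private table `PT` is hypothetical (the slide's tables were images and are not in the transcript), and the ZIP hierarchy is assumed to generalize by dropping trailing digits, which matches the values 22031 → 2203 → 220 shown earlier.

```python
# Hypothetical private table of (race, zip) pairs; not the table from the slides.
PT = [("asian", "22031"), ("asian", "22030"), ("black", "22041"), ("white", "22044")]


def generalize_table(table, vector):
    """Apply the distance vector (race_steps, zip_steps) to every tuple."""
    race_steps, zip_steps = vector
    result = []
    for race, zip_code in table:
        g_race = "person" if race_steps >= 1 else race   # R0 -> R1
        g_zip = zip_code[: len(zip_code) - zip_steps]     # Z0 -> Z1 -> Z2
        result.append((g_race, g_zip))
    return result


GT_1_0 = generalize_table(PT, (1, 0))  # generalize race only
GT_0_1 = generalize_table(PT, (0, 1))  # generalize zip by one step
print(GT_1_0)
print(GT_0_1)
```

Every input tuple produces exactly one output tuple in the same position, so the tuple count is preserved and the position-wise pairing is the bijection the definition asks for.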

9
Generalizing Data: k-minimal Generalization

k-minimal generalization:
The generalized table GT satisfies k-anonymity
GT is minimal: no other generalized table satisfies k-anonymity while being less general than GT (that is, while being generalized by GT)
For example, GT[1,0] is a 2-minimal generalization and GT[0,1] is a 3-minimal generalization
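
The two conditions above can be checked by brute force over the whole hierarchy of distance vectors. This is a sketch under stated assumptions: the table `PT` is invented (so its results need not match the slide's example), ZIP codes generalize by dropping trailing digits, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

MAX_STEPS = (1, 2)  # heights of the race and zip hierarchies (R0..R1, Z0..Z2)

# Hypothetical private table; not the table from the slides.
PT = [("asian", "22031"), ("asian", "22030"), ("black", "22031"),
      ("black", "22030"), ("white", "22041"), ("white", "22041")]


def generalize_table(table, vector):
    race_steps, zip_steps = vector
    return [("person" if race_steps else race, z[: len(z) - zip_steps])
            for race, z in table]


def is_k_anonymous(table, k):
    return all(c >= k for c in Counter(table).values())


def strictly_above(a, b):
    """True if vector a is at least b in every component and differs from b."""
    return a != b and all(x >= y for x, y in zip(a, b))


def k_minimal_vectors(table, k):
    """Vectors that satisfy k-anonymity and lie above no other satisfying vector."""
    all_vectors = product(range(MAX_STEPS[0] + 1), range(MAX_STEPS[1] + 1))
    ok = [v for v in all_vectors if is_k_anonymous(generalize_table(table, v), k)]
    return sorted(v for v in ok if not any(strictly_above(v, w) for w in ok))


print(k_minimal_vectors(PT, 2))
print(k_minimal_vectors(PT, 3))
```

For this invented table there are two incomparable 2-minimal vectors, which illustrates why a k-minimal generalization is not necessarily unique.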

10
Suppressing Data: Why? How?

Had the last tuple not been there, a one-step generalization on Zip would be enough
Suppressing the last tuple reduces the level of generalization needed, and thus improves the accuracy of the released data
Given a generalization level, the minimal required suppression removes all and only the tuples that fail the k-anonymity requirement
The definition of k-minimal generalization is revised to allow suppression, provided that the number of suppressed tuples does not exceed a given threshold
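
Minimal required suppression at a fixed generalization level can be sketched as follows. The table, the threshold name `max_sup`, and the function names are hypothetical; the table is built so that its last tuple is the only one blocking 2-anonymity after a one-step generalization of Zip.

```python
from collections import Counter

# Hypothetical private table; the last tuple is the outlier.
PT = [("asian", "22031"), ("asian", "22030"),
      ("black", "22041"), ("black", "22044"),
      ("white", "22030")]


def generalize_zip(table, steps):
    """Generalize Zip by dropping `steps` trailing digits; race is left as is."""
    return [(race, z[: len(z) - steps]) for race, z in table]


def minimal_suppression(table, k):
    """Split a table into kept tuples and the tuples that violate k-anonymity."""
    counts = Counter(table)
    kept = [t for t in table if counts[t] >= k]
    removed = [t for t in table if counts[t] < k]
    return kept, removed


def acceptable(table, k, max_sup):
    """True if k-anonymity is reachable by suppressing at most max_sup tuples."""
    _, removed = minimal_suppression(table, k)
    return len(removed) <= max_sup


GT = generalize_zip(PT, 1)
kept, removed = minimal_suppression(GT, 2)
print(kept)
print(removed)
print(acceptable(GT, 2, max_sup=1))
```

Without suppression this table would need further generalization to reach 2-anonymity; allowing one suppressed tuple makes the one-step generalization sufficient.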

11
Computing k-minimal Generalization

The naïve approach:
Search bottom-up for a locally minimal generalization along each path in the table generalization hierarchy
Then compare those locally minimal generalizations and choose the globally preferred result
Binary search:
Based on a simple fact: if a generalization fails the k-anonymity criterion, then those lower than it in the hierarchy fail too
Do a binary search with respect to the height (length of the path to the bottom) of generalizations
Preference policies:
For example, if the first hierarchy has 100 elements while the second has two, then [1,0] may be much better than [0,1]
As another example, the result requiring the least suppression may be preferred over those with less generalization
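
The binary search over heights can be sketched as below. It is an illustration rather than the paper's exact algorithm: the table `PT`, the digit-dropping ZIP hierarchy, the suppression threshold `max_sup`, and all function names are assumptions. It relies on the fact stated above, in the form "if no generalization at some height satisfies the requirement, none at a lower height does".

```python
from collections import Counter
from itertools import product

MAX_STEPS = (1, 2)  # heights of the race and zip hierarchies

# Hypothetical private table; not the table from the slides.
PT = [("asian", "22031"), ("asian", "22030"), ("black", "22031"),
      ("black", "22030"), ("white", "22041"), ("white", "22041")]


def generalize_table(table, vector):
    race_steps, zip_steps = vector
    return [("person" if race_steps else race, z[: len(z) - zip_steps])
            for race, z in table]


def satisfies(table, vector, k, max_sup=0):
    """k-anonymity holds after suppressing at most max_sup outlier tuples."""
    counts = Counter(generalize_table(table, vector))
    outliers = sum(c for c in counts.values() if c < k)
    return outliers <= max_sup


def vectors_at_height(h):
    """All distance vectors whose components sum to h."""
    grid = product(range(MAX_STEPS[0] + 1), range(MAX_STEPS[1] + 1))
    return [v for v in grid if sum(v) == h]


def lowest_solutions(table, k, max_sup=0):
    """Binary search for the lowest height that has a satisfying vector."""
    low, high = 0, sum(MAX_STEPS)  # assumes the top of the hierarchy satisfies
    while low < high:
        mid = (low + high) // 2
        if any(satisfies(table, v, k, max_sup) for v in vectors_at_height(mid)):
            high = mid      # a solution exists here; look lower
        else:
            low = mid + 1   # nothing at this height, so nothing below either
    return [v for v in vectors_at_height(low) if satisfies(table, v, k, max_sup)]


print(lowest_solutions(PT, 2))
print(lowest_solutions(PT, 3))
```

When several vectors are returned at the same height, a preference policy such as the ones listed on this slide would pick among them.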

12
Conclusion

The problem is to release specific tuples without exposing individuals' privacy to linking attacks
The goal is formalized as k-anonymity (i.e., every released tuple can be linked to no fewer than k indistinguishable individuals)
Generalization releases less specific data so that k-anonymity can be achieved (e.g., 22030 → 220)
Suppression removes some of the tuples in order to avoid excessive generalization
The combination of the two methods yields the best result
By exploiting the hierarchies, binary search can locate the desired optimal solution more quickly