Title: K-Anonymity
1 K-Anonymity
- Presented by He Yan
- Jan. 27, 2009
2 Outline
- Introduction
- K-Anonymity
- Vulnerabilities
- Conclusion
3 Data Publishing and Data Privacy
- Society is experiencing exponential growth in the number and variety of person-specific information.
- This information is valuable both in research and business. Data sharing is common.
- Publishing the data may put the respondents' privacy at risk.
4 Challenge
- How do you publicly release a database without compromising individual privacy?
- The wrong approach: Just leave out any unique identifiers like name and SSN and hope that this works.
- Why is it wrong? The triple (DOB, gender, ZIP code) suffices to uniquely identify at least 87% of US citizens in publicly available databases (1990 U.S. Census summary data).
5 Challenge (Cont'd)
- Re-identification by linking (Example)
- Andre has heart disease!
Hospital Patient Data
Voter Registration Data
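The linking step on this slide can be sketched as a join on (DOB, gender, ZIP code). This is only a minimal illustration: both tables, every value in them, and the function name `link` are hypothetical and are not taken from the slide's figures.

```python
# Hypothetical data: neither table comes from the original slides.
hospital = [  # released without names, but with the quasi-identifier intact
    {"dob": "1964-09-20", "gender": "M", "zip": "02138", "problem": "heart disease"},
    {"dob": "1965-02-13", "gender": "F", "zip": "02139", "problem": "hypertension"},
]
voters = [  # public list that still carries names
    {"name": "Andre", "dob": "1964-09-20", "gender": "M", "zip": "02138"},
    {"name": "Beth", "dob": "1970-01-01", "gender": "F", "zip": "02141"},
]

QI = ("dob", "gender", "zip")


def link(released, external, qi=QI):
    """Return (name, problem) pairs for released rows whose
    quasi-identifier values match exactly one external row."""
    found = []
    for row in released:
        matches = [e for e in external if all(e[a] == row[a] for a in qi)]
        if len(matches) == 1:  # a unique match re-identifies the respondent
            found.append((matches[0]["name"], row["problem"]))
    return found


print(link(hospital, voters))
```

Removing the name column from the hospital table does not help here, because the three remaining attributes are enough to single out one voter.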
6 Objective
- Maximize data utility while limiting disclosure risk to an acceptable level.
- Moral: Any real privacy guarantee must be proved and established mathematically.
7 Related Works
- Statistical Databases
- The most common approach is adding noise while still maintaining some statistical invariant.
- Disadvantage: It destroys the integrity of the data.
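As a rough illustration of noise addition, the sketch below perturbs each value and then re-centers the result so that one invariant, the mean, is preserved. The choice of invariant, the Gaussian noise, the function name `add_noise_keep_mean`, and the wage values are all assumptions made for this example.

```python
import random
from statistics import mean


def add_noise_keep_mean(values, scale=5.0, seed=0):
    """Add Gaussian noise to every value, then shift all values by a
    constant so the mean of the output equals the mean of the input."""
    rng = random.Random(seed)
    noisy = [v + rng.gauss(0.0, scale) for v in values]
    shift = mean(values) - mean(noisy)
    return [v + shift for v in noisy]


wages = [30.0, 42.0, 55.0, 61.0]  # hypothetical wages
noisy_wages = add_noise_keep_mean(wages)
print(mean(wages), mean(noisy_wages))
```

The mean survives, but no individual value in `noisy_wages` is true any more, which is the integrity problem named on the slide.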
8 Related Works (Cont'd)
- Multi-level Databases
- Data is stored at different security classifications, and users have different security clearances.
- Precise inference is possible. Sensitive information is suppressed in order to prevent inference.
- Disadvantages:
- It is impossible to consider every possible inference.
- Suppression can drastically reduce the quality of the data.
9 Related Works (Cont'd)
- Computer Security
- Access control and authentication ensure that the right people have the right authority over the right object at the right time and place.
- That's not the problem this work tries to solve. This work tries to release as much information as possible while protecting the identities of the subjects (people).
10 Related Works (Cont'd)
- In summary, the dramatic increase in the availability of data from autonomous data holders makes the problem more complex.
- None of these works provides a solution for today's data-rich setting.
11 Outline
- Introduction
- K-Anonymity
- Vulnerabilities
- Conclusion
12 K-Anonymity
- What is K-Anonymity?
- If each row in the table cannot be distinguished from at least k-1 other rows by looking only at a set of attributes, then the table is k-anonymized on these attributes.
- Example: Suppose you try to identify a person in a table, and the only information you have is the person's birth date and gender. If every birth date and gender combination in the table is shared by at least k people, the table adheres to k-anonymity on these attributes.
13 Classification of Attributes
- Key Attribute
- Name, Address, Cell Phone
- An attribute that can uniquely identify an individual directly
- Always removed before release
- Quasi-Identifier
- 5-digit ZIP code, Birth date, Gender
- A set of attributes that can potentially be linked with external information to re-identify entities
- 87% of the population in the U.S. can be uniquely identified based on these attributes, according to the 1990 Census summary data.
- Suppressed or generalized
14 Classification of Attributes (Cont'd)
- Sensitive Attribute
- Medical record, wage, etc.
- Always released directly. These attributes are what the researchers need. It depends on the requirement.
Key Attribute
Quasi-Identifier
Sensitive Attribute
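The three attribute classes translate into three treatments when a row is prepared for release: drop, generalize, keep. The sketch below assumes a hypothetical record layout, and the generalization choices (3 ZIP digits, birth year only) are examples rather than rules from the slides.

```python
KEY = ("name",)                        # removed before release
QI = ("zip", "birth_date", "gender")   # suppressed or generalized
SENSITIVE = ("problem",)               # released directly


def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` digits and mask the rest with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def generalize_birth_date(birth_date):
    """Reduce an ISO date such as '1964-09-20' to its year."""
    return birth_date[:4]


def prepare_row(row):
    """Build the released version of one private row."""
    return {
        "zip": generalize_zip(row["zip"]),
        "birth_date": generalize_birth_date(row["birth_date"]),
        "gender": row["gender"],
        "problem": row["problem"],
    }


private_row = {"name": "Andre", "zip": "02138", "birth_date": "1964-09-20",
               "gender": "male", "problem": "heart disease"}  # hypothetical
print(prepare_row(private_row))
```

Generalizing each row this way does not by itself guarantee k-anonymity. The released table still has to be checked against the chosen k.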
15 K-Anonymity Protection Model
- PT: Private Table
- RT: Released Table
- QI: Quasi-Identifier (Ai, ..., Aj)
- (A1, A2, ..., An): Attributes
- Definition
- Let RT(A1, ..., An) be a table and QI_RT be the quasi-identifier associated with it. RT is said to satisfy k-anonymity if and only if each sequence of values in RT[QI_RT] appears with at least k occurrences in RT[QI_RT].
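The definition can be checked directly by projecting the table onto the quasi-identifier and counting how often each value sequence occurs. The sketch below is a minimal check; the function name `is_k_anonymous` and the sample table are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """True if every sequence of quasi-identifier values that occurs
    in `rows` occurs at least k times."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())


QI = ("birth", "gender", "zip")
released = [  # hypothetical released table
    {"birth": "1964", "gender": "male", "zip": "0213*", "problem": "chest pain"},
    {"birth": "1964", "gender": "male", "zip": "0213*", "problem": "obesity"},
    {"birth": "1965", "gender": "female", "zip": "0213*", "problem": "hypertension"},
    {"birth": "1965", "gender": "female", "zip": "0213*", "problem": "short breath"},
]

print(is_k_anonymous(released, QI, 2))
print(is_k_anonymous(released, QI, 3))
```

Only the quasi-identifier columns take part in the count. The sensitive column is ignored by the definition.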
16 Example
17 Example
Released Table
External Data Source
Suppose you have an external data table. Even by linking these two tables, you still don't know Andre's problem.
18 Outline
- Introduction
- K-Anonymity
- Vulnerabilities
- Conclusion
19 K-Anonymity Vulnerabilities
- Even when sufficient care is taken to identify the QI, K-Anonymity is still vulnerable to attacks.
- Attacks
- Unsorted Matching Attack
- Complementary Release Attack
- Temporal Attack
- Fortunately, these attacks can be prevented by following some best practices.
20 Unsorted Matching Attack
- This attack is based on the order in which tuples appear in the released table.
- Solution: Randomly order (shuffle) the tuples before releasing.
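A minimal sketch of this countermeasure is shown below. The function name `shuffle_for_release` is hypothetical, and using the operating system's randomness source by default is a design choice of this example, not something the slides prescribe.

```python
import random


def shuffle_for_release(rows, rng=None):
    """Return a shuffled copy of `rows` so that tuple order in the
    release carries no information about order in the private table."""
    if rng is None:
        rng = random.SystemRandom()  # not reproducible from a guessed seed
    shuffled = list(rows)            # leave the private table untouched
    rng.shuffle(shuffled)
    return shuffled


table = [(i, "record") for i in range(10)]  # hypothetical tuples
print(shuffle_for_release(table))
```

Each release should be shuffled independently, so that two releases of the same table cannot be aligned by position.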
21 Complementary Release Attack
- Different releases can be linked together to compromise k-anonymity.
- Solution: Consider all previously released tables before releasing a new one, and try to avoid linking.
- Other data holders may release data that can be used in this kind of attack.
- In general, this kind of attack is hard to prevent completely.
22 Complementary Release Attack (Cont'd)
- Both released tables are 2-anonymized, and the QI is (Race, Birth, Gender, ZIP).
- But linking them on Problem will generate LT. See next slide.
23 Complementary Release Attack (Cont'd)
- In LT, the tuples (White, 1964, male, 02138) and (White, 1965, female, 02139) are unique.
- So LT doesn't satisfy 2-anonymity.
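The effect can be reproduced on a small scale. In the sketch below, two releases of the same private table are each 2-anonymous, but they generalize different attributes, so linking them on Problem recovers the specific values. The two tables are hypothetical stand-ins for the slide's figures, and the helper names are made up for this example.

```python
from collections import Counter

QI = ("race", "birth", "gender", "zip")


def is_k_anonymous(rows, qi, k):
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())


def more_specific(x, y):
    """Pick the value with fewer '*' characters, i.e. less generalized."""
    return min(x, y, key=lambda v: v.count("*"))


def link_on(rows_a, rows_b, key, qi=QI):
    """Join two releases on `key`, keeping the more specific QI value."""
    linked = []
    for a in rows_a:
        for b in rows_b:
            if a[key] == b[key]:
                row = {attr: more_specific(a[attr], b[attr]) for attr in qi}
                row[key] = a[key]
                linked.append(row)
    return linked


def rel(race, birth, gender, zip_code, problem):
    return {"race": race, "birth": birth, "gender": gender,
            "zip": zip_code, "problem": problem}


# Hypothetical release A generalizes ZIP; release B generalizes birth and gender.
release_a = [rel("White", "1964", "male", "0213*", "chest pain"),
             rel("White", "1964", "male", "0213*", "obesity"),
             rel("White", "1965", "female", "0213*", "hypertension"),
             rel("White", "1965", "female", "0213*", "short breath")]
release_b = [rel("White", "196*", "*", "02138", "chest pain"),
             rel("White", "196*", "*", "02139", "obesity"),
             rel("White", "196*", "*", "02139", "hypertension"),
             rel("White", "196*", "*", "02138", "short breath")]

lt = link_on(release_a, release_b, "problem")
print(is_k_anonymous(release_a, QI, 2), is_k_anonymous(release_b, QI, 2))
print(is_k_anonymous(lt, QI, 2))
```

The join works here because every Problem value is unique in this toy table. With repeated values the link would be ambiguous for some rows, but any row with a rare sensitive value remains exposed.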
24 Temporal Attack
- Adding or removing tuples may compromise k-anonymity protection.
- Solution: Subsequent releases must be based on the already released table.
25 Outline
- Introduction
- K-Anonymity
- Vulnerabilities
- Conclusion
26 Summary
- K-Anonymity: attributes are suppressed or generalized until each row is identical to at least k-1 other rows on the quasi-identifier.
- K-Anonymity can thus prevent definite external table linkages. At worst, the released data narrows an individual entry down to a group of k individuals.
- K-Anonymity guarantees that the released data is accurate.
27 Open Issues
- How to identify a proper quasi-identifier is a hard problem.
- It depends on what the external table looks like.
- It is hard to predict what external tables will be used to infer the sensitive information.
- How do we find a k-anonymity solution that suppresses the fewest cells? This leads to the next paper.
- We can suppress every cell, but this makes the data useless.
- The cost of a k-anonymous solution to a database is the number of *s introduced.
- A minimum-cost k-anonymity solution suppresses the fewest cells necessary to guarantee k-anonymity.
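The cost measure can be made concrete by counting suppressed cells. The sketch below compares two 2-anonymous versions of the same hypothetical table: one that suppresses everything and one that suppresses only the ZIP column. It only evaluates candidate solutions and does not search for the minimum-cost one.

```python
from collections import Counter


def suppression_cost(rows):
    """Number of cells replaced by '*'."""
    return sum(1 for row in rows for value in row if value == "*")


def is_k_anonymous(rows, k):
    """Rows are tuples of quasi-identifier values only."""
    return all(c >= k for c in Counter(rows).values())


original = [("1964", "male", "02138"), ("1964", "male", "02139"),
            ("1965", "female", "02139"), ("1965", "female", "02138")]  # hypothetical

suppress_all = [("*", "*", "*") for _ in original]
suppress_zip = [(birth, gender, "*") for birth, gender, _ in original]

for name, table in (("all cells", suppress_all), ("ZIP only", suppress_zip)):
    print(name, is_k_anonymous(table, 2), suppression_cost(table))
```

Both candidates are 2-anonymous, but suppressing only ZIP costs 4 cells instead of 12. For this tiny table 4 is also the minimum, because every row differs from every other row in at least one cell and so needs at least one suppressed cell of its own.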
28 Open Issues (Cont'd)
- k-anonymity does not provide privacy if:
- Sensitive values in an equivalence class lack diversity
- The attacker has background knowledge
- This leads to the l-Diversity paper.
A 3-anonymous patient table
Lack of diversity
Background knowledge (Carl's brother has heart disease)
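The lack-of-diversity problem can be detected by grouping rows into equivalence classes and looking at the sensitive values inside each class. The sketch below flags classes with a single distinct sensitive value, which is the simplest possible notion of diversity and an assumption of this example. The 3-anonymous table is hypothetical and is not the one from the slide's figure.

```python
from collections import defaultdict


def sensitive_values_by_class(rows, qi, sensitive):
    """Map each equivalence class (tuple of QI values) to the set of
    sensitive values that occur in it."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in qi)].add(row[sensitive])
    return dict(groups)


def homogeneous_classes(rows, qi, sensitive):
    """Equivalence classes in which every row shares one sensitive value."""
    groups = sensitive_values_by_class(rows, qi, sensitive)
    return [key for key, values in groups.items() if len(values) == 1]


QI = ("zip", "age")
patients = [  # hypothetical 3-anonymous table
    {"zip": "130**", "age": "<30", "problem": "heart disease"},
    {"zip": "130**", "age": "<30", "problem": "heart disease"},
    {"zip": "130**", "age": "<30", "problem": "heart disease"},
    {"zip": "148**", "age": ">=40", "problem": "heart disease"},
    {"zip": "148**", "age": ">=40", "problem": "flu"},
    {"zip": "148**", "age": ">=40", "problem": "cancer"},
]

print(homogeneous_classes(patients, QI, "problem"))
```

Anyone known to be in the first class is revealed to have heart disease even though the table is 3-anonymous. In the second class, background knowledge that rules out some values narrows the remaining possibilities in the same way.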