1. Data Anonymization for Privacy-Preserving Data Publishing
- CS 526 Information Security
- Lecture 2
- Tiancheng Li
2. Outline
- Background: Data Publishing and Privacy
- Privacy Measures
  - k-Anonymity
  - l-Diversity
  - t-Closeness
- Privacy in Other Contexts
  - Privacy in Web Search
    - Query log privacy
    - Personalization privacy
  - Privacy in Social Networks
3. Background
- Motivation
  - Data Collection: a large amount of person-specific data has been collected in recent years.
  - Data Mining: data and knowledge extracted by data mining techniques represent a key asset to society.
    - Analyzing trends/patterns.
    - Formulating public policies.
  - Regulatory Laws: some collected data must be made public.
- Privacy
  - The data usually contains sensitive information about respondents.
  - Respondents' privacy may be at risk.
4. Privacy-Preserving Data Publishing
- Two opposing goals
  - To allow researchers to extract knowledge about the data
  - To protect the privacy of every individual
- Microdata table
  - Attributes: Identifier (ID), Quasi-Identifier (QID), Sensitive Attribute (SA)
5. Re-identification by Linking
[Figure: linking the voter registration data with the published microdata]
- Anonymization
  - The first step: remove explicit identifiers
  - Not enough: records can still be re-identified by linking with external databases
  - Alice has ovarian cancer!
6. Real Threats of Linking Attacks
- Fact: 87% of US citizens can be uniquely linked using only three attributes (ZIP code, sex, date of birth)
- Sweeney [Sweeney, 2002] managed to re-identify the medical record of the governor of Massachusetts.
- Census data (income), medical data, transaction data, tax data, etc.
7. k-Anonymity and Generalization
- k-Anonymity
  - Each record is indistinguishable from at least k-1 other records
  - These k records form an equivalence class
  - k-Anonymity ensures that linking cannot be performed with confidence exceeding 1/k (a check is sketched below).
- Generalization
  - Replace QID values with less-specific but semantically consistent values
[Figure: generalization hierarchies for Zipcode, Age, and Sex]
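A minimal sketch (not from the lecture; the table and attribute names are illustrative assumptions) that checks k-anonymity by grouping records on their QID values:

```python
from collections import Counter

def is_k_anonymous(records, qid_attrs, k):
    """k-anonymity holds if every combination of QID values that
    appears in the table appears in at least k records."""
    qid_counts = Counter(tuple(r[a] for a in qid_attrs) for r in records)
    return all(count >= k for count in qid_counts.values())

# Illustrative table with generalized QID values, as on the slides.
records = [
    {"Zipcode": "476**", "Age": "2*", "Disease": "Heart Disease"},
    {"Zipcode": "476**", "Age": "2*", "Disease": "Heart Disease"},
    {"Zipcode": "476**", "Age": "2*", "Disease": "Cancer"},
    {"Zipcode": "4790*", "Age": ">=40", "Disease": "Flu"},
    {"Zipcode": "4790*", "Age": ">=40", "Disease": "Heart Disease"},
    {"Zipcode": "4790*", "Age": ">=40", "Disease": "Cancer"},
]
print(is_k_anonymous(records, ["Zipcode", "Age"], k=3))  # True
```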
8. k-Anonymity and Generalization
[Figure: the microdata and the generalized table]
- 3-Anonymous table
  - Suppose that the adversary knows Alice's QID values (47677, 29, F).
  - The adversary does not know which one of the first 3 records corresponds to Alice's record.
9. Attacks on k-Anonymity
- k-Anonymity does not provide privacy if
  - Sensitive values in an equivalence class lack diversity (homogeneity attack)
  - The attacker has background knowledge (background knowledge attack)
[Figure: a 3-anonymous patient table illustrating the homogeneity and background knowledge attacks]
10. l-Diversity
- Principle
  - Each equivalence class has at least l well-represented sensitive values
- Distinct l-diversity
  - Each equivalence class has at least l distinct sensitive values
  - Still allows probabilistic inference: in an equivalence class of 10 records where 8 have HIV and 2 have other values, an adversary infers HIV with 80% confidence
11. l-Diversity
- Probabilistic l-diversity
  - The frequency of the most frequent value in each equivalence class is bounded by 1/l.
- Entropy l-diversity
  - The entropy of the distribution of sensitive values in each equivalence class is at least log(l)
- Recursive (c,l)-diversity
  - The most frequent value does not appear too frequently
  - Let r_i be the frequency of the i-th most frequent value; require r_1 < c(r_l + r_{l+1} + ... + r_m) (all three variants are sketched below)
12. Limitations of l-Diversity
l-Diversity may be difficult and unnecessary to achieve.
- A single sensitive attribute
  - Two values: HIV positive (1%) and HIV negative (99%)
  - Very different degrees of sensitivity
- l-Diversity is unnecessary to achieve
  - 2-diversity is unnecessary for an equivalence class that contains only negative records
- l-Diversity is difficult to achieve
  - Suppose there are 10000 records in total
  - To have distinct 2-diversity, there can be at most 10000 x 1% = 100 equivalence classes, since each class needs at least one of the 100 positive records
13. Limitations of l-Diversity
l-Diversity is insufficient to prevent attribute disclosure.
- Skewness Attack
  - Two sensitive values: HIV positive (1%) and HIV negative (99%)
  - Serious privacy risk: in an equivalence class with an equal number of positive and negative records, every member appears 50% likely to be HIV positive, far above the 1% baseline
  - l-Diversity does not differentiate between
    - Equivalence class 1: 49 positive, 1 negative
    - Equivalence class 2: 1 positive, 49 negative
- l-Diversity does not consider the overall distribution of sensitive values
14. Limitations of l-Diversity
l-Diversity is insufficient to prevent attribute disclosure.
- Similarity Attack
[Figure: a 3-diverse patient table]
- Conclusion
  - Bob's salary is in [20k, 40k], which is relatively low.
  - Bob has some stomach-related disease.
- l-Diversity does not consider semantic meanings of sensitive values
15. t-Closeness: A New Privacy Measure
[Figure: a completely generalized table]
- External knowledge
- Overall distribution Q of sensitive values
16. t-Closeness: A New Privacy Measure
[Figure: a released table]
- External knowledge
- Overall distribution Q of sensitive values
- Distribution Pi of sensitive values in each equivalence class
17. t-Closeness: A New Privacy Measure
- Rationale
  - Q should be public information
  - Knowledge gain is separated:
    - About the whole population (from B0 to B1)
    - About individuals (from B1 to B2)
  - We bound the knowledge gain between B1 and B2
- Principle
  - The distance between Q and Pi is bounded by a threshold t (a check is sketched below).
  - l-Diversity considers only Pi
[Figure: external knowledge, overall distribution Q, and the distribution Pi of sensitive values in each equivalence class]
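A minimal sketch of the principle (hypothetical helper; `dist` can be any distribution distance, e.g. the EMD introduced on the next slides):

```python
def is_t_close(class_distributions, q, t, dist):
    """t-closeness: the distance between the distribution Pi of
    sensitive values in every equivalence class and the overall
    distribution Q must not exceed the threshold t."""
    return all(dist(p_i, q) <= t for p_i in class_distributions)
```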
18. Distance Measures
- Measure the distance between
  - P = (p1, p2, ..., pm) and Q = (q1, q2, ..., qm)
- Distance measures
  - Trace (variational) distance
  - KL-divergence (standard formulas for both are given below)
- Semantic meanings
  - Q = {20K, 30K, 40K, 50K, 60K, 70K, 80K, 90K, 100K}
  - P1 = {20K, 30K, 40K}
  - P2 = {20K, 60K, 100K}
  - Intuitively, D[P1,Q] > D[P2,Q]
  - Sensitive values have ground distances.
  - D[P,Q] should depend on the ground distances.
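The slide names the two baseline measures without their formulas (lost in extraction); the standard definitions, in the notation above, are:

```latex
% Trace (variational) distance
D[\mathbf{P},\mathbf{Q}] = \frac{1}{2}\sum_{i=1}^{m} \lvert p_i - q_i \rvert

% Kullback--Leibler (KL) divergence
D[\mathbf{P},\mathbf{Q}] = \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i}
```

Neither measure uses the ground distances between values, which is why both miss the intuition that D[P1,Q] > D[P2,Q] in the salary example.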
19. Earth Mover's Distance
- Formulation
  - P = (p1, p2, ..., pm), Q = (q1, q2, ..., qm)
  - dij: the ground distance between element i of P and element j of Q
  - Find a flow F = (fij), where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work, subject to the constraints below
- Simple formulas for EMD can be derived for several common ground distances.
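Reconstructed from the standard EMD formulation (the slide's formula images did not survive extraction; this is the usual linear-program statement):

```latex
\min_{F}\; \mathrm{WORK}(\mathbf{P},\mathbf{Q},F) = \sum_{i=1}^{m}\sum_{j=1}^{m} d_{ij}\, f_{ij}

\text{subject to}\quad
f_{ij} \ge 0, \qquad 1 \le i,\, j \le m,

p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i, \qquad 1 \le i \le m,

\sum_{i=1}^{m}\sum_{j=1}^{m} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{i=1}^{m} q_i = 1.
```

D[P,Q] is then the minimal work over all feasible flows F.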
20. Ground Distances
- EMD for numerical attributes
  - Ordered distance: dij = |i - j| / (m - 1)
- EMD for categorical attributes
  - Equal distance: dij = 1 for i != j
  - Hierarchical distance: based on the taxonomy tree of the attribute's values
- Closed-form EMD expressions for the ordered and equal distances are sketched below.
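A minimal sketch of those closed forms (my reconstruction following the t-closeness paper: for the ordered distance, D[P,Q] = (1/(m-1)) * sum_i |sum_{j<=i} (p_j - q_j)|; for the equal distance, D[P,Q] = (1/2) * sum_i |p_i - q_i|):

```python
def emd_ordered(p, q):
    """EMD under the ordered ground distance d_ij = |i-j|/(m-1),
    for numerical attributes with m ordered values."""
    m = len(p)
    cumulative, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cumulative += pi - qi   # net mass still to move rightward
        total += abs(cumulative)
    return total / (m - 1)

def emd_equal(p, q):
    """EMD under the equal ground distance (d_ij = 1 for i != j),
    for categorical attributes; equals the variational distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Salary example from slide 18: Q uniform over 9 values,
# P1 concentrated on the three lowest, P2 spread across the range.
Q  = [1 / 9] * 9
P1 = [1 / 3] * 3 + [0] * 6
P2 = [1 / 3, 0, 0, 0, 1 / 3, 0, 0, 0, 1 / 3]
print(emd_ordered(P1, Q))  # 0.375 -- P1 leaks "salary is low"
print(emd_ordered(P2, Q))  # ~0.111 -- P2 is much closer to Q
```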
21. Types of Information Disclosure
- Identity Disclosure
  - An individual is linked to a particular record in the published data.
  - k-Anonymity [Sweeney, 2002]
- Attribute Disclosure
  - Sensitive attribute information of an individual is disclosed.
  - l-Diversity [Machanavajjhala et al., 2006]
  - t-Closeness [Li et al., 2007]
- Membership Disclosure
  - Information about whether an individual's record is in the published data or not is disclosed.
  - δ-presence [Nergiz et al., 2007]
22. Query Log Privacy
- In August 2006, the AOL query log release
  - 657K users, 20M queries, over 3 months (03/01-05/31)
  - "A Face Is Exposed for AOL Searcher No. 4417749" (New York Times)
  - Websites removed, employees terminated
- Two opposing goals
  - The need to analyze data for research/service purposes
    - Improve services for both users and advertisers (personalized search)
  - The requirement to protect user privacy
    - Various government laws and regulations
    - Personal information should be protected
      - Income, evaluations, intentions to acquire goods/services
23. Query Log Privacy
- Structure of the AOL query log
  - (AnonID, Query, QueryTime, ItemRank, ClickURL)
  - Note: ClickURL is the truncated URL
- Example: how an AnonID can be re-identified by the NYT
  - Find all log entries for AOL user 4417749
  - Multiple queries for businesses and services in Lilburn, GA
    - That area has around 11K citizens
  - A number of queries for a Jarrett Arnold
    - That area has only 14 citizens with the last name Arnold
  - The NYT contacted the 14 citizens
  - The NYT found out that AOL user 4417749 is Thelma Arnold
24. Personalization Privacy
- Personalized search/advertisements
  - What: tailored and customized search results for a specific user
  - How: based on personal/unique information sent to or generated by a personalized search provider
  - Why: search engines can have a difficult time with ambiguous queries
- References: "Personalized Search", Communications of the ACM, 2002; "Privacy-Enhancing Personalized Web Search", WWW 2007.
25. Personalization Privacy
26. Personalization Privacy
- Personal information used for personalization
  - Previous search histories
  - Age, name, location
  - Interests, activities, career
- Two general approaches
  - Re-ranking results from search engines locally with personal information
    - Main issue: bandwidth requirements
  - Sending personal information and queries together to the search engine
    - Main issue: privacy
27. Social Networks
[Figure: a friendship graph and its naïve anonymization]
- Graph
  - Nodes are entities
  - Edges are relationships between entities
- Privacy
  - Identity disclosure
  - Link disclosure
- Naïve anonymization (replacing names with meaningless node identifiers)
  - Naïve anonymization is not enough
  - E.g., if the attacker knows David has only one friend, David is uniquely identified
28. Privacy in Social Networks
[Figure: the naïvely anonymized graph]
- Adversarial Knowledge
  - Node degrees
  - Sub-graph structures
- Node Degrees
  - The number of neighbors a node has (a degree-based attack is sketched below)
- Sub-graph Structures
  - The sub-graph structure in the neighborhood of a node
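A minimal sketch of the degree-based re-identification (the graph is illustrative; edges are undirected):

```python
from collections import Counter

def unique_degree_nodes(edges):
    """Nodes whose degree is unique in the graph: an adversary who
    knows a target's degree can re-identify exactly these nodes."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    degree_freq = Counter(degree.values())
    return {node: d for node, d in degree.items() if degree_freq[d] == 1}

# Naïvely anonymized friendship graph: names replaced by numbers.
# Node 4 ("David") has exactly one friend, so its degree is unique.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(unique_degree_nodes(edges))  # {3: 3, 4: 1} -- both re-identifiable
```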
29. Questions?