1
Data Anonymization for Privacy-Preserving Data
Publishing
  • CS 526 Information Security
  • Lecture 2
  • Tiancheng Li

2
Outline
  • Background: Data Publishing and Privacy
  • Privacy Measures
  • k-Anonymity
  • l-Diversity
  • t-Closeness
  • Privacy in Other Contexts
  • Privacy in Web Search
  • Query log privacy
  • Personalization privacy
  • Privacy in Social Networks

3
Background
  • Motivation
  • Data Collection: a large amount of
    person-specific data has been collected in recent
    years.
  • Data Mining: data and knowledge extracted by data
    mining techniques represent a key asset to
    society.
  • Analyzing trends/patterns.
  • Formulating public policies.
  • Regulatory Laws: some collected data must be made
    public.
  • Privacy
  • The data usually contains sensitive information
    about respondents.
  • Respondents' privacy may be at risk.

4
Privacy-Preserving Data Publishing
  • Two opposing goals
  • To allow researchers to extract knowledge about
    the data
  • To protect the privacy of every individual
  • Microdata table
  • Identifier (ID), Quasi-Identifier (QID),
    Sensitive Attribute (SA)

5
Re-identification by Linking
Vote Registration Data
The Microdata
  • Anonymization
  • The first step: remove explicit identifiers
  • Not enough: individuals can be re-identified by
    linking with external databases
  • Alice has ovarian cancer!

6
Real Threats of Linking Attacks
  • Fact: 87% of US citizens can be uniquely
    identified using only three attributes: ZIP code,
    sex, and date of birth
  • Sweeney [Sweeney, 2002] managed to re-identify
    the medical record of the governor of
    Massachusetts this way
  • Census data (income), medical data, transaction
    data, tax data, etc.

7
k-Anonymity and Generalization
  • k-Anonymity
  • Each record is indistinguishable from at least
    k-1 other records
  • These k records form an equivalent class
  • k-Anonymity ensures that linking cannot be
    performed with confidence exceeding 1/k
  • Generalization
  • Replace values with less-specific but
    semantically consistent values

(Generalization hierarchies for the QI attributes:
Zipcode, Age, Sex)
8
k-Anonymity and Generalization
The Microdata
The Generalized Table
  • A 3-anonymous table
  • Suppose that the adversary knows Alice's QI
    values (47677, 29, F)
  • The adversary does not know which one of the
    first 3 records corresponds to Alice's record
    (see the sketch below)
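To make this concrete, here is a minimal sketch in Python (the records
and disease values are illustrative, not the slide's exact table) of
checking k-anonymity with respect to a quasi-identifier:

from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """True iff every combination of QI values occurs at least k times."""
    counts = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(c >= k for c in counts.values())

# An equivalence class after generalizing Zipcode/Age/Sex:
table = [
    {"Zipcode": "476**", "Age": "2*", "Sex": "*", "Disease": "Ovarian Cancer"},
    {"Zipcode": "476**", "Age": "2*", "Sex": "*", "Disease": "Heart Disease"},
    {"Zipcode": "476**", "Age": "2*", "Sex": "*", "Disease": "Flu"},
]
print(is_k_anonymous(table, ["Zipcode", "Age", "Sex"], 3))  # True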

9
Attacks on k-Anonymity
  • k-Anonymity does not provide privacy if
  • Sensitive values in an equivalence class lack
    diversity
  • The attacker has background knowledge

A 3-anonymous patient table
Homogeneity Attack
Background Knowledge Attack
10
l-Diversity
  • Principle
  • Each equivalence class has at least l
    well-represented sensitive values
  • Distinct l-diversity
  • Each equivalence class has at least l distinct
    sensitive values
  • Probabilistic inference

(Example equivalence class: 10 records, 8 with HIV
and 2 with other values; distinct 2-diversity holds,
yet an adversary infers HIV with 80% confidence)
11
l-Diversity
  • Probabilistic l-diversity
  • The frequency of the most frequent value in an
    equivalence class is bounded by 1/l.
  • Entropy l-diversity
  • The entropy of the distribution of sensitive
    values in each equivalence class is at least
    log(l)
  • Recursive (c,l)-diversity
  • The most frequent value does not appear too
    frequently: with r_i the frequency of the i-th
    most frequent value, require
    r_1 < c (r_l + r_(l+1) + ... + r_m)
  • (All four variants are sketched below.)
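A minimal sketch of the four variants, each applied to one equivalence
class given as a list of sensitive values (the example class is the
10-record class from the previous slide):

import math
from collections import Counter

def distinct_l_diverse(values, l):
    # At least l distinct sensitive values
    return len(set(values)) >= l

def probabilistic_l_diverse(values, l):
    # The most frequent value has frequency at most 1/l
    most_common = Counter(values).most_common(1)[0][1]
    return most_common / len(values) <= 1.0 / l

def entropy_l_diverse(values, l):
    # Entropy of the sensitive-value distribution is at least log(l)
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n)
                   for c in Counter(values).values())
    return entropy >= math.log(l)

def recursive_cl_diverse(values, c, l):
    # r_1 < c * (r_l + ... + r_m), with frequencies r in decreasing order
    r = sorted(Counter(values).values(), reverse=True)
    return r[0] < c * sum(r[l - 1:])

ec = ["HIV"] * 8 + ["Flu", "Cancer"]   # 10 records, 8 with HIV
print(distinct_l_diverse(ec, 2))       # True
print(probabilistic_l_diverse(ec, 2))  # False: P(HIV) = 0.8 > 1/2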

12
Limitations of l-Diversity
l-Diversity may be difficult and unnecessary to
achieve.
  • A single sensitive attribute
  • Two values: HIV positive (1%) and HIV negative
    (99%)
  • Very different degrees of sensitivity
  • l-diversity is unnecessary to achieve
  • 2-diversity is unnecessary for an equivalence
    class that contains only negative records
  • l-diversity is difficult to achieve
  • Suppose there are 10000 records in total, of
    which 1% (100) are HIV positive
  • To have distinct 2-diversity, every equivalence
    class needs at least one positive record, so
    there can be at most 100 equivalence classes

13
Limitations of l-Diversity
l-Diversity is insufficient to prevent attribute
disclosure.
Skewness Attack
  • Two sensitive values
  • HIV positive (1%) and HIV negative (99%)
  • Serious privacy risk
  • Consider an equivalence class that contains an
    equal number of positive records and negative
    records: membership implies a 50% chance of being
    positive, versus 1% in the overall population
  • l-diversity does not differentiate between
  • Equivalence class 1: 49 positive, 1 negative
  • Equivalence class 2: 1 positive, 49 negative

l-Diversity does not consider the overall
distribution of sensitive values
14
Limitations of l-Diversity
l-Diversity is insufficient to prevent attribute
disclosure.
A 3-diverse patient table
Similarity Attack
  • Conclusion
  • Bob's salary is in [20k, 40k], which is
    relatively low
  • Bob has some stomach-related disease.

l-Diversity does not consider semantic meanings
of sensitive values
15
t-Closeness: A New Privacy Measure
  • Adversarial belief

(Figure: from a completely generalized table, the
adversary with external knowledge learns only the
overall distribution Q of sensitive values)
16
t-Closeness: A New Privacy Measure
  • Adversarial belief

(Figure: from the released table, the adversary
additionally learns the distribution Pi of sensitive
values in each equivalence class)
17
t-Closeness: A New Privacy Measure
  • Adversarial belief
  • Rationale
  • Q should be public information
  • Knowledge gain is separated
  • About whole population (from B0 to B1)
  • About individuals (from B1 to B2)
  • We bound knowledge gain between B1 and B2
  • Principle
  • The distance between Q and Pi is bounded by a
    threshold t.
  • l-diversity considers only Pi

(Figure: external knowledge; overall distribution Q;
distribution Pi in each equivalence class)
18
Distance Measures
  • Measure the distance between
  • P = (p1, p2, ..., pm) and Q = (q1, q2, ..., qm)
  • Distance measures
  • Trace distance: D[P,Q] = (1/2) sum_i |pi - qi|
  • KL divergence: D[P,Q] = sum_i pi log(pi / qi)
  • Semantic meanings
  • Q = {20K, 30K, 40K, 50K, 60K, 70K, 80K, 90K, 100K}
  • P1 = {20K, 30K, 40K}
  • P2 = {20K, 60K, 100K}
  • Intuitively, D[P1,Q] > D[P2,Q]
  • Sensitive values have ground distances
  • D[P,Q] should depend on the ground distances
    (see the sketch below)
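A minimal sketch of both measures on the salary example (the uniform Q
is an assumption, since the slide does not give the exact counts):

import math

def trace_distance(P, Q):
    # D[P,Q] = (1/2) * sum_i |p_i - q_i|
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def kl_divergence(P, Q):
    # D[P,Q] = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

Q  = [1/9] * 9                          # uniform over the 9 salary values
P1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]  # {20K, 30K, 40K}
P2 = [1/3, 0, 0, 0, 1/3, 0, 0, 0, 1/3]  # {20K, 60K, 100K}

# Both measures assign P1 and P2 the same distance from Q, even though
# P1 intuitively discloses more -- hence the need for ground distances.
print(trace_distance(P1, Q), trace_distance(P2, Q))  # 0.667 0.667
print(kl_divergence(P1, Q), kl_divergence(P2, Q))    # 1.099 1.099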

19
Earth Mover's Distance
  • Formulation
  • P = (p1, p2, ..., pm), Q = (q1, q2, ..., qm)
  • d_ij: the ground distance between element i of P
    and element j of Q
  • Find a flow F = [f_ij], where f_ij is the flow of
    mass from element i of P to element j of Q, that
    minimizes the overall work
    WORK(P,Q,F) = sum_i sum_j d_ij f_ij
  • subject to the constraints
    f_ij >= 0
    p_i - sum_j f_ij + sum_j f_ji = q_i
    sum_i sum_j f_ij = sum_i p_i = sum_i q_i = 1
  • Simple formulas for EMD can be derived for
    several common ground distances (next slide); a
    general LP sketch follows below
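A minimal sketch of this optimization as a transportation linear
program, assuming SciPy is available; for probability vectors, the
row-sum/column-sum constraints below are equivalent to the flow
constraints above:

import numpy as np
from scipy.optimize import linprog

def emd(P, Q, d):
    """EMD between probability vectors P and Q (length m), given an
    m-by-m ground distance matrix d, solved as a transportation LP."""
    m = len(P)
    c = np.asarray(d, dtype=float).reshape(m * m)  # minimize sum d_ij f_ij
    A_eq, b_eq = [], []
    for i in range(m):             # total mass shipped out of P's element i
        row = np.zeros(m * m)
        row[i * m:(i + 1) * m] = 1
        A_eq.append(row); b_eq.append(P[i])
    for j in range(m):             # total mass shipped into Q's element j
        col = np.zeros(m * m)
        col[j::m] = 1
        A_eq.append(col); b_eq.append(Q[j])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun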

20
Ground Distances
  • EMD for numerical attributes
  • Ordered distance: d_ij = |i - j| / (m - 1), based
    on the number of values between the two elements
    in the total order
  • EMD for categorical attributes
  • Equal distance: any two distinct values are at
    ground distance 1
  • Hierarchical distance: based on the lowest level
    of the taxonomy tree at which the two values fall
    under the same node
  • (Closed-form sketches for the first two follow
    below.)
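Minimal sketches of the two simplest closed forms (both are standard
results from the t-closeness paper), reusing the salary distributions
from slide 18:

def emd_ordered(P, Q):
    # Ordered distance, d_ij = |i - j| / (m - 1):
    # EMD = (1 / (m - 1)) * sum_i |r_1 + ... + r_i|, where r_j = p_j - q_j
    m = len(P)
    total, cum = 0.0, 0.0
    for p, q in zip(P, Q):
        cum += p - q
        total += abs(cum)
    return total / (m - 1)

def emd_equal(P, Q):
    # Equal distance, d_ij = 1 for i != j:
    # EMD = (1/2) * sum_i |p_i - q_i|
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

Q  = [1/9] * 9
P1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]  # {20K, 30K, 40K}
P2 = [1/3, 0, 0, 0, 1/3, 0, 0, 0, 1/3]  # {20K, 60K, 100K}
print(emd_ordered(P1, Q))  # 0.375
print(emd_ordered(P2, Q))  # 0.111, so D[P1,Q] > D[P2,Q] as intuition says

# An equivalence class with distribution P satisfies t-closeness when
# its EMD from the overall distribution Q is at most the threshold t.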

21
Types of Information Disclosure
  • Identity Disclosure
  • An individual is linked to a particular record in
    the published data.
  • k-Anonymity [Sweeney, 2002]
  • Attribute Disclosure
  • Sensitive attribute information of an individual
    is disclosed.
  • l-Diversity [Machanavajjhala et al., 2006]
  • t-Closeness [Li et al., 2007]
  • Membership Disclosure
  • Information about whether an individual's record
    is in the published data or not is disclosed
  • δ-presence [Nergiz et al., 2007]

22
Query Log Privacy
  • In August 2006, AOL released a query log
  • 657K users, 20M queries, over 3 months
    (03/01-05/31)
  • "A Face Is Exposed for AOL Searcher No. 4417749"
    (New York Times)
  • The data was removed from the website; employees
    were terminated
  • Two opposing goals
  • The need to analyze data for research/service
    purposes
  • Improve services for both users and advertisers
    (Personalized search)
  • The requirement for protecting user privacy
  • Various government laws and regulations
  • Personal information should be protected
  • Income, evaluations, intentions to acquire
    goods/services

23
Query Log Privacy
  • Structure of AOL query log
  • Note: ClickURL is the truncated clicked URL
  • Example: how an AnonID was re-identified by the
    NYT
  • Find all log entries for AOL User 4417749
  • Multiple queries for businesses and services in
    Lilburn, GA
  • That area has around 11K citizens
  • A number of queries for a Jarrett Arnold
  • That area has only 14 citizens with the last name
    Arnold
  • The NYT contacts the 14 citizens
  • The NYT finds out AOL User 4417749 is Thelma
    Arnold

24
Personalization Privacy
  • Personalized search/advertisements
  • What: tailored and customized search results for
    a specific user
  • How: based on personal/unique information sent to
    or generated by a personalized search provider
  • Why: search engines can have a difficult time
    with ambiguous queries

"Personalized Search," Communications of the ACM,
2002. "Privacy-Enhancing Personalized Web Search,"
WWW 2007.
25
Personalization Privacy
26
Personalization Privacy
  • Personal information used for personalization
  • Previous search histories
  • Age, name, location
  • Interests, activities, career
  • Two general approaches
  • Re-ranking results from search engines locally
    with personal information
  • Main Issue bandwidth requirements
  • Sending personal information and queries together
    to the search engine
  • Main Issue privacy issues

27
Social Networks
Friendship Graph
  • Graph
  • Nodes are entities
  • Edges are relationships between entities
  • Privacy
  • Identity disclosure
  • Link disclosure
  • Naïve Anonymization
  • Naïve anonymization is not enough
  • E.g., if the attacker knows David has only one
    friend, David is uniquely identified

Naïve Anonymization
28
Privacy in Social Networks
Naïve Anonymization
  • Adversarial Knowledge
  • Node degrees
  • Sub-graph structures
  • Node Degrees
  • The number of neighbors a node has
  • Sub-graph Structures
  • The sub-graph structure among the neighbors of a
    node (a degree-based sketch follows below)
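A minimal sketch of degree-based re-identification in a naively
anonymized graph (the small graph is illustrative, not the slide's
figure):

from collections import Counter

# Illustrative anonymized friendship graph
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n3"), ("n3", "n4")]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# A node whose degree is shared by no other node is re-identified by an
# adversary who knows the corresponding person's number of friends.
deg_counts = Counter(degree.values())
unique = [n for n, d in degree.items() if deg_counts[d] == 1]
print(unique)  # ['n3', 'n4']: if David has exactly one friend, he is n4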

29
Questions?
  • Thank you!