Privacy Preserving Data Publication

1
Privacy Preserving Data Publication
  • Yufei Tao
  • Department of Computer Science and Engineering
  • Chinese University of Hong Kong

2
Centralized publication
  • Assume that a hospital wants to publish the
    following table, called the microdata.
  • The publication must preserve the privacy of
    patients.
  • Prevent an adversary from knowing
    who-contracted-what.

Microdata
3
Centralized publication (cont.)
  • A simple solution: remove the Name column.
  • This does not work, as the next slide shows.

publish
4
Linking attacks
A voter registration list
The published table
Quasi-identifier (QI) attributes
An adversary
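A minimal sketch of the linking attack, assuming hypothetical records: the names, ages, Zipcodes and diseases below are invented for illustration and are not the tables shown on the slide. The adversary joins the published table with the voter registration list on the QI attributes; any voter whose QI values match exactly one published row is re-identified.

```python
# Hypothetical data: none of these rows come from the original slides.
voter_list = [
    {"name": "Alice", "age": 29, "sex": "F", "zipcode": "12000"},
    {"name": "Bob", "age": 41, "sex": "M", "zipcode": "25000"},
    {"name": "Carol", "age": 41, "sex": "F", "zipcode": "25000"},
]
published = [  # the Name column has been removed before publication
    {"age": 29, "sex": "F", "zipcode": "12000", "disease": "flu"},
    {"age": 41, "sex": "M", "zipcode": "25000", "disease": "bronchitis"},
    {"age": 41, "sex": "F", "zipcode": "25000", "disease": "gastritis"},
]
QI = ("age", "sex", "zipcode")


def link(voters, table, qi=QI):
    """Return {name: disease} for every voter whose QI values match exactly one row."""
    found = {}
    for voter in voters:
        matches = [row for row in table if all(row[a] == voter[a] for a in qi)]
        if len(matches) == 1:
            found[voter["name"]] = matches[0]["disease"]
    return found


leaks = link(voter_list, published)
```

In this toy input every QI combination is unique, so dropping the names protects nobody.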
5
These are real threats
  • Fact: 87% of Americans can be uniquely identified
    by Zipcode, gender, and date of birth.
  • A famous experiment by Sweeney [International
    Journal on Uncertainty, Fuzziness and
    Knowledge-based Systems, 2002]
  • found the medical record of an ex-governor of
    Massachusetts.

6
Objectives
  • Publish a distorted version of the dataset so
    that
  • Privacy: the privacy of all individuals is
    adequately protected.
  • Utility: the dataset is useful for analyzing the
    characteristics of the microdata.
  • Paradox: privacy protection ↑, utility ↓.

7
Issues
  • Privacy principle
  • What is adequate privacy protection?
  • Distortion approach
  • How to achieve the privacy principle?
  • The literature has discussed other issues as
    well.
  • Complexities, improving the utility of the
    published data, etc.

8
Principle 1: k-anonymity
[Sweeney, International Journal on Uncertainty,
Fuzziness and Knowledge-based Systems, 2002]
Sensitive attribute
  • 2-anonymous generalization

QI attributes
A voter registration list
4 QI groups
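A minimal sketch of a k-anonymity check, assuming a table represented as a list of dictionaries. The function name `is_k_anonymous` and the generalized rows are hypothetical and do not reproduce the slide's table. A table is k-anonymous when every combination of QI values is shared by at least k tuples.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """True if every combination of QI values occurs in at least k rows."""
    groups = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(size >= k for size in groups.values())


# Hypothetical generalized table (not the table on the slide).
table = [
    {"age": "[21, 30]", "zipcode": "[10001, 20000]", "disease": "flu"},
    {"age": "[21, 30]", "zipcode": "[10001, 20000]", "disease": "pneumonia"},
    {"age": "[31, 40]", "zipcode": "[20001, 30000]", "disease": "gastritis"},
    {"age": "[31, 40]", "zipcode": "[20001, 30000]", "disease": "gastritis"},
]
two_anonymous = is_k_anonymous(table, ("age", "zipcode"), 2)
three_anonymous = is_k_anonymous(table, ("age", "zipcode"), 3)
```

The second QI group in this toy table holds a single disease, so the table passes the 2-anonymity check while still revealing the disease of everyone in that group.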
9
Defects of k-anonymity
  • What is the disease of Joe?

No diversity in this QI group.
A voter registration list
10
Principle 2: l-diversity
[Machanavajjhala et al., ICDE, 2006]
  • Each QI group should have at least l
    well-represented sensitive values.
  • Different ways to interpret well-represented.

11
Naive interpretation
  • Each QI-group has l different sensitive values.

A 2-diverse table
12
Defects of the naive interpretation
  • Assume that Joe is identified in the QI group.
    What is the probability that he contracted HIV?
  • Implication: the most frequent sensitive value in
    a QI group cannot be too frequent.
  • But achieving only this still leaves the table
    vulnerable to attacks with background knowledge.

98 tuples
A QI group with 100 tuples
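A small sketch of why counting distinct values is not enough. It assumes the 98 tuples in the figure carry HIV and that the remaining 2 tuples carry a single other value (here "flu"); both assignments are assumptions, since the figure itself is not part of the transcript.

```python
from collections import Counter
from fractions import Fraction


def distinct_l_diverse(values, l):
    """Naive interpretation: the group has at least l different sensitive values."""
    return len(set(values)) >= l


def max_share(values):
    """Share of the most frequent sensitive value in the group."""
    counts = Counter(values)
    return Fraction(max(counts.values()), len(values))


# Assumed composition of the 100-tuple QI group.
group = ["HIV"] * 98 + ["flu"] * 2
naive_ok = distinct_l_diverse(group, 2)
hiv_probability = max_share(group)
```

Under these assumptions the group passes the naive 2-diversity test, yet an adversary who places Joe in the group infers HIV with probability 98 / 100.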
13
Background knowledge attack
  • Let Joe be an individual in the QI group having
    HIV.
  • A friend of Joe has the background knowledge:
    "Joe does not have pneumonia".
  • How likely would this friend assume that Joe had
    HIV?

50 tuples
A QI group with 100 tuples
49 tuples
14
Controlling also the 2nd most frequent value
  • Even if an adversary can eliminate pneumonia,
    s/he can only assume that Joe has HIV with
    probability 40 / 70.

40 tuples
A QI group with 100 tuples
30 tuples
30 tuples
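A sketch of the adversary's reasoning on the two 100-tuple groups. The assignment of counts to diseases (HIV as the most frequent value, pneumonia as the second, "bronchitis" for the rest) is an assumption, because the figures are not part of the transcript. Eliminating a value removes its tuples from the denominator.

```python
from collections import Counter
from fractions import Fraction


def posterior(values, target, eliminated):
    """Probability of `target` once the values in `eliminated` are ruled out."""
    counts = Counter(values)
    remaining = sum(n for value, n in counts.items() if value not in eliminated)
    return Fraction(counts[target], remaining)


# Assumed compositions of the two QI groups.
group_50_49 = ["HIV"] * 50 + ["pneumonia"] * 49 + ["bronchitis"] * 1
group_40_30_30 = ["HIV"] * 40 + ["pneumonia"] * 30 + ["bronchitis"] * 30

p_uncontrolled = posterior(group_50_49, "HIV", {"pneumonia"})
p_controlled = posterior(group_40_30_30, "HIV", {"pneumonia"})
```

With these assumed counts the first group yields 50 / 51 and the second 40 / 70, which is why the second most frequent value also has to be controlled.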
15
An example of 5-diversity
The most frequent value
The 2nd most frequent value
A QI group
The 3rd most frequent value
The 4th most frequent value
The other values
16
An example of 5-diversity (cont.)
The most frequent value
A QI group
Same cardinality
The other values
17
An example of 5-diversity (cont.)
  • Assume that Joe is a person in the QI group.
  • Property: if an adversary can eliminate at most 3
    diseases, s/he can correctly guess the disease of
    Joe with at most 50% probability.

HIV
pneumonia
A QI group
bronchitis
cancer
The other values
18
l-diversity
  • Consider a QI group.
  • m is the number of distinct sensitive values in
    the group.
  • r_1 is the number of tuples having the most
    frequent sensitive value.
  • r_2 is the number of tuples having the 2nd most
    frequent sensitive value.
  • r_m is the number of tuples having the m-th most
    frequent sensitive value.
  • Then, r_1 ≤ c (r_l + r_(l+1) + ... + r_m), where c
    is a constant.
  • If an adversary can eliminate at most l − 2
    sensitive values, s/he can infer the disease of a
    person with probability at most c / (c + 1).
  • Called (c, l)-diversity precisely.
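A minimal sketch of the (c, l)-diversity test, assuming the condition reads r_1 ≤ c (r_l + ... + r_m) with the counts sorted in descending order. Under that reading, once the 2nd to (l − 1)-th most frequent values are eliminated, the share of the most frequent value is r_1 / (r_1 + r_l + ... + r_m), which is at most c / (c + 1). The function names and the sample group are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def sorted_counts(values):
    """Frequencies of the sensitive values, most frequent first."""
    return sorted(Counter(values).values(), reverse=True)


def is_cl_diverse(values, c, l):
    """Check r_1 <= c * (r_l + ... + r_m); fails if fewer than l distinct values."""
    r = sorted_counts(values)
    if len(r) < l:
        return False
    return r[0] <= c * sum(r[l - 1:])


def posterior_after_elimination(values, l):
    """Share of the top value after the 2nd..(l-1)-th most frequent values are removed."""
    r = sorted_counts(values)
    return Fraction(r[0], r[0] + sum(r[l - 1:]))


# Assumed group: 40 / 30 / 30 tuples over three diseases.
group = ["HIV"] * 40 + ["pneumonia"] * 30 + ["bronchitis"] * 30
c = Fraction(4, 3)
diverse = is_cl_diverse(group, c, 3)
top_share = posterior_after_elimination(group, 3)
bound = c / (c + 1)
```

For this group the condition holds with equality at c = 4/3, and the resulting share 40 / 70 coincides with the bound c / (c + 1) = 4 / 7.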

19
Defects of l-diversity
  • Andy does not want anyone to know that he had a
    stomach problem.
  • Sarah does not mind at all if others find out
    that she had flu.

A 2-diverse table
A voter registration list
20
Defects of l-diversity (cont.)
  • Does not work if an individual can have multiple
    tuples in the microdata.

Microdata
21
Defects of l-diversity (cont.)

A 2-diverse table
A voter registration list
22
Principle 3: Personalized anonymity
[Xiao and Tao, SIGMOD, 2006]
  • Key ideas: guarding node + sensitive attribute
    (SA) generalization
  • Assume a publicly-known hierarchy on the
    sensitive attribute.

23
Guarding node
  • Andy does not want anyone to know that he had a
    stomach problem.
  • He can specify stomach disease as the guarding
    node for his tuple.
  • Protect Andy from being conjectured to have any
    disease in the subtree of the guarding node.

24
Guarding node (cont.)
  • Sarah is willing to disclose her exact symptom.
  • She can specify Ø as the guarding node for her
    tuple.

25
Guarding node (cont.)
  • Bill does not have any special preference.
  • He sets the guarding node of his tuple to be the
    same as his sensitive value.

26
A personalized approach
27
Personalized anonymity
  • No adversary should be able to breach the privacy
    requirement of any guarding node with a
    probability above p_breach.
  • If p_breach = 0.3, then no adversary can have
    more than 30% probability of finding out that:
  • Andy had a stomach disease
  • Bill had dyspepsia

28
Why SA generalization?
  • How many female patients are there with age above
    30?
  • Estimate: 4 × (60 − 30 + 1) / (60 − 21 + 1) ≈ 3
  • Real answer: 1

Pure QI generalization
Microdata
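The estimate on this slide can be reproduced with a short sketch. It assumes the slide's arithmetic is 4 × (60 − 30 + 1) / (60 − 21 + 1): four matching tuples whose age was generalized to [21, 60], with ages taken as uniformly distributed inside that interval. The helper name `overlap_fraction` is hypothetical.

```python
from fractions import Fraction


def overlap_fraction(interval, query):
    """Fraction of the integers in `interval` that also lie in `query` (both inclusive)."""
    lo = max(interval[0], query[0])
    hi = min(interval[1], query[1])
    covered = max(0, hi - lo + 1)
    return Fraction(covered, interval[1] - interval[0] + 1)


# 4 candidate tuples, generalized age interval [21, 60], query range taken as [30, 60]
# to mirror the (60 - 30 + 1) term in the slide's arithmetic.
estimate = 4 * overlap_fraction((21, 60), (30, 60))
```

The result is 31 / 10, roughly 3, against a real answer of 1: the uniformity assumption is what introduces the error.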
29
SA generalization (cont.)

With SA generalization
Pure QI generalization
30
Evaluation of disclosure risk
  • What is the probability that the adversary can
    find out that Andy had a stomach disease?

A voter registration list
The published data
31
Combinatorial reconstruction (cont.)
  • Can each individual appear more than once?
  • No: the primary case
  • Yes: the non-primary case
  • Some possible reconstructions

The primary case
The non-primary case
33
Breach probability (primary)
  • There are 120 possible reconstructions in total.
  • Suppose Andy is associated with a stomach disease
    in n_b reconstructions.
  • Then the probability with which the adversary
    associates Andy with some stomach problem is
    n_b / 120.
  • Andy is associated with
  • gastric ulcer in 24 reconstructions
  • dyspepsia in 24 reconstructions
  • gastritis in 0 reconstructions
  • n_b = 48
  • The breach probability for Andy's tuple is
    48 / 120 = 2 / 5.

34
Breach probability (non-primary)
  • There are 625 possible reconstructions in total.
  • Andy is associated with gastric ulcer or
    dyspepsia or gastritis in 225 reconstructions.
  • n_b = 225
  • The breach probability for Andy's tuple is
  • 225 / 625 = 9 / 25
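The counts 120, 48, 625 and 225 can be reproduced by brute-force enumeration. The setup below is an assumption chosen because it matches those counts (120 = 5 × 4 × 3 × 2 and 625 = 5^4): four published tuples, two of them carrying a stomach disease, and five candidate individuals in the voter list. The names other than Andy and the two "other" values are placeholders.

```python
from fractions import Fraction
from itertools import permutations, product

# Assumed setup; only Andy and the disease names come from the slides.
people = ["Andy", "person 2", "person 3", "person 4", "person 5"]
tuple_values = ["gastric ulcer", "dyspepsia", "other value 1", "other value 2"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}


def breach(reconstructions, person="Andy"):
    """Return (n_b, total, n_b / total) over the given tuple-to-owner assignments."""
    reconstructions = list(reconstructions)
    n_b = sum(
        any(owner == person and value in stomach
            for owner, value in zip(owners, tuple_values))
        for owners in reconstructions
    )
    return n_b, len(reconstructions), Fraction(n_b, len(reconstructions))


# Primary case: each individual owns at most one tuple.
primary = breach(permutations(people, len(tuple_values)))
# Non-primary case: an individual may own several tuples.
non_primary = breach(product(people, repeat=len(tuple_values)))
```

Each reconstruction assigns an owner to every published tuple; the two cases differ only in whether an owner may repeat.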

35
A defect of personalized anonymity
  • Does not guard against background knowledge.
  • Recall that l-diversity can achieve this purpose.
  • But it seems possible to adapt the personalized
    approach to tackle background knowledge.
  • Future work?

36
Other privacy principles
  • k-gather.
  • Due to Aggarwal et al., PODS, 2006
  • Suffers from the problems of k-anonymity.
  • (α, k)-anonymity
  • Due to Wong et al., KDD, 2006
  • t-closeness.
  • Recently proposed by Li and Li, ICDE, 2007

37
Issues
  • Privacy principle
  • What is adequate privacy protection?
  • Distortion approach
  • How to achieve the privacy principle?

38
Three approaches
  • Suppression
  • We do not discuss it because
  • the utility of the resulting table is low
  • it can be regarded as a special case of
    generalization.
  • Generalization
  • Due to Sweeney, International Journal on
    Uncertainty, Fuzziness and Knowledge-based
    Systems, 2002
  • Anatomy (also called bucketization)
  • Due to Xiao and Tao, VLDB, 2006
  • Each of the above approaches can be integrated
    with all the privacy principles discussed
    earlier.

39
A multidimensional view of generalization
40
Taxonomy of generalization
[LeFevre et al., SIGMOD, 2005]
  • Local recoding
  • (Generalized) rectangles may overlap.
  • Suppression is a special case of local recoding.
  • Global recoding
  • All rectangles are disjoint.

41
Taxonomy of generalization (cont.)
  • Global recoding can be further divided.
  • Single-dimension recoding
  • Rectangles form a grid.
  • Multi-dimension recoding
  • The opposite of single-dimension recoding.

42
Taxonomy of generalization (cont.)
  • Single-dimension recoding can be further divided.
  • Full-domain recoding
  • Full-subtree recoding
  • Both assume a hierarchy on each QI attribute.
  • Example A hierarchy on Age

43
Taxonomy of generalization (cont.)
  • Full-domain recoding
  • All age values must be generalized to the same
    level of the hierarchy.

44
Taxonomy of generalization (cont.)
  • Full-subtree recoding
  • The subtrees of all generalized values must be
    disjoint.
  • Permissible generalization
  • [1, 30], [31, 40], [41, 50], [51, 60], [61, 90].
  • Illegal generalization
  • [1, 10], [1, 30], [31, 60], [61, 90].
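A small sketch of the disjointness requirement, reading the two lists above as closed age intervals. It checks only that the intervals do not overlap; it does not verify that each interval is a node of the Age hierarchy, since that hierarchy is a figure not included in the transcript. The function name is hypothetical.

```python
def intervals_disjoint(intervals):
    """True if no two closed integer intervals share a value."""
    ordered = sorted(intervals)
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))


permissible = [(1, 30), (31, 40), (41, 50), (51, 60), (61, 90)]
illegal = [(1, 10), (1, 30), (31, 60), (61, 90)]

permissible_ok = intervals_disjoint(permissible)
illegal_ok = intervals_disjoint(illegal)
```

The second list fails because [1, 10] lies inside [1, 30], so the subtrees of those two generalized values overlap.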

45
Why all these generalization types?
  • Reason 1: If a dataset is generalized in a more
    restricted manner, less preprocessing is required
    before it can be analyzed by a standard
    statistical tool (such as SAS).

46
Why all these generalization types?
  • Reason 2: More restrictive generalization is
    usually faster to compute and easier to analyze.

47
Why all these generalization types?
  • Reason 3: Less restrictive generalization
    promises more accurate data analysis, provided
    that a sophisticated analytical method is used.

48
Generalization algorithms
  • Operate on a quality metric. Examples:
  • The generalization level (for full-domain
    recoding)
  • Total rectangle size (for local recoding)
  • Mostly heuristics-based.
  • Finding the optimal generalization is often
    NP-hard.

49
Defect of generalization
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]
  • Estimated answer: 2p, where p is the probability
    that each of the two tuples satisfies the query
    conditions on Age and Zipcode.

50
Defect of generalization (cont.)
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]
  • p = Area(R1 ∩ Q) / Area(R1) = 0.05
  • Estimated answer for Query A: 2p = 0.1
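A sketch of the area-ratio estimate. The extent of R1 is not given in the transcript, so the rectangle below (Age [21, 60], Zipcode [10001, 60000]) is an assumption chosen to reproduce p = 0.05; areas are measured by counting integer values on each axis.

```python
from fractions import Fraction


def count_ints(lo, hi):
    """Number of integers in the closed range [lo, hi], or 0 if it is empty."""
    return max(0, hi - lo + 1)


def overlap_probability(rect, query):
    """Area(rect ∩ query) / Area(rect), assuming tuples are uniform inside rect."""
    p = Fraction(1)
    for attr, (lo, hi) in rect.items():
        q_lo, q_hi = query[attr]
        p *= Fraction(count_ints(max(lo, q_lo), min(hi, q_hi)), count_ints(lo, hi))
    return p


R1 = {"age": (21, 60), "zipcode": (10001, 60000)}  # assumed extent of the QI group
Q = {"age": (0, 30), "zipcode": (10001, 20000)}    # ranges of Query A

p = overlap_probability(R1, Q)
estimate = 2 * p  # two pneumonia tuples in the group
```

With this assumed rectangle, a quarter of the age range and a fifth of the Zipcode range fall inside the query, giving p = 1 / 20 and an estimate of 1 / 10.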

51
Defect of generalization (cont.)
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]
  • Estimated answer: 0.1
  • The exact answer: 1

52
Defect of generalization (cont.)
  • Cause of inaccuracy: the QI distribution inside
    each QI group is lost!

53
Anatomy
  • Releases a quasi-identifier table (QIT) and a
    sensitive table (ST).

Sensitive table (ST)
Quasi-identifier table (QIT)
Microdata
54
Anatomy (cont.)
  • 1. Decide an l-diverse partition of the tuples.

QI group 1
QI group 2
A 2-diverse partition
55
Anatomy (cont.)
  • 2. Generate a quasi-identifier table (QIT) and a
    sensitive table (ST) based on the selected
    partition.

group 1
group 2
quasi-identifier table (QIT)
sensitive table (ST)
56
Anatomy (cont.)
  • 2. Generate a quasi-identifier table (QIT) and a
    sensitive table (ST) based on the decided
    partition.

quasi-identifier table (QIT)
sensitive table (ST)
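A minimal sketch of step 2, assuming the partition is given as lists of row indices. The function name `anatomize`, the column names and the eight sample rows are hypothetical and do not reproduce the slide's tables. The QIT keeps the exact QI values plus a group ID; the ST keeps, per group, each sensitive value with its count.

```python
from collections import Counter


def anatomize(rows, partition, qi, sensitive):
    """Split rows into a QIT and an ST; `partition` lists row indices per QI group."""
    qit, st = [], []
    for group_id, members in enumerate(partition, start=1):
        for i in members:
            # QI values are published unchanged, tagged with the group ID.
            qit.append({**{a: rows[i][a] for a in qi}, "group_id": group_id})
        # The ST records only how often each sensitive value occurs in the group.
        for value, count in Counter(rows[i][sensitive] for i in members).items():
            st.append({"group_id": group_id, sensitive: value, "count": count})
    return qit, st


# Hypothetical microdata and 2-diverse partition.
rows = [
    {"age": 23, "zipcode": 11000, "disease": "pneumonia"},
    {"age": 27, "zipcode": 13000, "disease": "dyspepsia"},
    {"age": 35, "zipcode": 59000, "disease": "dyspepsia"},
    {"age": 59, "zipcode": 12000, "disease": "pneumonia"},
    {"age": 61, "zipcode": 54000, "disease": "flu"},
    {"age": 65, "zipcode": 25000, "disease": "gastritis"},
    {"age": 65, "zipcode": 25000, "disease": "flu"},
    {"age": 70, "zipcode": 30000, "disease": "bronchitis"},
]
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]
qit, st = anatomize(rows, partition, ("age", "zipcode"), "disease")
```

No QI value is generalized; the link between an individual row and its sensitive value is hidden only by the grouping.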
57
Privacy preservation
  • Given a pair of QIT and ST generated from an
    l-diverse partition, an adversary can infer the
    sensitive value of each individual with
    confidence at most 1 / l.

sensitive table (ST)
quasi-identifier table (QIT)
58
Data analysis
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]

Sensitive table (ST)
Quasi-identifier table (QIT)
59
Data analysis (cont.)
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]
  • 2 patients contracted pneumonia
  • 2 out of 4 patients satisfy the query conditions
    on Age and Zipcode
  • Estimated answer: 2 × 2 / 4 = 1.

t1, t2, t3, t4
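A sketch of how Query A could be estimated from a QIT and an ST. The two tables below are hypothetical, built so that group 1 has 4 tuples, 2 pneumonia cases and 2 tuples inside the query ranges, as on the slide. For each group, the count of the queried disease is scaled by the fraction of the group's QIT rows that satisfy the QI conditions.

```python
from fractions import Fraction

# Hypothetical published tables (not the slide's tables).
qit = [
    {"age": 23, "zipcode": 11000, "group_id": 1},
    {"age": 27, "zipcode": 13000, "group_id": 1},
    {"age": 35, "zipcode": 59000, "group_id": 1},
    {"age": 59, "zipcode": 12000, "group_id": 1},
    {"age": 61, "zipcode": 54000, "group_id": 2},
    {"age": 65, "zipcode": 25000, "group_id": 2},
    {"age": 65, "zipcode": 25000, "group_id": 2},
    {"age": 70, "zipcode": 30000, "group_id": 2},
]
st = [
    {"group_id": 1, "disease": "pneumonia", "count": 2},
    {"group_id": 1, "disease": "dyspepsia", "count": 2},
    {"group_id": 2, "disease": "flu", "count": 2},
    {"group_id": 2, "disease": "gastritis", "count": 1},
    {"group_id": 2, "disease": "bronchitis", "count": 1},
]


def estimate_count(qit, st, disease, age_range, zip_range):
    """Sum over groups of count(disease) * fraction of QIT rows matching the QI ranges."""
    total = Fraction(0)
    for entry in st:
        if entry["disease"] != disease:
            continue
        members = [row for row in qit if row["group_id"] == entry["group_id"]]
        hits = sum(
            age_range[0] <= row["age"] <= age_range[1]
            and zip_range[0] <= row["zipcode"] <= zip_range[1]
            for row in members
        )
        total += entry["count"] * Fraction(hits, len(members))
    return total


estimate = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
```

Because the QIT keeps exact QI values, the fraction 2 / 4 is counted directly instead of being derived from a uniformity assumption over a rectangle.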
60
A defect of anatomy
  • Existence breach: does an individual exist in the
    microdata?

61
Future work
  • Re-publication
  • Tackle stronger background knowledge
  • Recent work: Martin et al., ICDE, 2007
  • Improving utility
  • Pioneering work: Kifer and Gehrke, SIGMOD, 2006
  • Application to specific (non-trivial)
    applications
  • Location privacy
  • Pioneering work: Mokbel et al., VLDB, 2006