Sumathie Sundaresan - PowerPoint PPT Presentation

About This Presentation
Title:

Sumathie Sundaresan

Description:

Title: K-Anonymity Author: Ge Ruan Created Date: 10/12/2006 5:13:14 PM Document presentation format: On-screen Show (4:3) Company: Department of Computer Science – PowerPoint PPT presentation

Slides: 72
Provided by: GeRu2

1
Survey of Privacy Protection for Medical Data
  • Sumathie Sundaresan
  • Advisor Dr. Huiping Guo

2
Abstract
  • Expanded scientific knowledge, combined with the
    development of the Internet and the widespread use of
    computers, has increased the need for strong
    privacy protection for medical records. We have
    all heard stories of harassment resulting
    from the lack of adequate privacy
    protection of medical records.
  • "...medical information is routinely shared with
    and viewed by third parties who are not involved
    in patient care .... The American Medical Records
    Association has identified twelve categories of
    information seekers outside of the health care
    industry who have access to health care files,
    including employers, government agencies, credit
    bureaus, insurers, educational institutions, and
    the media."

3
Methods
  • Generalization
  • k-anonymity
  • l-diversity
  • t-closeness
  • m-invariance
  • Personalized Privacy Preservation
  • Anatomy

4
Privacy preserving data publishing
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
Alice 22 14000 bronchitis
Andy 24 18000 flu
David 23 25000 gastritis
Gary 41 20000 flu
Helen 36 27000 gastritis
Jane 37 33000 dyspepsia
Ken 40 35000 flu
Linda 43 26000 gastritis
Paul 52 33000 dyspepsia
Steve 56 34000 gastritis
  • Microdata

5
Classification of Attributes
  • Key Attribute
  • Name, Address, Cell Phone
  • which can uniquely identify an individual
    directly
  • Always removed before release.
  • Quasi-Identifier
  • 5-digit ZIP code, birth date, gender
  • A set of attributes that can potentially be
    linked with external information to re-identify
    entities
  • 87% of the U.S. population can be uniquely
    identified based on these attributes, according
    to the Census summary data in 1991.
  • Suppressed or generalized

6
Classification of Attributes(Contd)
  • Sensitive Attribute
  • Medical record, wage, etc.
  • Always released directly. These attributes are
    what the researchers need. What counts as
    sensitive depends on the requirement.

7
Inference attack

Published table
Age Zipcode Disease
21 12000 dyspepsia
22 14000 bronchitis
24 18000 flu
23 25000 gastritis
41 20000 flu
36 27000 gastritis
37 33000 dyspepsia
40 35000 flu
43 26000 gastritis
52 33000 dyspepsia
56 34000 gastritis
An adversary
Name Age Zipcode
Bob 21 12000
Quasi-identifier (QI) attributes
8
Generalization
  • Transform the QI values into less specific forms

Age Zipcode Disease
21 12000 dyspepsia
22 14000 bronchitis
24 18000 flu
23 25000 gastritis
41 20000 flu
36 27000 gastritis
37 33000 dyspepsia
40 35000 flu
43 26000 gastritis
52 33000 dyspepsia
56 34000 gastritis
Age Zipcode Disease
21, 22 12k, 14k dyspepsia
21, 22 12k, 14k bronchitis
23, 24 18k, 25k flu
23, 24 18k, 25k gastritis
36, 41 20k, 27k flu
36, 41 20k, 27k gastritis
37, 43 26k, 35k dyspepsia
37, 43 26k, 35k flu
37, 43 26k, 35k gastritis
52, 56 33k, 34k dyspepsia
52, 56 33k, 34k gastritis
generalize
9
Generalization
  • Transform each QI value into a less specific form

A generalized table
Age Zipcode Disease
21, 22 12k, 14k dyspepsia
21, 22 12k, 14k bronchitis
23, 24 18k, 25k flu
23, 24 18k, 25k gastritis
36, 41 20k, 27k flu
36, 41 20k, 27k gastritis
37, 43 26k, 35k dyspepsia
37, 43 26k, 35k flu
37, 43 26k, 35k gastritis
52, 56 33k, 34k dyspepsia
52, 56 33k, 34k gastritis
An adversary
Name Age Zipcode
Bob 21 12000
10
K-Anonymity
  • Sweeney proposed a formal protection model
    named k-anonymity
  • What is k-anonymity?
  • A release provides k-anonymity if the information
    for each person contained in the release cannot be
    distinguished from that of at least k-1 other
    individuals whose information also appears in
    the release.
  • Example.
  • Suppose you try to identify a man in a release, but
    the only information you have is his birth date
    and gender. If at least k people in the release
    match those values, the release satisfies
    k-anonymity for him.
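
The definition above can be checked mechanically. The following is a minimal sketch, assuming the release is a list of dictionaries; the function name `is_k_anonymous` and the column names are illustrative and do not come from the slides.

```python
from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    """Return True if every combination of QI values occurs at least k times."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(n >= k for n in counts.values())

# Generalized rows in the style of the generalized table shown earlier.
released = [
    {"age": "[21, 22]", "zipcode": "[12k, 14k]", "disease": "dyspepsia"},
    {"age": "[21, 22]", "zipcode": "[12k, 14k]", "disease": "bronchitis"},
    {"age": "[23, 24]", "zipcode": "[18k, 25k]", "disease": "flu"},
    {"age": "[23, 24]", "zipcode": "[18k, 25k]", "disease": "gastritis"},
]

print(is_k_anonymous(released, ["age", "zipcode"], k=2))  # True
print(is_k_anonymous(released, ["age", "zipcode"], k=3))  # False
```

Only the quasi-identifier columns are grouped; the sensitive column plays no part in the k-anonymity test.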

12
Attacks Against K-Anonymity
  • Unsorted Matching Attack
  • This attack is based on the order in which tuples
    appear in the released table.
  • Solution
  • Randomly sort the tuples before releasing.

13
Attacks Against K-Anonymity(Contd)
  • k-Anonymity does not provide privacy if
  • Sensitive values in an equivalence class lack
    diversity
  • The attacker has background knowledge

A 3-anonymous patient table
Zipcode Age Disease
476** 2* Heart Disease
476** 2* Heart Disease
476** 2* Heart Disease
4790* ≥40 Flu
4790* ≥40 Heart Disease
4790* ≥40 Cancer
476** 3* Heart Disease
476** 3* Cancer
476** 3* Cancer
Homogeneity Attack
Name Zipcode Age
Bob 47678 27
Background Knowledge Attack
Name Zipcode Age
Carl 47673 36
A. Machanavajjhala et al. l-Diversity: Privacy
Beyond k-Anonymity. ICDE 2006
14
l-Diversity
  • Distinct l-diversity
  • Each equivalence class has at least l
    well-represented sensitive values
  • Limitation
  • Example.
  • In one equivalence class, there are ten tuples.
    In the Disease attribute, one of them is Cancer,
    one is Heart Disease and the remaining eight
    are Flu. This satisfies 3-diversity, but the
    attacker can still conclude that the target
    person's disease is Flu with 80% accuracy.
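
The limitation in this example can be reproduced in a few lines. This is a small sketch; the helper names `distinct_diversity` and `top_value_share` are made up for illustration.

```python
from collections import Counter

def distinct_diversity(sensitive_values):
    """Number of distinct sensitive values in one equivalence class."""
    return len(set(sensitive_values))

def top_value_share(sensitive_values):
    """Share of the most frequent sensitive value (the attacker's best guess)."""
    counts = Counter(sensitive_values)
    return max(counts.values()) / len(sensitive_values)

# The equivalence class from the example: ten tuples, eight of them Flu.
group = ["Cancer", "Heart Disease"] + ["Flu"] * 8

print(distinct_diversity(group))  # 3
print(top_value_share(group))     # 0.8
```

The class is distinct 3-diverse, yet guessing Flu is right for eight tuples out of ten.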

15
l-Diversity(Contd)
  • Entropy l-diversity
  • Each equivalence class not only must have enough
    different sensitive values, but also the
    different sensitive values must be distributed
    evenly enough.
  • Sometimes this may be too restrictive. When some
    values are very common, the entropy of the entire
    table may be very low. This leads to the less
    conservative notion of l-diversity.
  • Recursive (c,l)-diversity
  • The most frequent value does not appear too
    frequently
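
Both notions can be sketched as checks on a single equivalence class. The code below assumes the usual formulations from the cited l-diversity paper: entropy of the class at least log(l), and r1 < c (r_l + ... + r_m) for the value counts r1 >= ... >= r_m. The function names and the small tolerance are illustrative choices.

```python
import math
from collections import Counter

def entropy_l_diverse(values, l):
    """Entropy l-diversity: entropy of the class is at least log(l)."""
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l) - 1e-12  # tolerance for float rounding

def recursive_cl_diverse(values, c, l):
    """Recursive (c,l)-diversity: r1 < c * (r_l + ... + r_m)."""
    counts = sorted(Counter(values).values(), reverse=True)
    if len(counts) < l:
        return False
    return counts[0] < c * sum(counts[l - 1:])

skewed = ["Cancer", "Heart Disease"] + ["Flu"] * 8
balanced = ["Cancer", "Heart Disease", "Flu"] * 3

print(entropy_l_diverse(skewed, 2))          # False
print(entropy_l_diverse(balanced, 2))        # True
print(recursive_cl_diverse(skewed, 3, 2))    # False: 8 is not below 3 * (1 + 1)
print(recursive_cl_diverse(balanced, 2, 3))  # True: 3 is below 2 * 3
```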

16
Limitations of l-Diversity
l-diversity may be difficult and unnecessary to
achieve.
  • A single sensitive attribute
  • Two values: HIV positive (1%) and HIV negative
    (99%)
  • Very different degrees of sensitivity
  • l-diversity is unnecessary to achieve
  • 2-diversity is unnecessary for an equivalence
    class that contains only negative records
  • l-diversity is difficult to achieve
  • Suppose there are 10000 records in total
  • To have distinct 2-diversity, there can be at
    most 10000 × 1% = 100 equivalence classes

17
Limitations of l-Diversity(Contd)
l-diversity is insufficient to prevent attribute
disclosure.
Skewness Attack
  • Two sensitive values
  • HIV positive (1%) and HIV negative (99%)
  • Serious privacy risk
  • Consider an equivalence class that contains an
    equal number of positive records and negative
    records
  • l-diversity does not differentiate
  • Equivalence class 1: 49 positive + 1 negative
  • Equivalence class 2: 1 positive + 49 negative

l-diversity does not consider the overall
distribution of sensitive values
18
Limitations of l-Diversity(Contd)
l-diversity is insufficient to prevent attribute
disclosure.
A 3-diverse patient table
Zipcode Age Salary Disease
476** 2* 3K Gastric Ulcer
476** 2* 4K Gastritis
476** 2* 5K Stomach Cancer
4790* ≥40 6K Gastritis
4790* ≥40 11K Flu
4790* ≥40 8K Bronchitis
476** 3* 7K Bronchitis
476** 3* 9K Pneumonia
476** 3* 10K Stomach Cancer
Similarity Attack
Name Zip Age
Bob 47678 27
  • Conclusion
  • Bob's salary is in [3K, 5K], which is relatively
    low.
  • Bob has some stomach-related disease.

l-diversity does not consider semantic meanings
of sensitive values
19
t-Closeness: A New Privacy Measure
  • Rationale

A completely generalized table
Age Zipcode Gender Disease
* * * Flu
* * * Heart Disease
* * * Cancer
. . . . . . . . . . . .
* * * Gastritis
Belief Knowledge
B0 External knowledge
B1 Overall distribution Q of sensitive values
20
t-Closeness: A New Privacy Measure
  • Rationale

A released table
Age Zipcode Gender Disease
2* 479** Male Flu
2* 479** Male Heart Disease
2* 479** Male Cancer
. . . . . . . . . . . .
≥50 4766* * Gastritis
Belief Knowledge
B0 External knowledge
B1 Overall distribution Q of sensitive values
B2 Distribution Pi of sensitive values in each equi-class
21
t-Closeness: A New Privacy Measure
  • Rationale
  • Observations
  • Q should be public
  • Knowledge gain in two parts
  • Whole population (from B0 to B1)
  • Specific individuals (from B1 to B2)
  • We bound knowledge gain between B1 and B2 instead
  • Principle
  • The distance between Q and Pi should be bounded
    by a threshold t.

Belief Knowledge
B0 External knowledge
B1 Overall distribution Q of sensitive values
B2 Distribution Pi of sensitive values in each equi-class
22
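How to calculate EMD
  • EMD for numerical attributes
  • Ordered distance is a metric
  • Non-negativity, symmetry, triangle inequality
  • Let $r_i = p_i - q_i$; then $D[P, Q]$ is calculated as
    $D[P, Q] = \frac{1}{m-1} \sum_{i=1}^{m} \left| \sum_{j=1}^{i} r_j \right|$,
    where $m$ is the number of ordered attribute values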

23
Earth Mover's Distance
  • Example
  • P1 = {3k, 4k, 5k} and Q = {3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k}
  • Move 1/9 probability for each of the following
    pairs
  • 3k->5k, 3k->4k: cost 1/9 × (2+1)/8
  • 4k->8k, 4k->7k, 4k->6k: cost 1/9 × (4+3+2)/8
  • 5k->11k, 5k->10k, 5k->9k: cost 1/9 × (6+5+4)/8
  • Total cost: 1/9 × 27/8 = 0.375
  • With P2 = {6k, 8k, 11k}, the total cost
    is 0.167 < 0.375. This makes more sense than the
    other two distance measures.
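
The two totals in this example can be recomputed with the ordered-distance form of EMD: the sum of absolute running sums of p_i - q_i, divided by m - 1. This is a sketch using exact fractions; the names `ordered_emd` and `distribution` are illustrative.

```python
from fractions import Fraction

def ordered_emd(p, q):
    """EMD under ordered distance for two distributions over the same m ordered values."""
    m = len(p)
    running = Fraction(0)
    total = Fraction(0)
    for p_i, q_i in zip(p, q):
        running += p_i - q_i   # cumulative sum r_1 + ... + r_i
        total += abs(running)
    return total / (m - 1)

salaries = list(range(3, 12))  # 3k, 4k, ..., 11k

def distribution(values):
    """Empirical distribution of `values` over the ordered salary domain."""
    return [Fraction(values.count(s), len(values)) for s in salaries]

q = distribution(salaries)      # overall distribution Q
p1 = distribution([3, 4, 5])    # P1 = {3k, 4k, 5k}
p2 = distribution([6, 8, 11])   # P2 = {6k, 8k, 11k}

print(ordered_emd(p1, q))  # 3/8, i.e. 0.375
print(ordered_emd(p2, q))  # 1/6, i.e. about 0.167
```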

24
Motivating Example
  • A hospital keeps track of the medical records
    collected in the last three months.
  • The microdata table T(1) and its generalization
    T*(1) were published in Apr. 2007.

Name Age Zipcode Disease
Bob 21 12000 dyspepsia
Alice 22 14000 bronchitis
Andy 24 18000 flu
David 23 25000 gastritis
Gary 41 20000 flu
Helen 36 27000 gastritis
Jane 37 33000 dyspepsia
Ken 40 35000 flu
Linda 43 26000 gastritis
Paul 52 33000 dyspepsia
Steve 56 34000 gastritis
G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis
2 23, 24 18k, 25k flu
2 23, 24 18k, 25k gastritis
3 36, 41 20k, 27k flu
3 36, 41 20k, 27k gastritis
4 37, 43 26k, 35k dyspepsia
4 37, 43 26k, 35k flu
4 37, 43 26k, 35k gastritis
5 52, 56 33k, 34k dyspepsia
5 52, 56 33k, 34k gastritis
2-diverse Generalization T(1)
Microdata T(1)
25
Motivating Example
  • Bob was hospitalized in Mar. 2007

G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis
2 23, 24 18k, 25k flu
2 23, 24 18k, 25k gastritis
3 36, 41 20k, 27k flu
3 36, 41 20k, 27k gastritis
4 37, 43 26k, 35k dyspepsia
4 37, 43 26k, 35k flu
4 37, 43 26k, 35k gastritis
5 52, 56 33k, 34k dyspepsia
5 52, 56 33k, 34k gastritis
Name Age Zipcode
Bob 21 12000
2-diverse Generalization T(1)
27
Motivating Example
  • One month later, in May 2007
  • Some obsolete tuples are deleted from the
    microdata.

Name Age Zipcode Disease
Bob 21 12000 dyspepsia
Alice 22 14000 bronchitis
Andy 24 18000 flu
David 23 25000 gastritis
Gary 41 20000 flu
Helen 36 27000 gastritis
Jane 37 33000 dyspepsia
Ken 40 35000 flu
Linda 43 26000 gastritis
Paul 52 33000 dyspepsia
Steve 56 34000 gastritis
Microdata T(1)
28
Motivating Example
  • Bob's tuple stays.

Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Gary 41 20000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Steve 56 34000 gastritis
Microdata T(1)
29
Motivating Example
  • Some new records are inserted.

Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Emily 25 21000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Gary 41 20000 flu
Mary 46 30000 gastritis
Ray 54 31000 dyspepsia
Steve 56 34000 gastritis
Tom 60 44000 gastritis
Vince 65 36000 flu
Microdata T(2)
30
Motivating Example
  • The hospital published T(2).

G. ID Age Zipcode Disease
1 21, 23 12k, 25k dyspepsia
1 21, 23 12k, 25k gastritis
2 25, 43 21k, 33k flu
2 25, 43 21k, 33k dyspepsia
2 25, 43 21k, 33k gastritis
3 41, 46 20k, 30k flu
3 41, 46 20k, 30k gastritis
4 54, 56 31k, 34k dyspepsia
4 54, 56 31k, 34k gastritis
5 60, 65 36k, 44k gastritis
5 60, 65 36k, 44k flu
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Emily 25 21000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Gary 41 20000 flu
Mary 46 30000 gastritis
Ray 54 31000 dyspepsia
Steve 56 34000 gastritis
Tom 60 44000 gastritis
Vince 65 36000 flu
2-diverse Generalization T(2)
Microdata T(2)
31
Motivating Example
  • Consider the previous adversary.

G. ID Age Zipcode Disease
1 21, 23 12k, 25k dyspepsia
1 21, 23 12k, 25k gastritis
2 25, 43 21k, 33k flu
2 25, 43 21k, 33k dyspepsia
2 25, 43 21k, 33k gastritis
3 41, 46 20k, 30k flu
3 41, 46 20k, 30k gastritis
4 54, 56 31k, 34k dyspepsia
4 54, 56 31k, 34k gastritis
5 60, 65 36k, 44k gastritis
5 60, 65 36k, 44k flu
Name Age Zipcode
Bob 21 12000
2-diverse Generalization T(2)
32
Motivating Example
  • What the adversary learns from T(1).
  • What the adversary learns from T(2).
  • So Bob must have contracted dyspepsia!
  • A new generalization principle is needed.
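
The adversary's reasoning amounts to intersecting the candidate diseases from the two releases. The sketch below uses only the relevant rows of both generalizations; the tuple layout and the function name `candidates` are assumptions made for illustration.

```python
def candidates(release, age, zipcode):
    """Diseases listed in every QI-group whose ranges cover the victim."""
    found = set()
    for (age_lo, age_hi), (zip_lo, zip_hi), disease in release:
        if age_lo <= age <= age_hi and zip_lo <= zipcode <= zip_hi:
            found.add(disease)
    return found

# Rows are (age range, zipcode range, disease).
release_1 = [
    ((21, 22), (12000, 14000), "dyspepsia"),
    ((21, 22), (12000, 14000), "bronchitis"),
    ((23, 24), (18000, 25000), "flu"),
    ((23, 24), (18000, 25000), "gastritis"),
]
release_2 = [
    ((21, 23), (12000, 25000), "dyspepsia"),
    ((21, 23), (12000, 25000), "gastritis"),
    ((25, 43), (21000, 33000), "flu"),
]

bob = (21, 12000)
print(candidates(release_1, *bob) & candidates(release_2, *bob))  # {'dyspepsia'}
```

Each release on its own leaves two candidates for Bob; only their intersection is a single disease.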

G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis

Name Age Zipcode
Bob 21 12000
G. ID Age Zipcode Disease
1 21, 23 12k, 25k dyspepsia
1 21, 23 12k, 25k gastritis

Name Age Zipcode
Bob 21 12000
33
The critical absence phenomenon
Microdata T(2)
What the adversary learns from T(1)
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Emily 25 21000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Gary 41 20000 flu
Mary 46 30000 gastritis
Ray 54 31000 dyspepsia
Steve 56 34000 gastritis
Tom 60 44000 gastritis
Vince 65 36000 flu
Name Age Zipcode
Bob 21 12000
G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis
  • We refer to this as the critical
    absence phenomenon
  • A new generalization method is needed.

34
Name Group-ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
c1 1 21, 22 12k, 14k bronchitis
David 2 23, 25 21k, 25k gastritis
Emily 2 23, 25 21k, 25k flu
Jane 3 37, 43 26k, 33k dyspepsia
c2 3 37, 43 26k, 33k flu
Linda 3 37, 43 26k, 33k gastritis
Gary 4 41, 46 20k, 30k flu
Mary 4 41, 46 20k, 30k gastritis
Ray 5 54, 56 31k, 34k dyspepsia
Steve 5 54, 56 31k, 34k gastritis
Tom 6 60, 65 36k, 44k gastritis
Vince 6 60, 65 36k, 44k flu
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Emily 25 21000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Gary 41 20000 flu
Mary 46 30000 gastritis
Ray 54 31000 dyspepsia
Steve 56 34000 gastritis
Tom 60 44000 gastritis
Vince 65 36000 flu
Microdata T(2)
Counterfeited generalization T(2)
Group-ID Count
1 1
3 1
The auxiliary relation R(2) for T(2)
35
Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
c1 1 21, 22 12k, 14k bronchitis
David 2 23, 25 21k, 25k gastritis
Emily 2 23, 25 21k, 25k flu
Jane 3 37, 43 26k, 33k dyspepsia
c2 3 37, 43 26k, 33k flu
Linda 3 37, 43 26k, 33k gastritis
Gary 4 41, 46 20k, 30k flu
Mary 4 41, 46 20k, 30k gastritis
Ray 5 54, 56 31k, 34k dyspepsia
Steve 5 54, 56 31k, 34k gastritis
Tom 6 60, 65 36k, 44k gastritis
Vince 6 60, 65 36k, 44k flu
Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
Alice 1 21, 22 12k, 14k bronchitis
Andy 2 23, 24 18k, 25k flu
David 2 23, 24 18k, 25k gastritis
Gary 3 36, 41 20k, 27k flu
Helen 3 36, 41 20k, 27k gastritis
Jane 4 37, 43 26k, 35k dyspepsia
Ken 4 37, 43 26k, 35k flu
Linda 4 37, 43 26k, 35k gastritis
Paul 5 52, 56 33k, 34k dyspepsia
Steve 5 52, 56 33k, 34k gastritis
Counterfeited Generalization T(2)
Generalization T(1)
Group-ID Count
1 1
3 1
Name Age Zipcode
Bob 21 12000
The auxiliary relation R(2) for T(2)
36
m-uniqueness
  • A generalized table T(j) is m-unique, if and
    only if
  • each QI-group in T(j) contains at least m
    tuples
  • all tuples in the same QI-group have different
    sensitive values.
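
The two conditions translate directly into a check. A minimal sketch, assuming the table is given as (group ID, disease) pairs; `is_m_unique` is a hypothetical helper name.

```python
from collections import defaultdict

def is_m_unique(table, m):
    """True if every QI-group has at least m tuples, all with different sensitive values."""
    groups = defaultdict(list)
    for group_id, disease in table:
        groups[group_id].append(disease)
    return all(len(v) >= m and len(set(v)) == len(v) for v in groups.values())

# (G. ID, Disease) pairs of the generalized table on this slide.
generalized = [
    (1, "dyspepsia"), (1, "bronchitis"),
    (2, "flu"), (2, "gastritis"),
    (3, "flu"), (3, "gastritis"),
    (4, "dyspepsia"), (4, "flu"), (4, "gastritis"),
    (5, "dyspepsia"), (5, "gastritis"),
]

print(is_m_unique(generalized, 2))  # True
print(is_m_unique(generalized, 3))  # False
```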

G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis
2 23, 24 18k, 25k flu
2 23, 24 18k, 25k gastritis
3 36, 41 20k, 27k flu
3 36, 41 20k, 27k gastritis
4 37, 43 26k, 35k dyspepsia
4 37, 43 26k, 35k flu
4 37, 43 26k, 35k gastritis
5 52, 56 33k, 34k dyspepsia
5 52, 56 33k, 34k gastritis
A 2-unique generalized table
37
Signature
  • The signature of Bob in T*(1) is {dyspepsia,
    bronchitis}
  • The signature of Jane in T*(1) is {dyspepsia,
    flu, gastritis}

Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
Alice 1 21, 22 12k, 14k bronchitis

Jane 4 37, 43 26k, 35k dyspepsia
Ken 4 37, 43 26k, 35k flu
Linda 4 37, 43 26k, 35k gastritis

T(1)
38
The m-invariance principle
  • A sequence of generalized tables T*(1), ..., T*(n)
    is m-invariant if and only if
  • T*(1), ..., T*(n) are all m-unique, and
  • each individual has the same signature in every
    generalized table in which he or she appears.
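
The signature condition can be sketched as follows. The code covers only the second requirement (equal signatures across releases) and assumes each table is a list of (name, group ID, disease) rows, with counterfeit tuples marked by a name of `None`; these conventions and the function names are illustrative.

```python
def signatures(table):
    """Map each real individual to the set of sensitive values in his or her QI-group."""
    by_group = {}
    for _, group_id, disease in table:
        by_group.setdefault(group_id, set()).add(disease)
    return {name: frozenset(by_group[g]) for name, g, _ in table if name is not None}

def same_signatures(tables):
    """True if no individual's signature changes between releases."""
    seen = {}
    for table in tables:
        for name, sig in signatures(table).items():
            if seen.setdefault(name, sig) != sig:
                return False
    return True

first = [("Bob", 1, "dyspepsia"), ("Alice", 1, "bronchitis"),
         ("Jane", 4, "dyspepsia"), ("Ken", 4, "flu"), ("Linda", 4, "gastritis")]
# Second release without counterfeits: Bob is grouped with David.
naive = [("Bob", 1, "dyspepsia"), ("David", 1, "gastritis")]
# Second release with counterfeits c1 and c2 standing in for Alice and Ken.
counterfeited = [("Bob", 1, "dyspepsia"), (None, 1, "bronchitis"),
                 ("Jane", 3, "dyspepsia"), (None, 3, "flu"), ("Linda", 3, "gastritis")]

print(same_signatures([first, naive]))          # False
print(same_signatures([first, counterfeited]))  # True
```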

39

Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
c1 1 21, 22 12k, 14k bronchitis
David 2 23, 25 21k, 25k gastritis
Emily 2 23, 25 21k, 25k flu
Jane 3 37, 43 26k, 33k dyspepsia
c2 3 37, 43 26k, 33k flu
Linda 3 37, 43 26k, 33k gastritis
Gary 4 41, 46 20k, 30k flu
Mary 4 41, 46 20k, 30k gastritis
Ray 5 54, 56 31k, 34k dyspepsia
Steve 5 54, 56 31k, 34k gastritis
Tom 6 60, 65 36k, 44k gastritis
Vince 6 60, 65 36k, 44k flu
Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
Alice 1 21, 22 12k, 14k bronchitis
Andy 2 23, 24 18k, 25k flu
David 2 23, 24 18k, 25k gastritis
Gary 3 36, 41 20k, 27k flu
Helen 3 36, 41 20k, 27k gastritis
Jane 4 37, 43 26k, 35k dyspepsia
Ken 4 37, 43 26k, 35k flu
Linda 4 37, 43 26k, 35k gastritis
Paul 5 52, 56 33k, 34k dyspepsia
Steve 5 52, 56 33k, 34k gastritis
Generalization T(2)
Generalization T(1)
42
Motivation 1: Personalization
  • Andy does not want anyone to know that he had a
    stomach problem
  • Sarah does not mind at all if others find out
    that she had flu

A 2-diverse table
An external database
Name Age Sex Zipcode
Andy 4 M 12000
Bill 5 M 14000
Ken 6 M 18000
Nash 9 M 19000
Mike 7 M 17000
Alice 12 F 22000
Betty 19 F 24000
Linda 21 F 33000
Jane 25 F 34000
Sarah 28 F 37000
Mary 56 F 58000
Age Sex Zipcode Disease
1, 5 M 10001, 15000 gastric ulcer
1, 5 M 10001, 15000 dyspepsia
6, 10 M 15001, 20000 pneumonia
6, 10 M 15001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21, 60 F 30001, 60000 gastritis
21, 60 F 30001, 60000 gastritis
21, 60 F 30001, 60000 flu
21, 60 F 30001, 60000 flu
43
Motivation 2: SA generalization
  • How many female patients are there with age above
    30?
  • Estimated answer: 4 × (60 − 30) / (60 − 20) = 3
  • Real answer: 1
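
The estimate comes from assuming that ages are spread uniformly over the generalized interval. A minimal sketch of that calculation, with the interval treated as (20, 60] as in the arithmetic above; the function name is hypothetical.

```python
def estimated_count(tuples_in_group, group_range, query_range):
    """Expected number of tuples in the query range under a uniform assumption."""
    lo, hi = group_range
    q_lo, q_hi = query_range
    overlap = max(0, min(hi, q_hi) - max(lo, q_lo))
    return tuples_in_group * overlap / (hi - lo)

# Four female tuples generalized to ages (20, 60]; the query asks for age above 30.
print(estimated_count(4, (20, 60), (30, 60)))  # 3.0
```

The estimate of 3 is far from the real answer of 1 because the actual ages (21, 25, 28, 56) are not uniform over the interval.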

An external database
A generalized table
Name Age Sex Zipcode
Andy 4 M 12000
Bill 5 M 14000
Ken 6 M 18000
Nash 9 M 19000
Mike 7 M 17000
Alice 12 F 22000
Betty 19 F 24000
Linda 21 F 33000
Jane 25 F 34000
Sarah 28 F 37000
Mary 56 F 58000
Age Sex Zipcode Disease
1, 5 M 10001, 15000 gastric ulcer
1, 5 M 10001, 15000 dyspepsia
6, 10 M 15001, 20000 pneumonia
6, 10 M 15001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21, 60 F 30001, 60000 gastritis
21, 60 F 30001, 60000 gastritis
21, 60 F 30001, 60000 flu
21, 60 F 30001, 60000 flu
44
Motivation 2: SA generalization (cont.)
  • Generalization of the sensitive attribute is
    beneficial in this case

A better generalized table
An external database
Age Sex Zipcode Disease
1, 5 M 10001, 15000 gastric ulcer
1, 5 M 10001, 15000 dyspepsia
6, 10 M 15001, 20000 pneumonia
6, 10 M 15001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21, 30 F 30001, 40000 gastritis
21, 30 F 30001, 40000 gastritis
21, 30 F 30001, 40000 flu
56 F 58000 respiratory infection
Name Age Sex Zipcode
Andy 4 M 12000
Bill 5 M 14000
Ken 6 M 18000
Nash 9 M 19000
Mike 7 M 17000
Alice 12 F 22000
Betty 19 F 24000
Linda 21 F 33000
Jane 25 F 34000
Sarah 28 F 37000
Mary 56 F 58000
45
Personalized anonymity
  • We propose
  • a mechanism to capture personalized privacy
    requirements
  • criteria for measuring the degree of security
    provided by a generalized table

46
Guarding node
  • Andy does not want anyone to know that he had a
    stomach problem
  • He can specify stomach disease as the guarding
    node for his tuple
  • The data publisher should prevent an adversary
    from associating Andy with stomach disease

Name Age Sex Zipcode Disease guarding node
Andy 4 M 12000 gastric ulcer stomach disease
47
Guarding node
  • Sarah is willing to disclose her exact symptom
  • She can specify Ø as the guarding node for her
    tuple

Name Age Sex Zipcode Disease guarding node
Sarah 28 F 37000 flu Ø
48
Guarding node
  • Bill does not have any special preference
  • He can specify the guarding node for his tuple to
    be the same as his sensitive value

Name Age Sex Zipcode Disease guarding node
Bill 5 M 14000 dyspepsia dyspepsia
49
A personalized approach
Name Age Sex Zipcode Disease guarding node
Andy 4 M 12000 gastric ulcer stomach disease
Bill 5 M 14000 dyspepsia dyspepsia
Ken 6 M 18000 pneumonia respiratory infection
Nash 9 M 19000 bronchitis bronchitis
Alice 12 F 22000 flu flu
Betty 19 F 24000 pneumonia pneumonia
Linda 21 F 33000 gastritis gastritis
Jane 25 F 34000 gastritis Ø
Sarah 28 F 37000 flu Ø
Mary 56 F 58000 flu flu
50
Personalized anonymity
Name Age Sex Zipcode Disease guarding node
Andy 4 M 12000 gastric ulcer stomach disease
Bill 5 M 14000 dyspepsia dyspepsia
Ken 6 M 18000 pneumonia respiratory infection
Nash 9 M 19000 bronchitis bronchitis
Alice 12 F 22000 flu flu
Betty 19 F 24000 pneumonia pneumonia
Linda 21 F 33000 gastritis gastritis
Jane 25 F 34000 gastritis Ø
Sarah 28 F 37000 flu Ø
Mary 56 F 58000 flu flu
  • A table satisfies personalized anonymity with a
    parameter p_breach
  • iff no adversary can breach the privacy
    requirement of any tuple with a probability above
    p_breach
  • If p_breach = 0.3, then any adversary should have
    no more than a 30% probability of finding out that
  • Andy had a stomach disease
  • Bill had dyspepsia
  • etc.

51
Personalized anonymity
  • Personalized anonymity with respect to a
    predefined parameter p_breach
  • an adversary can breach the privacy requirement
    of any tuple with a probability of at most p_breach
  • We need a method for calculating the breach
    probabilities

Age Sex Zipcode Disease
1, 10 M 10001, 20000 gastric ulcer
1, 10 M 10001, 20000 dyspepsia
1, 10 M 10001, 20000 pneumonia
1, 10 M 10001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21 F 33000 stomach disease
25 F 34000 gastritis
28 F 37000 flu
56 F 58000 respiratory infection
What is the probability that Andy had some
stomach problem?
52
Combinatorial reconstruction
  • Assumptions
  • the adversary has no prior knowledge about each
    individual
  • every individual involved in the microdata also
    appears in the external database

53
Combinatorial reconstruction
  • Andy does not want anyone to know that he had
    some stomach problem
  • What is the probability that the adversary can
    find out that Andy had a stomach disease?

Name Age Sex Zipcode
Andy 4 M 12000
Bill 5 M 14000
Ken 6 M 18000
Nash 9 M 19000
Mike 7 M 17000
Alice 12 F 22000
Betty 19 F 24000
Linda 21 F 33000
Jane 25 F 34000
Sarah 28 F 37000
Mary 56 F 58000
Age Sex Zipcode Disease
1, 10 M 10001, 20000 gastric ulcer
1, 10 M 10001, 20000 dyspepsia
1, 10 M 10001, 20000 pneumonia
1, 10 M 10001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21 F 33000 stomach disease
25 F 34000 gastritis
28 F 37000 flu
56 F 58000 respiratory infection
54
Combinatorial reconstruction (cont.)
  • Can each individual appear more than once?
  • No: the primary case
  • Yes: the non-primary case
  • Some possible reconstructions

(Figure: possible reconstructions in the primary case and in the non-primary case, linking Andy, Bill, Ken, Nash and Mike to gastric ulcer, dyspepsia, pneumonia and bronchitis)
56
Breach probability (primary)
(Figure: Andy, Bill, Ken, Nash and Mike linked to gastric ulcer, dyspepsia, pneumonia and bronchitis)
  • 120 possible reconstructions in total
  • Suppose Andy is associated with a stomach disease in
    n_b reconstructions
  • The probability that the adversary
    associates Andy with some stomach problem is n_b /
    120
  • Andy is associated with
  • gastric ulcer in 24 reconstructions
  • dyspepsia in 24 reconstructions
  • gastritis in 0 reconstructions
  • n_b = 48
  • The breach probability for Andy's tuple is 48 /
    120 = 2 / 5
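
The counts for the primary case can be verified by brute-force enumeration. In this sketch a reconstruction assigns a distinct owner to each of the four tuples in the QI-group; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # Andy's guarding node

# Primary case: each person owns at most one tuple, so owners are ordered
# selections of 4 distinct people out of 5.
reconstructions = list(permutations(people, len(diseases)))

breaches = sum(
    1
    for owners in reconstructions
    if any(owner == "Andy" and d in stomach for owner, d in zip(owners, diseases))
)

print(len(reconstructions))                       # 120
print(breaches)                                   # 48
print(Fraction(breaches, len(reconstructions)))   # 2/5
```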

57
Breach probability (non-primary)
(Figure: Andy, Bill, Ken, Nash and Mike linked to gastric ulcer, dyspepsia, pneumonia and bronchitis)
  • 625 possible reconstructions in total
  • Andy is associated with gastric ulcer or
    dyspepsia or gastritis in 225 reconstructions
  • n_b = 225
  • The breach probability for Andy's tuple is
    225 / 625 = 9 / 25
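
The non-primary case differs only in that one person may own several tuples, so owners are drawn with repetition. A brute-force sketch under the same illustrative naming:

```python
from fractions import Fraction
from itertools import product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # Andy's guarding node

# Non-primary case: every tuple can belong to any of the five people.
reconstructions = list(product(people, repeat=len(diseases)))

breaches = sum(
    1
    for owners in reconstructions
    if any(owner == "Andy" and d in stomach for owner, d in zip(owners, diseases))
)

print(len(reconstructions))                       # 625
print(breaches)                                   # 225
print(Fraction(breaches, len(reconstructions)))   # 9/25
```

Equivalently, Andy avoids both stomach tuples in 4 × 4 × 5 × 5 = 400 reconstructions, leaving 625 − 400 = 225 breaches.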

58
Defect of generalization
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]

Age Sex Zipcode Disease
21, 60 M 10001, 60000 pneumonia
21, 60 M 10001, 60000 dyspepsia
21, 60 M 10001, 60000 dyspepsia
21, 60 M 10001, 60000 pneumonia
61, 70 F 10001, 60000 flu
61, 70 F 10001, 60000 gastritis
61, 70 F 10001, 60000 flu
61, 70 F 10001, 60000 bronchitis
  • Estimated answer: 2 × p, where p is the
    probability that each of the two tuples satisfies
    the query conditions

59
Defect of generalization (cont.)
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]
  • p = Area(R1 ∩ Q) / Area(R1) = 0.05
  • Estimated answer for query A: 2 × p = 0.1
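
The value of p can be reproduced as an overlap ratio between the group's rectangle R1 and the query rectangle Q. The sketch below counts integer points with inclusive bounds, which is an assumption about how the areas are measured; the function name is hypothetical.

```python
from fractions import Fraction

def overlap_fraction(group_range, query_range):
    """Fraction of the integer points in group_range that also lie in query_range."""
    lo, hi = group_range
    q_lo, q_hi = query_range
    overlap = max(0, min(hi, q_hi) - max(lo, q_lo) + 1)
    return Fraction(overlap, hi - lo + 1)

# R1: Age [21, 60] x Zipcode [10001, 60000]; Q: Age [0, 30] x Zipcode [10001, 20000].
p = overlap_fraction((21, 60), (0, 30)) * overlap_fraction((10001, 60000), (10001, 20000))

print(p)      # 1/20, i.e. 0.05
print(2 * p)  # 1/10, i.e. 0.1 for the two pneumonia tuples
```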

Age Sex Zipcode Disease
21, 60 M 10001, 60000 pneumonia
21, 60 M 10001, 60000 pneumonia
60
Defect of generalization (cont.)
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]
  • Estimated answer from the generalized table: 0.1
  • The exact answer should be 1

Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
61
Basic Idea of Anatomy
  • For a given microdata table, Anatomy releases a
    quasi-identifier table (QIT) and a sensitive
    table (ST)

Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
Sensitive Table (ST)
Quasi-identifier Table (QIT)
microdata
62
Basic Idea of Anatomy (cont.)
  • 1. Select a partition of the tuples

Age Sex Zipcode Disease

23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia

61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
QI group 1
QI group 2
a 2-diverse partition
63
Basic Idea of Anatomy (cont.)
  • 2. Generate a quasi-identifier table (QIT) and a
    sensitive table (ST) based on the selected
    partition
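
The two steps (select a partition, then emit a QIT and an ST) can be sketched as follows. The partition is taken as given, since choosing an l-diverse partition is a separate problem; the function name `anatomize` and the row layout are illustrative.

```python
from collections import Counter

# Microdata rows: (age, sex, zipcode, disease).
microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
# Step 1: a 2-diverse partition, given as lists of row indexes per QI-group.
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]

def anatomize(rows, partition):
    """Step 2: build the quasi-identifier table and the sensitive table."""
    qit, st = [], []
    for group_id, members in enumerate(partition, start=1):
        counts = Counter()
        for i in members:
            age, sex, zipcode, disease = rows[i]
            qit.append((age, sex, zipcode, group_id))  # exact QI values kept
            counts[disease] += 1
        st.extend((group_id, d, n) for d, n in sorted(counts.items()))
    return qit, st

qit, st = anatomize(microdata, partition)
for row in st:
    print(row)
```

The QIT keeps exact quasi-identifier values, while the ST records only per-group disease counts.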

Disease

pneumonia
dyspepsia
dyspepsia
pneumonia

flu
gastritis
flu
bronchitis
Age Sex Zipcode

23 M 11000
27 M 13000
35 M 59000
59 M 12000

61 F 54000
65 F 25000
65 F 25000
70 F 30000
group 1
group 2
quasi-identifier table (QIT)
sensitive table (ST)
64
Basic Idea of Anatomy (cont.)
  • 2. Generate a quasi-identifier table (QIT) and a
    sensitive table (ST) based on the selected
    partition

Group-ID Disease

1 pneumonia
1 dyspepsia
1 dyspepsia
1 pneumonia

2 flu
2 gastritis
2 flu
2 bronchitis
Age Sex Zipcode Group-ID

23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1

61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
quasi-identifier table (QIT)
sensitive table (ST)
65
Basic Idea of Anatomy (cont.)
  • 2. Generate a quasi-identifier table (QIT) and a
    sensitive table (ST) based on the selected
    partition

Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
66
Privacy Preservation
  • From a pair of QIT and ST generated from an
    l-diverse partition, the adversary can infer the
    sensitive value of each individual with
    confidence at most 1/l

Name Age Sex Zipcode
Bob 23 M 11000
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
67
Accuracy of Data Analysis
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]

Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
68
Accuracy of Data Analysis (cont.)
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
  • WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  • AND Zipcode IN [10001, 20000]
  • 2 patients have contracted pneumonia
  • 2 out of 4 patients satisfy the query condition
    on Age and Zipcode
  • Estimated answer for query A: 2 × 2 / 4 = 1,
    which is also the actual result from the original
    microdata
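
The same estimate can be computed from the published pair of tables. This is a sketch assuming the QIT and ST layouts shown on the slides; `estimate_count` is a hypothetical helper.

```python
# QIT rows: (age, sex, zipcode, group_id); ST rows: (group_id, disease, count).
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [(1, "dyspepsia", 2), (1, "pneumonia", 2),
      (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1)]

def estimate_count(qit, st, disease, age_range, zip_range):
    """Per group: (tuples with the disease) x (share of members matching the QI predicate)."""
    total = 0.0
    for g in sorted({row[3] for row in qit}):
        members = [row for row in qit if row[3] == g]
        matching = sum(
            1 for age, _, zipcode, _ in members
            if age_range[0] <= age <= age_range[1]
            and zip_range[0] <= zipcode <= zip_range[1]
        )
        with_disease = sum(n for gid, d, n in st if gid == g and d == disease)
        total += with_disease * matching / len(members)
    return total

print(estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000)))  # 1.0
```

Because the QIT keeps exact ages and zipcodes, the share of matching members is known exactly rather than estimated from an interval.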

Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
69
Conclusion
  • Limitations of l-diversity
  • l-diversity is difficult and unnecessary to
    achieve
  • l-diversity is insufficient in preventing
    attribute disclosure
  • t-Closeness as a new privacy measure
  • The overall distribution of sensitive values
    should be public information
  • The separation of the knowledge gain
  • EMD to measure distance
  • EMD captures semantic distance well
  • Simple formulas for three ground distances

70
Conclusions
  • m-invariant tables support republication of
    dynamic datasets
  • Guarding nodes allow individuals to describe
    their privacy requirements better
  • Anatomy outperforms generalization by allowing
    much more accurate data analysis on the published
    data.

71
Questions?
  • Thank you!