Title: Sumathie Sundaresan
1Survey of Privacy Protection for Medical Data
- Sumathie Sundaresan
- Advisor Dr. Huiping Guo
2Abstract
- Expanded scientific knowledge, combined with the
development of the Internet and the widespread use of
computers, has increased the need for strong
privacy protection for medical records. We have
all heard stories of harassment that resulted
from the lack of adequate privacy
protection of medical records.
- "...medical information is routinely shared with
and viewed by third parties who are not involved
in patient care .... The American Medical Records
Association has identified twelve categories of
information seekers outside of the health care
industry who have access to health care files,
including employers, government agencies, credit
bureaus, insurers, educational institutions, and
the media."
3Methods
- Generalization
- k-anonymity
- l-diversity
- t-closeness
- m-invariance
- Personalized Privacy Preservation
- Anatomy
4Privacy preserving data publishing
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
Alice 22 14000 bronchitis
Andy 24 18000 flu
David 23 25000 gastritis
Gary 41 20000 flu
Helen 36 27000 gastritis
Jane 37 33000 dyspepsia
Ken 40 35000 flu
Linda 43 26000 gastritis
Paul 52 33000 dyspepsia
Steve 56 34000 gastritis
5Classification of Attributes
- Key Attribute
- Name, Address, Cell Phone
- Attributes that can uniquely identify an individual
directly
- Always removed before release.
- Quasi-Identifier
- 5-digit ZIP code, birth date, gender
- A set of attributes that can potentially be
linked with external information to re-identify
entities
- 87% of the U.S. population can be uniquely
identified based on these attributes, according
to the Census summary data in 1991.
- Suppressed or generalized
6Classification of Attributes(Contd)
- Sensitive Attribute
- Medical record, wage, etc.
- Always released directly. These attributes are
what the researchers need. It depends on the
requirement.
7Inference attack
Published table
Age Zipcode Disease
21 12000 dyspepsia
22 14000 bronchitis
24 18000 flu
23 25000 gastritis
41 20000 flu
36 27000 gastritis
37 33000 dyspepsia
40 35000 flu
43 26000 gastritis
52 33000 dyspepsia
56 34000 gastritis
An adversary
Name Age Zipcode
Bob 21 12000
Quasi-identifier (QI) attributes
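The linking step behind this inference attack can be sketched in a few lines of Python. This is only an illustration: the row layout and the names `published` and `adversary_knows` are assumptions, and the data is the published table shown above.

```python
# Published table with names removed: (age, zipcode, disease).
published = [
    (21, 12000, "dyspepsia"), (22, 14000, "bronchitis"),
    (24, 18000, "flu"), (23, 25000, "gastritis"),
    (41, 20000, "flu"), (36, 27000, "gastritis"),
    (37, 33000, "dyspepsia"), (40, 35000, "flu"),
    (43, 26000, "gastritis"), (52, 33000, "dyspepsia"),
    (56, 34000, "gastritis"),
]

# External knowledge held by the adversary: Bob's QI values.
adversary_knows = {"name": "Bob", "age": 21, "zipcode": 12000}

# Join the external knowledge with the published table on the QI attributes.
matches = [
    disease
    for age, zipcode, disease in published
    if age == adversary_knows["age"] and zipcode == adversary_knows["zipcode"]
]

print(adversary_knows["name"], "->", matches)
```

Because exactly one published row matches Bob's QI values, removing the name alone does not hide his disease.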
8Generalization
- Transform the QI values into less specific forms
-
Age Zipcode Disease
21 12000 dyspepsia
22 14000 bronchitis
24 18000 flu
23 25000 gastritis
41 20000 flu
36 27000 gastritis
37 33000 dyspepsia
40 35000 flu
43 26000 gastritis
52 33000 dyspepsia
56 34000 gastritis
Age Zipcode Disease
21, 22 12k, 14k dyspepsia
21, 22 12k, 14k bronchitis
23, 24 18k, 25k flu
23, 24 18k, 25k gastritis
36, 41 20k, 27k flu
36, 41 20k, 27k gastritis
37, 43 26k, 35k dyspepsia
37, 43 26k, 35k flu
37, 43 26k, 35k gastritis
52, 56 33k, 34k dyspepsia
52, 56 33k, 34k gastritis
generalize
9Generalization
- Transform each QI value into a less specific form
-
A generalized table
Age Zipcode Disease
21, 22 12k, 14k dyspepsia
21, 22 12k, 14k bronchitis
23, 24 18k, 25k flu
23, 24 18k, 25k gastritis
36, 41 20k, 27k flu
36, 41 20k, 27k gastritis
37, 43 26k, 35k dyspepsia
37, 43 26k, 35k flu
37, 43 26k, 35k gastritis
52, 56 33k, 34k dyspepsia
52, 56 33k, 34k gastritis
An adversary
Name Age Zipcode
Bob 21 12000
10K-Anonymity
- Sweeney came up with a formal protection model
named k-anonymity
- What is k-anonymity?
- A release provides k-anonymity if the information for each person contained in
the release cannot be distinguished from that of at least
k-1 other individuals whose information also appears in
the release.
- Example.
- Suppose you try to identify a man from a release, but
the only information you have is his birth date
and gender. If at least k people in the release meet that
description, the release is k-anonymous.
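A minimal sketch of how the k-anonymity condition could be checked, assuming each row is a pair of (QI values, sensitive value). The function name `is_k_anonymous` is hypothetical, and the data is the generalized table from the Generalization slides with the ranges written as strings.

```python
from collections import Counter

def is_k_anonymous(rows, k):
    """True if every combination of QI values occurs in at least k rows."""
    group_sizes = Counter(qi for qi, _sensitive in rows)
    return all(size >= k for size in group_sizes.values())

# ((age range, zipcode range), disease)
rows = [
    (("[21, 22]", "[12k, 14k]"), "dyspepsia"),
    (("[21, 22]", "[12k, 14k]"), "bronchitis"),
    (("[23, 24]", "[18k, 25k]"), "flu"),
    (("[23, 24]", "[18k, 25k]"), "gastritis"),
    (("[36, 41]", "[20k, 27k]"), "flu"),
    (("[36, 41]", "[20k, 27k]"), "gastritis"),
    (("[37, 43]", "[26k, 35k]"), "dyspepsia"),
    (("[37, 43]", "[26k, 35k]"), "flu"),
    (("[37, 43]", "[26k, 35k]"), "gastritis"),
    (("[52, 56]", "[33k, 34k]"), "dyspepsia"),
    (("[52, 56]", "[33k, 34k]"), "gastritis"),
]

print(is_k_anonymous(rows, 2))  # every QI group has at least 2 rows
print(is_k_anonymous(rows, 3))  # most groups have only 2 rows
```

The check looks only at QI values, which is why k-anonymity by itself says nothing about the sensitive values inside a group.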
12Attacks Against K-Anonymity
- Unsorted Matching Attack
- This attack is based on the order in which tuples
appear in the released table.
- Solution
- Randomly sort the tuples before releasing.
13Attacks Against K-Anonymity(Contd)
- k-Anonymity does not provide privacy if
- Sensitive values in an equivalence class lack
diversity
- The attacker has background knowledge
A 3-anonymous patient table
Zipcode Age Disease
476** 2* Heart Disease
476** 2* Heart Disease
476** 2* Heart Disease
4790* ≥40 Flu
4790* ≥40 Heart Disease
4790* ≥40 Cancer
476** 3* Heart Disease
476** 3* Cancer
476** 3* Cancer
Homogeneity Attack
Bob
Zipcode Age
47678 27
Background Knowledge Attack
Carl
Zipcode Age
47673 36
A. Machanavajjhala et al. l-Diversity: Privacy
Beyond k-Anonymity. ICDE 2006
14l-Diversity
- Distinct l-diversity
- Each equivalence class has at least l
well-represented sensitive values
- Limitation
- Example.
- In one equivalence class, there are ten tuples.
In the Disease attribute, one of them is Cancer,
one is Heart Disease and the remaining eight
are Flu. This satisfies 3-diversity, but the
attacker can still affirm that the target
person's disease is Flu with an accuracy of
80%.
15l-Diversity(Contd)
- Entropy l-diversity
- Each equivalence class not only must have enough
different sensitive values, but also the
different sensitive values must be distributed
evenly enough.
- Sometimes this may be too restrictive. When some
values are very common, the entropy of the entire
table may be very low. This leads to the less
conservative notion of l-diversity.
- Recursive (c,l)-diversity
- The most frequent value does not appear too
frequently
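The difference between distinct and entropy l-diversity can be illustrated on the ten-tuple equivalence class from the earlier example (one Cancer, one Heart Disease, eight Flu). This is a sketch under the usual definition that a class is entropy l-diverse when its entropy is at least log(l); the function names are hypothetical.

```python
import math
from collections import Counter

def distinct_l(values):
    """Number of distinct sensitive values in the equivalence class."""
    return len(set(values))

def entropy_l(values):
    """exp(entropy): the class is entropy l-diverse for every l up to this value."""
    counts = Counter(values)
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)

def top_share(values):
    """Fraction of tuples carrying the most frequent sensitive value."""
    return max(Counter(values).values()) / len(values)

eq_class = ["Cancer", "Heart Disease"] + ["Flu"] * 8

print(distinct_l(eq_class))  # number of distinct diseases
print(top_share(eq_class))   # how confident a guess of "Flu" would be
print(entropy_l(eq_class))   # below 2, so not even entropy 2-diverse
```

The class passes distinct 3-diversity, yet the entropy measure exposes how skewed it is.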
16Limitations of l-Diversity
l-diversity may be difficult and unnecessary to
achieve.
- A single sensitive attribute
- Two values: HIV positive (1%) and HIV negative
(99%)
- Very different degrees of sensitivity
- l-diversity is unnecessary to achieve
- 2-diversity is unnecessary for an equivalence
class that contains only negative records
- l-diversity is difficult to achieve
- Suppose there are 10000 records in total
- To have distinct 2-diversity, there can be at
most 10000 × 1% = 100 equivalence classes
17Limitations of l-Diversity(Contd)
l-diversity is insufficient to prevent attribute
disclosure.
Skewness Attack
- Two sensitive values
- HIV positive (1%) and HIV negative (99%)
- Serious privacy risk
- Consider an equivalence class that contains an
equal number of positive records and negative
records
- l-diversity does not differentiate
- Equivalence class 1: 49 positive + 1 negative
- Equivalence class 2: 1 positive + 49 negative
l-diversity does not consider the overall
distribution of sensitive values
18Limitations of l-Diversity(Contd)
l-diversity is insufficient to prevent attribute
disclosure.
A 3-diverse patient table
Similarity Attack
Zipcode Age Salary Disease
476** 2* 3K Gastric Ulcer
476** 2* 4K Gastritis
476** 2* 5K Stomach Cancer
4790* ≥40 6K Gastritis
4790* ≥40 11K Flu
4790* ≥40 8K Bronchitis
476** 3* 7K Bronchitis
476** 3* 9K Pneumonia
476** 3* 10K Stomach Cancer
Bob
Zip Age
47678 27
- Conclusion
- Bob's salary is in [3K, 5K], which is relatively
low.
- Bob has some stomach-related disease.
l-diversity does not consider semantic meanings
of sensitive values
19t-Closeness A New Privacy Measure
A completely generalized table
Age Zipcode Gender Disease
Flu
Heart Disease
Cancer
. . . . . . . . . . . .
Gastritis
Belief Knowledge
B0
ExternalKnowledge
B1
Overall distribution Q of sensitive values
20t-Closeness A New Privacy Measure
A released table
Age Zipcode Gender Disease
2 479 Male Flu
2 479 Male Heart Disease
2 479 Male Cancer
. . . . . . . . . . . .
50 4766 Gastritis
Belief Knowledge
B0
ExternalKnowledge
B1
Overall distribution Q of sensitive values
B2
Distribution Pi of sensitive values in each
equi-class
21t-Closeness A New Privacy Measure
- Observations
- Q should be public
- Knowledge gain in two parts
- Whole population (from B0 to B1)
- Specific individuals (from B1 to B2)
- We bound knowledge gain between B1 and B2 instead
- Principle
- The distance between Q and Pi should be bounded
by a threshold t.
Belief Knowledge
B0
ExternalKnowledge
B1
Overall distribution Q of sensitive values
B2
Distribution Pi of sensitive values in each
equi-class
22How to calculate EMD
- EMD for numerical attributes
- Ordered distance is a metric
- Non-negativity, symmetry, triangle inequality
- Let r_i = p_i - q_i; then D[P,Q] is calculated as
D[P,Q] = \frac{1}{m-1} \sum_{i=1}^{m} \left| \sum_{j=1}^{i} r_j \right|
23Earth Movers Distance
- Example
- P1 = {3k, 4k, 5k} and Q = {3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k}
- Move 1/9 probability for each of the following
pairs
- 3k->5k, 3k->4k: cost 1/9 × (2+1)/8
- 4k->8k, 4k->7k, 4k->6k: cost 1/9 × (4+3+2)/8
- 5k->11k, 5k->10k, 5k->9k: cost 1/9 × (6+5+4)/8
- Total cost: 1/9 × 27/8 = 0.375
- With P2 = {6k, 8k, 11k}, the total cost
is 0.167 < 0.375. This makes more sense than the
other two distance calculation methods.
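The two totals in this example can be reproduced with a short sketch of the ordered-distance EMD, computed as the sum of absolute running sums of p_i - q_i divided by m - 1. The helper names `ordered_emd` and `distribution` are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def ordered_emd(p, q):
    """EMD under ordered distance for two distributions over the same ordered domain."""
    m = len(p)
    running = Fraction(0)
    total = Fraction(0)
    for p_i, q_i in zip(p, q):
        running += p_i - q_i   # cumulative sum r_1 + ... + r_i
        total += abs(running)
    return total / (m - 1)

def distribution(values, domain):
    """Empirical distribution of values over the ordered domain."""
    return [Fraction(values.count(v), len(values)) for v in domain]

domain = [3, 4, 5, 6, 7, 8, 9, 10, 11]   # salaries in thousands
q = distribution(domain, domain)         # overall distribution Q
p1 = distribution([3, 4, 5], domain)     # equivalence class {3k, 4k, 5k}
p2 = distribution([6, 8, 11], domain)    # equivalence class {6k, 8k, 11k}

print(float(ordered_emd(p1, q)))
print(float(ordered_emd(p2, q)))
```

The class whose salaries are spread across the domain ends up much closer to Q than the class concentrated at the low end.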
24Motivating Example
- A hospital keeps track of the medical records
collected in the last three months.
- The microdata table T(1), and its generalization
T(1), published in Apr. 2007.
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
Alice 22 14000 bronchitis
Andy 24 18000 flu
David 23 25000 gastritis
Gary 41 20000 flu
Helen 36 27000 gastritis
Jane 37 33000 dyspepsia
Ken 40 35000 flu
Linda 43 26000 gastritis
Paul 52 33000 dyspepsia
Steve 56 34000 gastritis
G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis
2 23, 24 18k, 25k flu
2 23, 24 18k, 25k gastritis
3 36, 41 20k, 27k flu
3 36, 41 20k, 27k gastritis
4 37, 43 26k, 35k dyspepsia
4 37, 43 26k, 35k flu
4 37, 43 26k, 35k gastritis
5 52, 56 33k, 34k dyspepsia
5 52, 56 33k, 34k gastritis
2-diverse Generalization T(1)
Microdata T(1)
25Motivating Example
- Bob was hospitalized in Mar. 2007
G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis
2 23, 24 18k, 25k flu
2 23, 24 18k, 25k gastritis
3 36, 41 20k, 27k flu
3 36, 41 20k, 27k gastritis
4 37, 43 26k, 35k dyspepsia
4 37, 43 26k, 35k flu
4 37, 43 26k, 35k gastritis
5 52, 56 33k, 34k dyspepsia
5 52, 56 33k, 34k gastritis
Name Age Zipcode
Bob 21 12000
2-diverse Generalization T(1)
27Motivating Example
- One month later, in May 2007
- Some obsolete tuples are deleted from the
microdata.
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
Alice 22 14000 bronchitis
Andy 24 18000 flu
David 23 25000 gastritis
Gary 41 20000 flu
Helen 36 27000 gastritis
Jane 37 33000 dyspepsia
Ken 40 35000 flu
Linda 43 26000 gastritis
Paul 52 33000 dyspepsia
Steve 56 34000 gastritis
Microdata T(1)
28Motivating Example
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Gary 41 20000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Steve 56 34000 gastritis
Microdata T(1)
29Motivating Example
- Some new records are inserted.
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Emily 25 21000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Gary 41 20000 flu
Mary 46 30000 gastritis
Ray 54 31000 dyspepsia
Steve 56 34000 gastritis
Tom 60 44000 gastritis
Vince 65 36000 flu
Microdata T(2)
30Motivating Example
- The hospital published T(2).
G. ID Age Zipcode Disease
1 21, 23 12k, 25k dyspepsia
1 21, 23 12k, 25k gastritis
2 25, 43 21k, 33k flu
2 25, 43 21k, 33k dyspepsia
2 25, 43 21k, 33k gastritis
3 41, 46 20k, 30k flu
3 41, 46 20k, 30k gastritis
4 54, 56 31k, 34k dyspepsia
4 54, 56 31k, 34k gastritis
5 60, 65 36k, 44k gastritis
5 60, 65 36k, 44k flu
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Emily 25 21000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Gary 41 20000 flu
Mary 46 30000 gastritis
Ray 54 31000 dyspepsia
Steve 56 34000 gastritis
Tom 60 44000 gastritis
Vince 65 36000 flu
2-diverse Generalization T(2)
Microdata T(2)
31Motivating Example
- Consider the previous adversary.
G. ID Age Zipcode Disease
1 21, 23 12k, 25k dyspepsia
1 21, 23 12k, 25k gastritis
2 25, 43 21k, 33k flu
2 25, 43 21k, 33k dyspepsia
2 25, 43 21k, 33k gastritis
3 41, 46 20k, 30k flu
3 41, 46 20k, 30k gastritis
4 54, 56 31k, 34k dyspepsia
4 54, 56 31k, 34k gastritis
5 60, 65 36k, 44k gastritis
5 60, 65 36k, 44k flu
Name Age Zipcode
Bob 21 12000
2-diverse Generalization T(2)
32Motivating Example
- What the adversary learns from T(1).
- What the adversary learns from T(2).
- So Bob must have contracted dyspepsia!
- A new generalization principle is needed.
G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis
Name Age Zipcode
Bob 21 12000
G. ID Age Zipcode Disease
1 21, 23 12k, 25k dyspepsia
1 21, 23 12k, 25k gastritis
Name Age Zipcode
Bob 21 12000
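The adversary's reasoning across the two releases amounts to a set intersection. A minimal sketch, with hypothetical variable names and the candidate diseases taken from Bob's QI group in each release:

```python
# Diseases in the QI group that covers Bob (age 21, zipcode 12000) in each release.
from_t1 = {"dyspepsia", "bronchitis"}   # first release, group 1
from_t2 = {"dyspepsia", "gastritis"}    # second release, group 1

# Bob's disease must be consistent with both releases.
remaining = from_t1 & from_t2
print(remaining)
```

Each release is 2-diverse on its own, but together they leave a single candidate.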
33The critical absence phenomenon
Microdata T(2)
What the adversary learns from T(1)
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Emily 25 21000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Gary 41 20000 flu
Mary 46 30000 gastritis
Ray 54 31000 dyspepsia
Steve 56 34000 gastritis
Tom 60 44000 gastritis
Vince 65 36000 flu
Name Age Zipcode
Bob 21 12000
G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis
- We refer to such a phenomenon as the critical
absence phenomenon
- A new generalization method is needed.
34Name Group-ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
c1 1 21, 22 12k, 14k bronchitis
David 2 23, 25 21k, 25k gastritis
Emily 2 23, 25 21k, 25k flu
Jane 3 37, 43 26k, 33k dyspepsia
c2 3 37, 43 26k, 33k flu
Linda 3 37, 43 26k, 33k gastritis
Gary 4 41, 46 20k, 30k flu
Mary 4 41, 46 20k, 30k gastritis
Ray 5 54, 56 31k, 34k dyspepsia
Steve 5 54, 56 31k, 34k gastritis
Tom 6 60, 65 36k, 44k gastritis
Vince 6 60, 65 36k, 44k flu
Name Age Zipcode Disease
Bob 21 12000 dyspepsia
David 23 25000 gastritis
Emily 25 21000 flu
Jane 37 33000 dyspepsia
Linda 43 26000 gastritis
Gary 41 20000 flu
Mary 46 30000 gastritis
Ray 54 31000 dyspepsia
Steve 56 34000 gastritis
Tom 60 44000 gastritis
Vince 65 36000 flu
Microdata T(2)
Counterfeited generalization T(2)
Group-ID Count
1 1
3 1
The auxiliary relation R(2) for T(2)
35Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
c1 1 21, 22 12k, 14k bronchitis
David 2 23, 25 21k, 25k gastritis
Emily 2 23, 25 21k, 25k flu
Jane 3 37, 43 26k, 33k dyspepsia
c2 3 37, 43 26k, 33k flu
Linda 3 37, 43 26k, 33k gastritis
Gary 4 41, 46 20k, 30k flu
Mary 4 41, 46 20k, 30k gastritis
Ray 5 54, 56 31k, 34k dyspepsia
Steve 5 54, 56 31k, 34k gastritis
Tom 6 60, 65 36k, 44k gastritis
Vince 6 60, 65 36k, 44k flu
Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
Alice 1 21, 22 12k, 14k bronchitis
Andy 2 23, 24 18k, 25k flu
David 2 23, 24 18k, 25k gastritis
Gary 3 36, 41 20k, 27k flu
Helen 3 36, 41 20k, 27k gastritis
Jane 4 37, 43 26k, 35k dyspepsia
Ken 4 37, 43 26k, 35k flu
Linda 4 37, 43 26k, 35k gastritis
Paul 5 52, 56 33k, 34k dyspepsia
Steve 5 52, 56 33k, 34k gastritis
Counterfeited Generalization T(2)
Generalization T(1)
Group-ID Count
1 1
3 1
Name Age Zipcode
Bob 21 12000
The auxiliary relation R(2) for T(2)
36m-uniqueness
- A generalized table T(j) is m-unique, if and
only if
- each QI-group in T(j) contains at least m
tuples
- all tuples in the same QI-group have different
sensitive values.
G. ID Age Zipcode Disease
1 21, 22 12k, 14k dyspepsia
1 21, 22 12k, 14k bronchitis
2 23, 24 18k, 25k flu
2 23, 24 18k, 25k gastritis
3 36, 41 20k, 27k flu
3 36, 41 20k, 27k gastritis
4 37, 43 26k, 35k dyspepsia
4 37, 43 26k, 35k flu
4 37, 43 26k, 35k gastritis
5 52, 56 33k, 34k dyspepsia
5 52, 56 33k, 34k gastritis
A 2-unique generalized table
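The two m-uniqueness conditions can be checked mechanically. A minimal sketch, assuming the table is given as (group ID, disease) pairs; the function name `is_m_unique` is hypothetical and the data is the 2-unique table above.

```python
from collections import defaultdict

def is_m_unique(table, m):
    """Every QI-group has at least m tuples, all with different sensitive values."""
    groups = defaultdict(list)
    for group_id, disease in table:
        groups[group_id].append(disease)
    return all(
        len(values) >= m and len(set(values)) == len(values)
        for values in groups.values()
    )

table = [
    (1, "dyspepsia"), (1, "bronchitis"),
    (2, "flu"), (2, "gastritis"),
    (3, "flu"), (3, "gastritis"),
    (4, "dyspepsia"), (4, "flu"), (4, "gastritis"),
    (5, "dyspepsia"), (5, "gastritis"),
]

print(is_m_unique(table, 2))
print(is_m_unique(table, 3))
```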
37Signature
- The signature of Bob in T(1) is {dyspepsia,
bronchitis}
- The signature of Jane in T(1) is {dyspepsia,
flu, gastritis}
Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
Alice 1 21, 22 12k, 14k bronchitis
Jane 4 37, 43 26k, 35k dyspepsia
Ken 4 37, 43 26k, 35k flu
Linda 4 37, 43 26k, 35k gastritis
T(1)
38The m-invariance principle
- A sequence of generalized tables T(1), ..., T(n)
is m-invariant, if and only if
- T(1), ..., T(n) are m-unique, and
- each individual has the same signature in every
generalized table in which s/he is involved.
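The definition can be turned into a small check. The sketch below is an illustration, not the publication algorithm itself: the function names are hypothetical, and names are attached to tuples only so that signatures can be compared (a published table would not contain them). The data is the generalization T(1), the counterfeited generalization T(2) with c1 and c2, and the plain 2-diverse generalization T(2).

```python
from collections import defaultdict

def group_values(table):
    """Map each group ID to the list of sensitive values it contains."""
    groups = defaultdict(list)
    for _name, gid, disease in table:
        groups[gid].append(disease)
    return groups

def is_m_unique(table, m):
    return all(len(v) >= m and len(set(v)) == len(v)
               for v in group_values(table).values())

def signatures(table):
    """Map each individual to the set of sensitive values in his or her QI-group."""
    groups = group_values(table)
    return {name: frozenset(groups[gid]) for name, gid, _disease in table}

def is_m_invariant(tables, m):
    if not all(is_m_unique(t, m) for t in tables):
        return False
    seen = {}
    for table in tables:
        for name, sig in signatures(table).items():
            if seen.setdefault(name, sig) != sig:   # signature changed
                return False
    return True

t1 = [("Bob", 1, "dyspepsia"), ("Alice", 1, "bronchitis"),
      ("Andy", 2, "flu"), ("David", 2, "gastritis"),
      ("Gary", 3, "flu"), ("Helen", 3, "gastritis"),
      ("Jane", 4, "dyspepsia"), ("Ken", 4, "flu"), ("Linda", 4, "gastritis"),
      ("Paul", 5, "dyspepsia"), ("Steve", 5, "gastritis")]

t2_counterfeited = [("Bob", 1, "dyspepsia"), ("c1", 1, "bronchitis"),
                    ("David", 2, "gastritis"), ("Emily", 2, "flu"),
                    ("Jane", 3, "dyspepsia"), ("c2", 3, "flu"), ("Linda", 3, "gastritis"),
                    ("Gary", 4, "flu"), ("Mary", 4, "gastritis"),
                    ("Ray", 5, "dyspepsia"), ("Steve", 5, "gastritis"),
                    ("Tom", 6, "gastritis"), ("Vince", 6, "flu")]

t2_plain = [("Bob", 1, "dyspepsia"), ("David", 1, "gastritis"),
            ("Emily", 2, "flu"), ("Jane", 2, "dyspepsia"), ("Linda", 2, "gastritis"),
            ("Gary", 3, "flu"), ("Mary", 3, "gastritis"),
            ("Ray", 4, "dyspepsia"), ("Steve", 4, "gastritis"),
            ("Tom", 5, "gastritis"), ("Vince", 5, "flu")]

print(is_m_invariant([t1, t2_counterfeited], 2))
print(is_m_invariant([t1, t2_plain], 2))
```

With the counterfeits c1 and c2, every individual who appears in both releases keeps the same signature; without them, Bob's signature changes between releases.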
39- A sequence of generalized tables T(1), ..., T(n)
is m-invariant, if and only if
- T(1), ..., T(n) are m-unique, and
- each individual has the same signature in every
generalized table in which s/he is involved.
Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
c1 1 21, 22 12k, 14k bronchitis
David 2 23, 25 21k, 25k gastritis
Emily 2 23, 25 21k, 25k flu
Jane 3 37, 43 26k, 33k dyspepsia
c2 3 37, 43 26k, 33k flu
Linda 3 37, 43 26k, 33k gastritis
Gary 4 41, 46 20k, 30k flu
Mary 4 41, 46 20k, 30k gastritis
Ray 5 54, 56 31k, 34k dyspepsia
Steve 5 54, 56 31k, 34k gastritis
Tom 6 60, 65 36k, 44k gastritis
Vince 6 60, 65 36k, 44k flu
Name G.ID Age Zipcode Disease
Bob 1 21, 22 12k, 14k dyspepsia
Alice 1 21, 22 12k, 14k bronchitis
Andy 2 23, 24 18k, 25k flu
David 2 23, 24 18k, 25k gastritis
Gary 3 36, 41 20k, 27k flu
Helen 3 36, 41 20k, 27k gastritis
Jane 4 37, 43 26k, 35k dyspepsia
Ken 4 37, 43 26k, 35k flu
Linda 4 37, 43 26k, 35k gastritis
Paul 5 52, 56 33k, 34k dyspepsia
Steve 5 52, 56 33k, 34k gastritis
Generalization T(2)
Generalization T(1)
42Motivation 1 Personalization
- Andy does not want anyone to know that he had a
stomach problem
- Sarah does not mind at all if others find out
that she had flu
A 2-diverse table
An external database
Name Age Sex Zipcode
Andy 4 M 12000
Bill 5 M 14000
Ken 6 M 18000
Nash 9 M 19000
Mike 7 M 17000
Alice 12 F 22000
Betty 19 F 24000
Linda 21 F 33000
Jane 25 F 34000
Sarah 28 F 37000
Mary 56 F 58000
Age Sex Zipcode Disease
1, 5 M 10001, 15000 gastric ulcer
1, 5 M 10001, 15000 dyspepsia
6, 10 M 15001, 20000 pneumonia
6, 10 M 15001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21, 60 F 30001, 60000 gastritis
21, 60 F 30001, 60000 gastritis
21, 60 F 30001, 60000 flu
21, 60 F 30001, 60000 flu
43Motivation 2 SA generalization
- How many female patients are there with age above
30?
- Estimated answer: 4 × (60 - 30) / (60 - 20) = 3
- Real answer: 1
An external database
A generalized table
Name Age Sex Zipcode
Andy 4 M 12000
Bill 5 M 14000
Ken 6 M 18000
Nash 9 M 19000
Mike 7 M 17000
Alice 12 F 22000
Betty 19 F 24000
Linda 21 F 33000
Jane 25 F 34000
Sarah 28 F 37000
Mary 56 F 58000
Age Sex Zipcode Disease
1, 5 M 10001, 15000 gastric ulcer
1, 5 M 10001, 15000 dyspepsia
6, 10 M 15001, 20000 pneumonia
6, 10 M 15001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21, 60 F 30001, 60000 gastritis
21, 60 F 30001, 60000 gastritis
21, 60 F 30001, 60000 flu
21, 60 F 30001, 60000 flu
44Motivation 2 SA generalization (cont.)
- Generalization of the sensitive attribute is
beneficial in this case
A better generalized table
An external database
Age Sex Zipcode Disease
1, 5 M 10001, 15000 gastric ulcer
1, 5 M 10001, 15000 dyspepsia
6, 10 M 15001, 20000 pneumonia
6, 10 M 15001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21, 30 F 30001, 40000 gastritis
21, 30 F 30001, 40000 gastritis
21, 30 F 30001, 40000 flu
56 F 58000 respiratory infection
Name Age Sex Zipcode
Andy 4 M 12000
Bill 5 M 14000
Ken 6 M 18000
Nash 9 M 19000
Mike 7 M 17000
Alice 12 F 22000
Betty 19 F 24000
Linda 21 F 33000
Jane 25 F 34000
Sarah 28 F 37000
Mary 56 F 58000
45Personalized anonymity
- We propose
- a mechanism to capture personalized privacy
requirements
- criteria for measuring the degree of security
provided by a generalized table
46Guarding node
- Andy does not want anyone to know that he had a
stomach problem
- He can specify stomach disease as the guarding
node for his tuple
- The data publisher should prevent an adversary
from associating Andy with stomach disease
Name Age Sex Zipcode Disease guarding node
Andy 4 M 12000 gastric ulcer stomach disease
47Guarding node
- Sarah is willing to disclose her exact symptom
- She can specify Ø as the guarding node for her
tuple
Name Age Sex Zipcode Disease guarding node
Sarah 28 F 37000 flu Ø
48Guarding node
- Bill does not have any special preference
- He can specify the guarding node for his tuple as
the same with his sensitive value
Name Age Sex Zipcode Disease guarding node
Bill 5 M 14000 dyspepsia dyspepsia
49A personalized approach
Name Age Sex Zipcode Disease guarding node
Andy 4 M 12000 gastric ulcer stomach disease
Bill 5 M 14000 dyspepsia dyspepsia
Ken 6 M 18000 pneumonia respiratory infection
Nash 9 M 19000 bronchitis bronchitis
Alice 12 F 22000 flu flu
Betty 19 F 24000 pneumonia pneumonia
Linda 21 F 33000 gastritis gastritis
Jane 25 F 34000 gastritis Ø
Sarah 28 F 37000 flu Ø
Mary 56 F 58000 flu flu
50Personalized anonymity
Name Age Sex Zipcode Disease guarding node
Andy 4 M 12000 gastric ulcer stomach disease
Bill 5 M 14000 dyspepsia dyspepsia
Ken 6 M 18000 pneumonia respiratory infection
Nash 9 M 19000 bronchitis bronchitis
Alice 12 F 22000 flu flu
Betty 19 F 24000 pneumonia pneumonia
Linda 21 F 33000 gastritis gastritis
Jane 25 F 34000 gastritis Ø
Sarah 28 F 37000 flu Ø
Mary 56 F 58000 flu flu
- A table satisfies personalized anonymity with a
parameter pbreach
- Iff no adversary can breach the privacy
requirement of any tuple with a probability above
pbreach
- If pbreach = 0.3, then any adversary should have
no more than a 30% probability of finding out that
- Andy had a stomach disease
- Bill had dyspepsia
- etc
51Personalized anonymity
- Personalized anonymity with respect to a
predefined parameter pbreach
- an adversary can breach the privacy requirement
of any tuple with a probability at most pbreach
- We need a method for calculating the breach
probabilities
Age Sex Zipcode Disease
1, 10 M 10001, 20000 gastric ulcer
1, 10 M 10001, 20000 dyspepsia
1, 10 M 10001, 20000 pneumonia
1, 10 M 10001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21 F 33000 stomach disease
25 F 34000 gastritis
28 F 37000 flu
56 F 58000 respiratory infection
What is the probability that Andy had some
stomach problem?
52Combinatorial reconstruction
- Assumptions
- the adversary has no prior knowledge about each
individual
- every individual involved in the microdata also
appears in the external database
53Combinatorial reconstruction
- Andy does not want anyone to know that he had
some stomach problem
- What is the probability that the adversary can
find out that Andy had a stomach disease?
Name Age Sex Zipcode
Andy 4 M 12000
Bill 5 M 14000
Ken 6 M 18000
Nash 9 M 19000
Mike 7 M 17000
Alice 12 F 22000
Betty 19 F 24000
Linda 21 F 33000
Jane 25 F 34000
Sarah 28 F 37000
Mary 56 F 58000
Age Sex Zipcode Disease
1, 10 M 10001, 20000 gastric ulcer
1, 10 M 10001, 20000 dyspepsia
1, 10 M 10001, 20000 pneumonia
1, 10 M 10001, 20000 bronchitis
11, 20 F 20001, 25000 flu
11, 20 F 20001, 25000 pneumonia
21 F 33000 stomach disease
25 F 34000 gastritis
28 F 37000 flu
56 F 58000 respiratory infection
54Combinatorial reconstruction (cont.)
- Can each individual appear more than once?
- No: the primary case
- Yes: the non-primary case
- Some possible reconstructions
the primary case
the non-primary case
Andy
Bill
Ken
Nash
Mike
Andy
Bill
Ken
Nash
Mike
gastric ulcer
dyspepsia
pneumonia
bronchitis
gastric ulcer
dyspepsia
pneumonia
bronchitis
56Breach probability (primary)
Andy
Bill
Ken
Nash
Mike
gastric ulcer
dyspepsia
pneumonia
bronchitis
- 120 possible reconstructions in total
- If Andy is associated with a stomach disease in
nb reconstructions
- The probability that the adversary should
associate Andy with some stomach problem is nb /
120
- Andy is associated with
- gastric ulcer in 24 reconstructions
- dyspepsia in 24 reconstructions
- gastritis in 0 reconstructions
- nb = 48
- The breach probability for Andy's tuple is 48 /
120 = 2 / 5
57Breach probability (non-primary)
Andy
Bill
Ken
Nash
Mike
gastric ulcer
dyspepsia
pneumonia
bronchitis
- 625 possible reconstructions in total
- Andy is associated with gastric ulcer or
dyspepsia or gastritis in 225 reconstructions
- nb = 225
- The breach probability for Andy's tuple is
225 / 625 = 9 / 25
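Both counts can be verified by brute-force enumeration. In this sketch a reconstruction assigns an owner to each of the four tuples in Andy's QI group; in the primary case the owners must be distinct, in the non-primary case they may repeat. The function and variable names are hypothetical.

```python
from itertools import permutations, product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]          # candidates from the external database
tuples = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}    # values under Andy's guarding node

def count_breaches(reconstructions, target="Andy"):
    """Return (reconstructions linking target to a stomach disease, all reconstructions)."""
    total = breached = 0
    for owners in reconstructions:
        total += 1
        if any(owner == target and disease in stomach
               for owner, disease in zip(owners, tuples)):
            breached += 1
    return breached, total

# Primary case: each individual owns at most one tuple.
primary = count_breaches(permutations(people, len(tuples)))
# Non-primary case: an individual may own several tuples.
non_primary = count_breaches(product(people, repeat=len(tuples)))

print(primary, non_primary)
```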
58Defect of generalization
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
- WHERE Disease = 'pneumonia' AND Age IN [0, 30]
- AND Zipcode IN [10001, 20000]
Age Sex Zipcode Disease
21, 60 M 10001, 60000 pneumonia
21, 60 M 10001, 60000 dyspepsia
21, 60 M 10001, 60000 dyspepsia
21, 60 M 10001, 60000 pneumonia
61, 70 F 10001, 60000 flu
61, 70 F 10001, 60000 gastritis
61, 70 F 10001, 60000 flu
61, 70 F 10001, 60000 bronchitis
- Estimated answer: 2 × p, where p is the
probability that each of the two pneumonia tuples satisfies
the query conditions
59Defect of generalization (cont.)
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
- WHERE Disease = 'pneumonia' AND Age IN [0, 30]
- AND Zipcode IN [10001, 20000]
- p = Area( R1 ∩ Q ) / Area( R1 ) = 0.05
- Estimated answer for query A: 2 × p = 0.1
Age Sex Zipcode Disease
21, 60 M 10001, 60000 pneumonia
21, 60 M 10001, 60000 pneumonia
60Defect of generalization (cont.)
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
- WHERE Disease = 'pneumonia' AND Age IN [0, 30]
- AND Zipcode IN [10001, 20000]
- Estimated answer from the generalized table: 0.1
- The exact answer should be 1
Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
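The gap between the estimate and the exact answer can be reproduced with a short sketch. It assumes, as the estimate on the slides does, that tuples are spread uniformly over the generalized ranges; counting integer values in each range is an additional assumption made here, and `overlap` is a hypothetical helper.

```python
from fractions import Fraction

def overlap(lo, hi, q_lo, q_hi):
    """Fraction of the integers in [lo, hi] that also fall in [q_lo, q_hi]."""
    common = max(0, min(hi, q_hi) - max(lo, q_lo) + 1)
    return Fraction(common, hi - lo + 1)

# QI group R1: Age [21, 60], Zipcode [10001, 60000], holding 2 pneumonia tuples.
# Query A:     Age [0, 30],  Zipcode [10001, 20000].
p = overlap(21, 60, 0, 30) * overlap(10001, 60000, 10001, 20000)
estimate = 2 * p

# Exact answer from the microdata: (age, zipcode, disease).
microdata = [
    (23, 11000, "pneumonia"), (27, 13000, "dyspepsia"),
    (35, 59000, "dyspepsia"), (59, 12000, "pneumonia"),
    (61, 54000, "flu"), (65, 25000, "gastritis"),
    (65, 25000, "flu"), (70, 30000, "bronchitis"),
]
exact = sum(
    1 for age, zipcode, disease in microdata
    if disease == "pneumonia" and 0 <= age <= 30 and 10001 <= zipcode <= 20000
)

print(float(p), float(estimate), exact)
```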
61Basic Idea of Anatomy
- For a given microdata table, Anatomy releases a
quasi-identifier table (QIT) and a sensitive
table (ST)
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
Sensitive Table (ST)
Quasi-identifier Table (QIT)
microdata
62Basic Idea of Anatomy (cont.)
- 1. Select a partition of the tuples
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
QI group 1
QI group 2
a 2-diverse partition
63Basic Idea of Anatomy (cont.)
- 2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition
Disease
pneumonia
dyspepsia
dyspepsia
pneumonia
flu
gastritis
flu
bronchitis
Age Sex Zipcode
23 M 11000
27 M 13000
35 M 59000
59 M 12000
61 F 54000
65 F 25000
65 F 25000
70 F 30000
group 1
group 2
quasi-identifier table (QIT)
sensitive table (ST)
64Basic Idea of Anatomy (cont.)
- 2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition
Group-ID Disease
1 pneumonia
1 dyspepsia
1 dyspepsia
1 pneumonia
2 flu
2 gastritis
2 flu
2 bronchitis
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
quasi-identifier table (QIT)
sensitive table (ST)
65Basic Idea of Anatomy (cont.)
- 2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
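The second step can be sketched as follows. The sketch assumes the partition has already been chosen (here the 2-diverse partition from the slides, given as one group ID per tuple); how Anatomy selects that partition is not shown. The function name `anatomize` is hypothetical.

```python
from collections import Counter

def anatomize(microdata, group_of):
    """Split microdata into a QIT (QI values + group ID) and an ST (per-group disease counts)."""
    qit = [
        (age, sex, zipcode, group_of[i])
        for i, (age, sex, zipcode, _disease) in enumerate(microdata)
    ]
    st = Counter(
        (group_of[i], disease)
        for i, (_age, _sex, _zipcode, disease) in enumerate(microdata)
    )
    return qit, st

microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
group_of = [1, 1, 1, 1, 2, 2, 2, 2]   # the selected partition

qit, st = anatomize(microdata, group_of)
for (group_id, disease), count in sorted(st.items()):
    print(group_id, disease, count)
```

The QIT keeps exact QI values, and the ST keeps only per-group disease counts, so the link between a row and its disease is hidden inside each group.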
66Privacy Preservation
- From a pair of QIT and ST generated from an
l-diverse partition, the adversary can infer the
sensitive value of each individual with
confidence at most 1/l
Name Age Sex Zipcode
Bob 23 M 11000
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
67Accuracy of Data Analysis
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
- WHERE Disease = 'pneumonia' AND Age IN [0, 30]
- AND Zipcode IN [10001, 20000]
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
68Accuracy of Data Analysis (cont.)
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
- WHERE Disease = 'pneumonia' AND Age IN [0, 30]
- AND Zipcode IN [10001, 20000]
- 2 patients have contracted pneumonia
- 2 out of 4 patients satisfy the query condition
on Age and Zipcode
- Estimated answer for query A: 2 × 2 / 4 = 1,
which is also the actual result from the original
microdata
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
t1t2 t3 t4
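The estimate for query A from the anatomized tables can be written out as a sketch: for each group, multiply the ST count of the queried disease by the fraction of the group's QIT tuples that satisfy the QI conditions. The function name `estimate_count` is hypothetical; the tables are the QIT and ST shown on the slides.

```python
from fractions import Fraction

# QIT: (age, sex, zipcode, group ID)
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
# ST: (group ID, disease) -> count
st = {
    (1, "dyspepsia"): 2, (1, "pneumonia"): 2,
    (2, "bronchitis"): 1, (2, "flu"): 2, (2, "gastritis"): 1,
}

def estimate_count(disease, qi_predicate):
    """Estimated COUNT(*) for a disease plus a condition on the QI attributes."""
    total = Fraction(0)
    for group_id in {row[3] for row in qit}:
        group = [row for row in qit if row[3] == group_id]
        matching = sum(1 for age, _sex, zipcode, _g in group
                       if qi_predicate(age, zipcode))
        total += st.get((group_id, disease), 0) * Fraction(matching, len(group))
    return total

answer = estimate_count(
    "pneumonia",
    lambda age, zipcode: 0 <= age <= 30 and 10001 <= zipcode <= 20000,
)
print(answer)
```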
69Conclusion
- Limitations of l-diversity
- l-diversity is difficult and unnecessary to
achieve
- l-diversity is insufficient in preventing
attribute disclosure
- t-Closeness as a new privacy measure
- The overall distribution of sensitive values
should be public information
- The separation of the knowledge gain
- EMD to measure distance
- EMD captures semantic distance well
- Simple formulas for three ground distances
70Conclusions
- m-invariant tables support republication of
dynamic datasets
- Guarding nodes allow individuals to describe
their privacy requirements better
- Anatomy outperforms generalization by allowing
much more accurate data analysis on the published
data.
71Questions?