PrivacyMaxEnt: Integrating Background Knowledge in Privacy Quantification

Slides: 34

Provided by: ecs4

Learn more at: https://web.ecs.syr.edu

Transcript and Presenter's Notes

1
Privacy-MaxEnt: Integrating Background Knowledge
in Privacy Quantification

Wenliang (Kevin) Du,
Zhouxuan Teng,
and Zutao Zhu.
Department of Electrical Engineering & Computer
Science
Syracuse University, Syracuse, New York.

2
Introduction

Privacy-Preserving Data Publishing.
The impact of background knowledge
How does it affect privacy?
How to measure its impact on privacy?
Integrate background knowledge in privacy
quantification.
Privacy-MaxEnt: A systematic approach.
Based on well-established theories.
Evaluation.

3
Privacy-Preserving Data Publishing

Data disguise methods
Randomization
Generalization (e.g. Mondrian)
Bucketization (e.g. Anatomy)
Our Privacy-MaxEnt method can be applied to
Generalization and Bucketization.
We pick Bucketization in our presentation.

4
Data Sets
Identifier
Quasi-Identifier (QI)
Sensitive Attribute (SA)
5
Bucketized Data
Quasi-Identifier (QI)
Sensitive Attribute (SA)
P(Breast cancer | female, college, bucket 1) = 1/4
P(Breast cancer | female, junior, bucket 2) = 1/3
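
The two probabilities above are what an adversary with no background knowledge would read off the bucketized table: within a bucket, every QI value is equally likely to pair with each listed SA value. A minimal sketch of that baseline follows. The bucket contents are hypothetical; only the bucket sizes (4 and 3) and the single breast cancer record per bucket are implied by the slide, and `baseline_posterior` is an illustrative name.

```python
from collections import Counter
from fractions import Fraction

def baseline_posterior(bucket_sa_values):
    """Return P(s | q, bucket) for any q in the bucket, assuming no background knowledge."""
    counts = Counter(bucket_sa_values)
    total = len(bucket_sa_values)
    return {s: Fraction(n, total) for s, n in counts.items()}

# Hypothetical SA columns; only the breast cancer counts follow the slide.
bucket1 = ["breast cancer", "flu", "diabetes", "hiv"]
bucket2 = ["breast cancer", "flu", "pneumonia"]

p1 = baseline_posterior(bucket1)
p2 = baseline_posterior(bucket2)
print(p1["breast cancer"], p2["breast cancer"])  # 1/4 1/3
```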
6
Impact of Background Knowledge

Background Knowledge
It's rare for males to have breast cancer.

This analysis is hard for large data sets.

7
Previous Studies

Martin et al., ICDE'07.
First formal study on background knowledge.
Chen, LeFevre, and Ramakrishnan, VLDB'07.
Improves the previous work.
They deal with rule-based knowledge.
Deterministic knowledge.
Background knowledge can be much more
complicated.
Uncertain knowledge

8
Complicated Background Knowledge

Rule-based knowledge
P(s | q) = 1.
P(s | q) = 0.
Probability-Based Knowledge
P(s | q) = 0.2.
P(s | Alice) = 0.2.
Vague background knowledge
0.3 ≤ P(s | q) ≤ 0.5.
Miscellaneous types
P(s | q1) + P(s | q2) = 0.7.
One of Alice and Bob has Lung Cancer.

9
Challenges

How to analyze privacy in a systematic way for
large data sets and complicated background
knowledge?

What do we want to compute?
P(S | Q), given the background knowledge and
the published data set.
P(S | Q) is a primitive for most privacy metrics.

Directly computing P(S | Q) is hard.

10
Our Approach
Consider P(S | Q) as a variable x (a vector).
Background Knowledge -> Constraints on x
Published Data -> Constraints on x
Solve for x: the most unbiased solution
Public Information
11
Maximum Entropy Principle

"Information theory provides a constructive
criterion for setting up probability
distributions on the basis of partial knowledge,
and leads to a type of statistical inference
which is called the maximum entropy estimate. It
is the least biased estimate possible on the given
information."
E. T. Jaynes, 1957.

12
The MaxEnt Approach
Background Knowledge -> Constraints on P(S | Q)
Published Data -> Constraints on P(S | Q)
Maximum Entropy Estimate -> Estimate P(S | Q)
Public Information
13
Entropy
Because H(S | Q, B) = H(Q, S, B) - H(Q, B),
the constraints should use P(Q, S, B) as variables.
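
The identity on this slide is the chain rule for entropy, and it can be checked numerically. The sketch below uses a small made-up joint distribution over (q, s, b); the values and names are illustrative only and do not come from the talk.

```python
import math

def entropy(dist):
    """Shannon entropy (natural log) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution P(Q, S, B); the probabilities sum to 1.
joint = {
    ("q1", "s1", "b1"): 0.2,
    ("q1", "s2", "b1"): 0.1,
    ("q2", "s1", "b1"): 0.1,
    ("q2", "s2", "b2"): 0.3,
    ("q3", "s3", "b2"): 0.3,
}

# Marginal P(Q, B), obtained by summing S out of the joint.
marg_qb = {}
for (q, s, b), p in joint.items():
    marg_qb[(q, b)] = marg_qb.get((q, b), 0.0) + p

# H(S | Q, B) computed directly from P(s | q, b) = P(q, s, b) / P(q, b).
h_cond = -sum(p * math.log(p / marg_qb[(q, b)]) for (q, s, b), p in joint.items() if p > 0)

# The same quantity through the chain rule H(Q, S, B) - H(Q, B).
h_diff = entropy(joint) - entropy(marg_qb)
print(abs(h_cond - h_diff) < 1e-12)
```

Since H(Q, B) is fixed by the published data, maximizing H(Q, S, B) over the joint probabilities is equivalent to maximizing the conditional entropy.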
14
Maximum Entropy Estimate

Let vector x = P(Q, S, B).
Find the value for x that maximizes its entropy
H(Q, S, B), while satisfying
h1(x) = c1, ..., hu(x) = cu (equality constraints)
g1(x) ≤ d1, ..., gv(x) ≤ dv (inequality
constraints)
A special case of Non-Linear Programming.

15
Constraints from Knowledge
Background Knowledge
Constraints on P(Q, S, B)

Linear model: quite generic.
Conditional probability
P(S | Q) = P(Q, S) / P(Q).
Background knowledge has nothing to do with B
P(Q, S) = P(Q, S, B1) + ... + P(Q, S, Bm).
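
Combining the two formulas above, a statement such as P(s | q) = 0.2 becomes a linear equation over the joint variables: the sum of P(q, s, b) over all buckets equals 0.2 * P(q), where P(q) is known from the published QI values. The sketch below shows one way to encode that. The function names and the example distribution are assumptions made for illustration, not the authors' code.

```python
def knowledge_to_constraint(q, s, prob, p_q, buckets):
    """Encode P(s | q) = prob as sum_b P(q, s, b) = prob * P(q).

    Returns (coefficients, right-hand side), with coefficients keyed by (q, s, b).
    """
    coeffs = {(q, s, b): 1.0 for b in buckets}
    return coeffs, prob * p_q

def residual(coeffs, rhs, joint):
    """Left-hand side minus right-hand side for a candidate joint distribution."""
    return sum(w * joint.get(key, 0.0) for key, w in coeffs.items()) - rhs

# Hypothetical joint distribution in which P(q1) = 0.4 and P(q1, s1) = 0.08.
joint = {
    ("q1", "s1", "b1"): 0.05,
    ("q1", "s1", "b2"): 0.03,
    ("q1", "s2", "b1"): 0.32,
    ("q2", "s2", "b2"): 0.60,
}

coeffs, rhs = knowledge_to_constraint("q1", "s1", 0.2, p_q=0.4, buckets=["b1", "b2"])
print(abs(residual(coeffs, rhs, joint)) < 1e-12)
```

An inequality such as 0.3 ≤ P(s | q) ≤ 0.5 would yield two rows of the same shape with the equality replaced by bounds.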

16
Constraints from Published Data
Published Data Set D
Constraints on P(Q, S, B)

Constraints
Truth and only the truth.
Absolutely correct for the original data set.
No inference.

17
Assignment and Constraints
Observation: the original data is one of the
assignments.
Constraint: true for all possible assignments.
18
QI Constraint
Constraint
Example
19
SA Constraint
Constraint
Example
20
Zero Constraint

P(q, s, b) = 0, if q or s does not appear in
Bucket b.
We can reduce the number of variables.
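
The reduction can be illustrated by enumerating only the (q, s, b) triples in which both q and s occur in bucket b; every other variable is fixed at zero and never enters the optimization. The bucket contents and the helper name below are hypothetical.

```python
def nonzero_variables(buckets):
    """List the (q, s, b) triples that the zero constraint does not force to 0.

    buckets maps a bucket id to (list of QI values, list of SA values).
    """
    variables = []
    for b, (qis, sas) in buckets.items():
        for q in sorted(set(qis)):
            for s in sorted(set(sas)):
                variables.append((q, s, b))
    return variables

# Hypothetical bucketized table with 4 distinct QI values and 3 distinct SA values.
buckets = {
    "b1": (["q1", "q1", "q2"], ["s1", "s2", "s2"]),
    "b2": (["q3", "q4"], ["s1", "s3"]),
}

kept = nonzero_variables(buckets)
total = 4 * 3 * len(buckets)  # all (q, s, b) combinations before the reduction
print(len(kept), "of", total)  # 8 of 24
```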

21
Theoretic Properties

Soundness: Are the constraints correct?
Easy to prove.
Completeness: Have we missed any constraint?
See our theorems and proofs.
Conciseness: Are there redundant constraints?
Only one redundant constraint in each bucket.
Consistency: Is our approach consistent with the
existing methods (i.e., when background knowledge
is Ø)?

22
Completeness w.r.t. Equations

Have we missed any equality constraint?
Yes!
If F1 = C1 and F2 = C2 are constraints, then
F1 + F2 = C1 + C2 is too. However, it is redundant.

Completeness Theorem
U: our constraint set.
All linear constraints can be written as
linear combinations of the constraints in U.

23
Completeness w.r.t Inequalities

Have we missed any inequality constraint?
Yes!
If F = C, then F ≤ C + 0.2 is also valid
(redundant).

Completeness Theorem
Our constraint set is also complete in the
inequality sense.

24
Putting Them Together
Tools: LBFGS, TOMLAB, KNITRO, etc.
Background Knowledge -> Constraints on P(S | Q)
Published Data -> Constraints on P(S | Q)
Maximum Entropy Estimate -> Estimate P(S | Q)
Public Information
25
Inevitable Questions

Where do we get background knowledge?
Do we have to be very, very knowledgeable?

For P(s | q) type of knowledge:
All useful knowledge is in the original data set.
Association rules
Positive: Q ⇒ S
Negative: Q ⇒ ¬S, ¬Q ⇒ S, ¬Q ⇒ ¬S
Bound the knowledge in our study.
Top-K strongest association rules.

26
Knowledge about Individuals
Alice (i1, q1), Bob (i4, q2), Charlie (i9, q5)
Knowledge 1: Alice has either s1 or s4.
Constraint
Knowledge 2: Two people among Alice, Bob, and
Charlie have s4.
Constraint
27
Evaluation

Implementation
Lagrange multipliers
Constrained Optimization -> Unconstrained
Optimization
LBFGS: solving the unconstrained optimization
problem.
Pentium 3 GHz CPU with 4 GB memory.
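
The Lagrange-multiplier step can be sketched on a toy problem. For equality constraints A x = c, setting the gradient of the Lagrangian to zero gives x_i = exp(-1 - (A^T λ)_i), which leaves an unconstrained problem in λ. The sketch below is not the authors' implementation: it swaps LBFGS for plain gradient descent to stay dependency-free, and it assumes marginal (row and column sum) constraints on a single bucket with two QI values and two SA values. The matrix, the step size, and the function name are all illustrative.

```python
import math

def maxent_dual_descent(A, c, step=0.2, iters=20000):
    """Maximize -sum(x log x) subject to A x = c by gradient descent on the dual."""
    m, n = len(A), len(A[0])
    lam = [0.0] * m
    x = [0.0] * n
    for _ in range(iters):
        # Stationarity of the Lagrangian: x_i = exp(-1 - sum_j lam_j * A[j][i]).
        x = [math.exp(-1.0 - sum(lam[j] * A[j][i] for j in range(m))) for i in range(n)]
        # The dual g(lam) = sum(x) + lam . c has gradient c - A x; step downhill.
        grad = [c[j] - sum(A[j][i] * x[i] for i in range(n)) for j in range(m)]
        lam = [lam[j] - step * grad[j] for j in range(m)]
    return x

# Variable order: P(q1, s1), P(q1, s2), P(q2, s1), P(q2, s2).
A = [
    [1, 1, 0, 0],  # assumed marginal: P(q1) = 0.5
    [0, 0, 1, 1],  # assumed marginal: P(q2) = 0.5
    [1, 0, 1, 0],  # assumed marginal: P(s1) = 0.5
    [0, 1, 0, 1],  # assumed marginal: P(s2) = 0.5
    [1, 0, 0, 0],  # hypothetical background knowledge: P(q1, s1) = 0.4
]
c = [0.5, 0.5, 0.5, 0.5, 0.4]

x_free = maxent_dual_descent(A[:4], c[:4])  # published-data constraints only
x_known = maxent_dual_descent(A, c)         # plus the background knowledge
print([round(v, 4) for v in x_free])
print([round(v, 4) for v in x_known])
```

Without the last row the estimate is uniform (0.25 for every pair), matching the no-knowledge baseline; adding the knowledge row moves it toward 0.4, 0.1, 0.1, 0.4, the only point that satisfies all five constraints.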

28
Privacy versus Knowledge
Estimation Accuracy: KL distance between
P_MaxEnt(S | Q) and P_Original(S | Q).
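
A minimal sketch of the accuracy measure, for a single QI value. The slide does not say which distribution serves as the reference, so this example assumes the original distribution does; the numbers and the function name are made up for illustration.

```python
import math

def kl_distance(p, q):
    """KL divergence D(p || q) in nats; q must be positive wherever p is."""
    return sum(pi * math.log(pi / q[k]) for k, pi in p.items() if pi > 0)

# Hypothetical P(S | q) for one QI value q.
original = {"s1": 0.5, "s2": 0.5}
estimate_without_knowledge = {"s1": 0.25, "s2": 0.75}
estimate_with_knowledge = {"s1": 0.45, "s2": 0.55}

d_without = kl_distance(original, estimate_without_knowledge)
d_with = kl_distance(original, estimate_with_knowledge)
print(d_with < d_without)  # True
```

A smaller distance means the adversary's estimate is closer to the original data, that is, less privacy remains.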
29
Privacy versus # of QI attributes
30
Performance vs. Knowledge
31
Running Time vs. Data Size
32
Iteration vs. Data size
33
Conclusion