Title: PrivacyMaxEnt: Integrating Background Knowledge in Privacy Quantification
1Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification
- Wenliang (Kevin) Du,
- Zhouxuan Teng,
- and Zutao Zhu.
- Department of Electrical Engineering & Computer Science, Syracuse University, Syracuse, New York.
2Introduction
- Privacy-Preserving Data Publishing.
- The impact of background knowledge
- How does it affect privacy?
- How to measure its impact on privacy?
- Integrate background knowledge in privacy quantification.
- Privacy-MaxEnt: A systematic approach.
- Based on well-established theories.
- Evaluation.
3Privacy-Preserving Data Publishing
- Data disguise methods
- Randomization
- Generalization (e.g. Mondrian)
- Bucketization (e.g. Anatomy)
- Our Privacy-MaxEnt method can be applied to Generalization and Bucketization.
- We pick Bucketization in our presentation.
4Data Sets
Identifier
Quasi-Identifier (QI)
Sensitive Attribute (SA)
5Bucketized Data
Quasi-Identifier (QI)
Sensitive Attribute (SA)
P(Breast cancer | female, college, bucket 1) = 1/4
P(Breast cancer | female, junior, bucket 2) = 1/3
6Impact of Background Knowledge
- Background Knowledge
- It is rare for males to have breast cancer.
- This analysis is hard for large data sets.
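The effect of such knowledge on one small bucket can be checked by brute force. The sketch below is only an illustration, not part of the original study: the four records, their SA values, and the helper `posterior` are hypothetical. It enumerates every way of assigning the bucket's SA values to its records and discards the assignments that contradict the knowledge.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical bucket: QI values are published as-is, SA values are shuffled.
qi_records = [("male", "college"), ("male", "college"),
              ("female", "college"), ("female", "college")]
sa_values = ["breast cancer", "flu", "flu", "pneumonia"]

def posterior(target_idx, target_sa, allowed):
    """P(record target_idx has target_sa) over all assignments that pass `allowed`."""
    # The set removes duplicate orderings caused by repeated SA values.
    assignments = {p for p in permutations(sa_values)
                   if all(allowed(q, s) for q, s in zip(qi_records, p))}
    hits = sum(1 for p in assignments if p[target_idx] == target_sa)
    return Fraction(hits, len(assignments))

def no_knowledge(q, s):
    return True

def males_never_have_breast_cancer(q, s):
    return not (q[0] == "male" and s == "breast cancer")

# Record 2 is a (female, college) record.
p_before = posterior(2, "breast cancer", no_knowledge)
p_after = posterior(2, "breast cancer", males_never_have_breast_cancer)
print(p_before, p_after)
```

The number of assignments grows factorially with the bucket size, which is one reason this style of analysis does not scale to large data sets.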
7Previous Studies
- Martin et al., ICDE'07.
- First formal study on background knowledge
- Chen, LeFevre, and Ramakrishnan, VLDB'07.
- Improves the previous work.
- They deal with rule-based knowledge.
- Deterministic knowledge.
- Background knowledge can be much more complicated.
- Uncertain knowledge
8Complicated Background Knowledge
- Rule-based knowledge
- P(s | q) = 1.
- P(s | q) = 0.
- Probability-Based Knowledge
- P(s | q) = 0.2.
- P(s | Alice) = 0.2.
- Vague background knowledge
- 0.3 ≤ P(s | q) ≤ 0.5.
- Miscellaneous types
- P(s | q1) + P(s | q2) = 0.7
- One of Alice and Bob has Lung Cancer.
9Challenges
- How to analyze privacy in a systematic way for
large data sets and complicated background
knowledge?
- What do we want to compute?
- P(S | Q), given the background knowledge and the published data set.
- P(S | Q) is a primitive for most privacy metrics.
- Directly computing P(S | Q) is hard.
10Our Approach
Consider P(S | Q) as a variable x (a vector).
[Diagram: Background Knowledge → Constraints on x; Published Data → Constraints on x; Solve x → Most unbiased solution; Public Information]
11Maximum Entropy Principle
- "Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is the least biased estimate possible on the given information."
- by E. T. Jaynes, 1957.
12The MaxEnt Approach
[Diagram: Background Knowledge → Constraints on P(S | Q); Published Data → Constraints on P(S | Q); Maximum Entropy Estimate → Estimate P(S | Q); Public Information]
13Entropy
Because H(S | Q, B) = H(Q, S, B) - H(Q, B)
Constraints should use P(Q, S, B) as variables
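The identity is the chain rule for entropy. Written out, with sums taken over all (q, s, b) of nonzero probability:

```latex
\begin{aligned}
H(S \mid Q, B) &= -\sum_{q,s,b} P(q,s,b)\,\log P(s \mid q,b) \\
               &= -\sum_{q,s,b} P(q,s,b)\,\log \frac{P(q,s,b)}{P(q,b)} \\
               &= H(Q,S,B) - H(Q,B).
\end{aligned}
```

Because the published data reveal which QI values fall in which bucket, P(Q, B) and hence H(Q, B) are fixed, so maximizing H(S | Q, B) amounts to maximizing H(Q, S, B).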
14Maximum Entropy Estimate
- Let vector x = P(Q, S, B).
- Find the value for x that maximizes its entropy H(Q, S, B), while satisfying
- h1(x) = c1, ..., hu(x) = cu: equality constraints
- g1(x) ≤ d1, ..., gv(x) ≤ dv: inequality constraints
- A special case of Non-Linear Programming.
15Constraints from Knowledge
Background Knowledge
Constraints on P(Q, S, B)
- Linear model: quite generic.
- Conditional probability
- P(S | Q) = P(Q, S) / P(Q).
- Background knowledge has nothing to do with B
- P(Q, S) = P(Q, S, B1) + ... + P(Q, S, Bm).
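Combining the two identities above, a statement such as P(s | q) = 0.2 becomes a linear equation over the variables P(q, s, b). The following sketch shows one possible encoding; the function names and the dictionary representation are assumptions made for illustration, and P(q) is taken as known because QI values are published.

```python
from fractions import Fraction

def conditional_to_linear(q, s, prob, p_q, buckets):
    """Encode P(s | q) = prob as  sum_b P(q, s, b) = prob * P(q).

    Returns (coeffs, rhs): coeffs maps each variable (q, s, b) to its
    coefficient, rhs is the constant on the right-hand side.
    """
    coeffs = {(q, s, b): 1 for b in buckets}
    return coeffs, prob * p_q

def vague_to_linear(q, s, low, high, p_q, buckets):
    """Encode low <= P(s | q) <= high as lower and upper bounds on the same sum."""
    coeffs = {(q, s, b): 1 for b in buckets}
    return coeffs, low * p_q, high * p_q

# Hypothetical numbers: P(s1 | q1) = 1/5 and P(q1) = 3/10 over two buckets.
coeffs, rhs = conditional_to_linear(
    "q1", "s1", Fraction(1, 5), Fraction(3, 10), ["b1", "b2"])
_, lower, upper = vague_to_linear(
    "q1", "s1", Fraction(3, 10), Fraction(1, 2), Fraction(3, 10), ["b1", "b2"])
print(coeffs, rhs, lower, upper)
```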
16Constraints from Published Data
Published Data Set D
Constraints on P(Q, S, B)
- Constraints
- Truth and only the truth.
- Absolutely correct for the original data set.
- No inference.
17Assignment and Constraints
Observation: the original data is one of the possible assignments.
Constraint: must be true for all possible assignments.
18QI Constraint
Constraint
Example
19SA Constraint
Constraint
Example
20Zero Constraint
- P(q, s, b) = 0, if q or s does not appear in Bucket b.
- We can reduce the number of variables.
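The three kinds of constraints can be generated mechanically from a bucketized table. The sketch below assumes that the QI and SA constraints fix the per-bucket marginals P(q, b) and P(s, b), which the published table reveals; the table contents and the function `build_constraints` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical bucketized table: bucket -> (QI values, SA values), one entry per record.
buckets = {
    "b1": (["q1", "q1", "q2"], ["s1", "s2", "s2"]),
    "b2": (["q2", "q3"], ["s1", "s3"]),
}
n_records = sum(len(qis) for qis, _ in buckets.values())

def build_constraints(buckets, n):
    """Return (variables, qi_constraints, sa_constraints).

    Each constraint is (list of variables, rhs), meaning the listed
    variables P(q, s, b) must sum to rhs.
    """
    variables, qi_cons, sa_cons = [], [], []
    for b, (qis, sas) in buckets.items():
        q_count, s_count = Counter(qis), Counter(sas)
        # Zero constraint: only pairs (q, s) that both occur in bucket b get a variable.
        variables += [(q, s, b) for q in q_count for s in s_count]
        for q, c in q_count.items():   # sum over s of P(q, s, b) = P(q, b)
            qi_cons.append(([(q, s, b) for s in s_count], Fraction(c, n)))
        for s, c in s_count.items():   # sum over q of P(q, s, b) = P(s, b)
            sa_cons.append(([(q, s, b) for q in q_count], Fraction(c, n)))
    return variables, qi_cons, sa_cons

variables, qi_cons, sa_cons = build_constraints(buckets, n_records)
print(len(variables), len(qi_cons), len(sa_cons))
```

Without the zero constraint this table would need 3 x 3 x 2 = 18 variables; only 8 remain.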
21Theoretic Properties
- Soundness: Are they correct?
- Easy to prove.
- Completeness: Have we missed any constraint?
- See our theorems and proofs.
- Conciseness: Are there redundant constraints?
- Only one redundant constraint in each bucket.
- Consistency: Is our approach consistent with the existing methods (i.e., when background knowledge is Ø)?
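The conciseness claim can be checked numerically on a toy bucket. The sketch below assumes a bucket with two QI values and two SA values, where each QI or SA constraint sums the matching variables P(q, s, b); it computes the rank of the coefficient matrix with exact arithmetic. The `rank` helper is written for this illustration.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix via Gauss-Jordan elimination over exact fractions."""
    rows = [[Fraction(v) for v in r] for r in rows]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                f = rows[i][col] / rows[r][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

# Variable order: (q1,s1), (q1,s2), (q2,s1), (q2,s2), all in one bucket.
constraints = [
    [1, 1, 0, 0],  # QI constraint for q1
    [0, 0, 1, 1],  # QI constraint for q2
    [1, 0, 1, 0],  # SA constraint for s1
    [0, 1, 0, 1],  # SA constraint for s2
]
print(rank(constraints))
```

Four constraints have rank three: the QI constraints and the SA constraints of a bucket both sum to P(b), so exactly one of them is redundant.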
22Completeness w.r.t Equations
- Have we missed any equality constraint?
- Yes!
- If F1 = C1 and F2 = C2 are constraints, F1 + F2 = C1 + C2 is too. However, it is redundant.
- Completeness Theorem
- U: our constraint set.
- All linear constraints can be written as linear combinations of the constraints in U.
23Completeness w.r.t Inequalities
- Have we missed any inequality constraint?
- Yes!
- If F = C, then F ≤ C + 0.2 is also valid (redundant).
- Completeness Theorem
- Our constraint set is also complete in the
inequality sense.
24Putting Them Together
Tools: LBFGS, TOMLAB, KNITRO, etc.
[Diagram: Background Knowledge → Constraints on P(S | Q); Published Data → Constraints on P(S | Q); Maximum Entropy Estimate → Estimate P(S | Q); Public Information]
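The whole pipeline can be tried end to end on a toy bucket. The sketch below does not use any of the tools named on this slide; it swaps in a plain iterative-scaling loop (repeated rescaling from the uniform distribution), which handles constraints of the form "these variables sum to c". The bucket contents, the function `maxent_by_scaling`, and the sweep count are assumptions for illustration.

```python
def maxent_by_scaling(variables, constraints, sweeps=500):
    """Approximate the maximum entropy x subject to sum(x[v] for v in A) = c.

    Starts from the uniform distribution and repeatedly rescales x so that
    one constraint at a time holds exactly while the total stays at 1.
    """
    x = {v: 1.0 / len(variables) for v in variables}
    prepared = [(set(members), target) for members, target in constraints]
    for _ in range(sweeps):
        for members, target in prepared:
            inside = sum(x[v] for v in members)
            if inside <= 0.0 or inside >= 1.0:
                continue  # nothing left to rescale for this constraint
            for v in x:
                if v in members:
                    x[v] *= target / inside
                else:
                    x[v] *= (1.0 - target) / (1.0 - inside)
    return x

# Hypothetical bucket of 4 records: 2 male, 2 female;
# SA values: 1 breast cancer, 2 flu, 1 pneumonia.
genders = ["male", "female"]
diseases = ["breast cancer", "flu", "pneumonia"]
variables = [(g, d) for g in genders for d in diseases]

qi = [([(g, d) for d in diseases], 0.5) for g in genders]
sa = [([(g, d) for g in genders], p)
      for d, p in [("breast cancer", 0.25), ("flu", 0.5), ("pneumonia", 0.25)]]
knowledge = [([("male", "breast cancer")], 0.0)]  # P(breast cancer | male) = 0

def p_bc_given_female(x):
    return x[("female", "breast cancer")] / sum(x[("female", d)] for d in diseases)

without = p_bc_given_female(maxent_by_scaling(variables, qi + sa))
with_k = p_bc_given_female(maxent_by_scaling(variables, qi + sa + knowledge))
print(round(without, 4), round(with_k, 4))
```

With only the published-data constraints the estimate of P(breast cancer | female) stays at the bucket frequency; adding the knowledge about males raises it. A general-purpose solver is needed once inequality constraints or constraints with arbitrary coefficients appear.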
25Inevitable Questions
- Where do we get background knowledge?
- Do we have to be very very knowledgeable?
- For P(s | q) type of knowledge
- All useful knowledge is in the original data set.
- Association rules
- Positive: Q ⇒ S
- Negative: Q ⇒ ¬S, ¬Q ⇒ S, ¬Q ⇒ ¬S
- Bound the knowledge in our study.
- Top-K strongest association rules.
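Bounding the adversary's knowledge by the Top-K rules can be sketched as follows. This is an illustration only: it ranks rules by confidence alone, covers just the forms Q ⇒ S and Q ⇒ ¬S, and uses hypothetical records; how rule strength is actually measured is an assumption here.

```python
from collections import Counter

def top_k_rules(records, k):
    """Rank rules q => s (positive) and q => not s (negative) by confidence."""
    q_count = Counter(q for q, _ in records)
    qs_count = Counter(records)
    sa_values = sorted({s for _, s in records})
    rules = []
    for q in sorted(q_count):
        for s in sa_values:
            conf = qs_count[(q, s)] / q_count[q]  # estimate of P(s | q)
            rules.append((conf, "positive", q, s))
            rules.append((1 - conf, "negative", q, s))
    rules.sort(key=lambda r: r[0], reverse=True)
    return rules[:k]

# Hypothetical original data: (QI value, SA value) per record.
records = ([("male", "flu")] * 3
           + [("female", "breast cancer")] * 2
           + [("female", "flu")])
strongest = top_k_rules(records, 2)
print(strongest)
```

Each selected rule then turns into one constraint of the form P(s | q) = confidence for a positive rule, or P(s | q) = 1 - confidence for a negative one.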
26Knowledge about Individuals
Alice: (i1, q1); Bob: (i4, q2); Charlie: (i9, q5)
Knowledge 1: Alice has either s1 or s4.
Constraint
Knowledge 2: Two people among Alice, Bob, and Charlie have s4.
Constraint
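One natural way to write the two statements above as linear constraints, assuming per-individual variables of the form P(s | i, q) and reading "two people" as exactly two, is:

```latex
% Alice has either s1 or s4
P(s_1 \mid i_1, q_1) + P(s_4 \mid i_1, q_1) = 1
% Exactly two of Alice, Bob, and Charlie have s4
P(s_4 \mid i_1, q_1) + P(s_4 \mid i_4, q_2) + P(s_4 \mid i_9, q_5) = 2
```

Both are linear in the variables, so they fit the same constraint model as the other kinds of knowledge.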
27Evaluation
- Implementation
- Lagrange multipliers
- Constrained Optimization → Unconstrained Optimization
- LBFGS: solving the unconstrained optimization problem.
- Pentium 3 GHz CPU with 4 GB memory.
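The step from constrained to unconstrained optimization can be written out for the case of linear equality constraints only. The notation here is an assumption: each constraint j is written as sum_i a_{ji} x_i = c_j, and inequality constraints are left out.

```latex
\begin{aligned}
\mathcal{L}(x, \lambda) &= -\sum_i x_i \log x_i
    + \sum_j \lambda_j \Bigl( c_j - \sum_i a_{ji} x_i \Bigr) \\
\frac{\partial \mathcal{L}}{\partial x_i} = 0
    \;&\Rightarrow\; x_i(\lambda) = \exp\Bigl( -1 - \sum_j \lambda_j a_{ji} \Bigr) \\
D(\lambda) &= \sum_i \exp\Bigl( -1 - \sum_j \lambda_j a_{ji} \Bigr) + \sum_j \lambda_j c_j \\
\frac{\partial D}{\partial \lambda_j} &= c_j - \sum_i a_{ji}\, x_i(\lambda)
\end{aligned}
```

D is convex in λ and has no constraints, so a quasi-Newton method such as L-BFGS can minimize it; its gradient vanishes exactly when every constraint is satisfied.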
28Privacy versus Knowledge
Estimation Accuracy: KL distance between P(MaxEnt)(S | Q) and P(Original)(S | Q).
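A minimal sketch of this accuracy measure follows. The direction of the divergence (original relative to estimate), the base-2 logarithm, and the weighting of each QI value by P(q) are assumptions; the distributions are hypothetical.

```python
import math

def kl_distance(p, q):
    """KL distance D(p || q) in bits between two distributions given as dicts.

    Assumes q[k] > 0 wherever p[k] > 0.
    """
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def estimation_error(p_q, original, estimate):
    """Average KL distance over QI values, weighted by P(q) (an assumed weighting)."""
    return sum(p_q[q] * kl_distance(original[q], estimate[q]) for q in p_q)

# Hypothetical conditional distributions P(S | q) for two QI values.
p_q = {"q1": 0.5, "q2": 0.5}
original = {"q1": {"s1": 0.5, "s2": 0.5}, "q2": {"s1": 0.5, "s2": 0.5}}
estimate = {"q1": {"s1": 0.25, "s2": 0.75}, "q2": {"s1": 0.5, "s2": 0.5}}
error = estimation_error(p_q, original, estimate)
print(error)
```

A distance of zero means the estimate reproduces the original conditional distribution exactly; larger values mean the adversary's estimate is further from the truth.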
29Privacy versus Number of QI attributes
30Performance vs. Knowledge
31Running Time vs. Data Size
32Iteration vs. Data size
33Conclusion
- Privacy-MaxEnt is a systematic method
- Model various types of knowledge
- Model the information from the published data
- Based on well-established theory.
- Future work
- Reducing the number of constraints
- Vague background knowledge
- Background knowledge about individuals