Title: Random Data Perturbation Techniques and Privacy Preserving Data Mining
1. Random Data Perturbation Techniques and Privacy Preserving Data Mining
(Authors: H. Kargupta, S. Datta, Q. Wang, K. Sivakumar)
April 26, 2005
2. Privacy and Good Service: Often Conflicting Goals
- Privacy
  - Customer: I don't want you to share my personal information with anyone.
  - Business: I don't want to share my data with a competitor.
- Quantity, Cost and Quality of Service
  - Customer: I want you to provide me good quality of service, and at lower cost.
- Paradox: lower cost often comes from being able to use/share sensitive data that can be used or misused:
  - Provide better service by predicting consumer needs better, or sell information to marketers.
  - Optimize load sharing between competing utilities, or preempt competition.
  - A doctor saving a patient by knowing the patient's history, or insurance companies declining coverage to individuals with preexisting conditions.
3. Central Question
Can we use privacy-sensitive data to optimize the cost and quality of a service without compromising any privacy?
4. Short Answer: No!
5. Long Answer: Maybe. Compromise a small amount of privacy (low cost increase) to improve the quality and cost of service substantially (high cost savings).
6. Why are anonymous exact records not so secure?
- Example: medical insurance premium estimation based on patient history.
- Predictive fields are often generic: age, sex, disease history, first two digits of zip code (not allowed in Germany), no. of kids, etc.
- Specifics such as record id (key), name and address are omitted.
- This could be easily broken by matching non-secure records with secure anonymous records.
[Figure: linking public sources to anonymous records]
- Yellow pages: Susan Calvin, 121 Norwood Cr., Austin, TX-78753
- Anonymous privacy preserving records:
  - Female, 43, 3 kids, 78---, married (anonymous medical record 1)
  - Female, 43, 2 kids, 78---, single (anonymous medical record 2)
- Personal website: "Hi, I am Susan, and here are pictures of me, my husband, and my 3 wonderful kids from my 43rd birthday party!"
- Internal human / automated hacker
- Broken exact record: Susan Calvin, 43, 3 kids, address, 78733, now labeled med. records!
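The matching step on this slide can be sketched in a few lines of Python. This is only an illustration: the field names and the `link` helper are hypothetical, and the records are the toy values from the slide.

```python
# Toy linkage attack: join public facts with "anonymous" records on
# quasi-identifiers. All field names here are illustrative assumptions.
anonymous_records = [
    {"sex": "F", "age": 43, "kids": 3, "zip_prefix": "78",
     "marital": "married", "record": "medical record 1"},
    {"sex": "F", "age": 43, "kids": 2, "zip_prefix": "78",
     "marital": "single", "record": "medical record 2"},
]

# Facts gathered from the yellow pages and the personal website.
public_profile = {"name": "Susan Calvin", "sex": "F", "age": 43,
                  "kids": 3, "zip": "78753", "marital": "married"}


def link(profile, records):
    """Return the anonymous records whose quasi-identifiers all match."""
    return [
        rec for rec in records
        if rec["sex"] == profile["sex"]
        and rec["age"] == profile["age"]
        and rec["kids"] == profile["kids"]
        and rec["marital"] == profile["marital"]
        and profile["zip"].startswith(rec["zip_prefix"])
    ]


matches = link(public_profile, anonymous_records)
for rec in matches:
    print(public_profile["name"], "->", rec["record"])
```

Only one anonymous record survives the join, so the "anonymous" medical record is now labeled with a name.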
7. Two approaches to Privacy Preserving
- Distributed
  - Suitable for multi-party platforms. Share sub-models.
  - Unsupervised: Ensemble Clustering, Privacy Preserving Clustering, etc.
  - Supervised: Meta-learners, Fourier Spectrum Decision Trees, Collective Hierarchical Clustering, and so on.
  - Secure communication based: secure sum, secure scalar product.
- Random Data Perturbation (our focus)
  - Perturb data by small amounts to protect the privacy of individual records.
  - Preserve the intrinsic distributions necessary for modeling.
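A minimal sketch of the random perturbation idea, assuming additive uniform noise from a publicly known range (the data, the noise range and all variable names are made up for illustration): individual values move noticeably, while the mean and the noise-corrected variance stay close to the originals.

```python
import random
import statistics

random.seed(0)

# Hypothetical original values U (e.g. ages) and public noise V ~ Uniform(-10, 10).
original = [random.gauss(45, 12) for _ in range(20000)]
noise = [random.uniform(-10, 10) for _ in original]
perturbed = [u + v for u, v in zip(original, noise)]

# Individual records are distorted ...
mean_abs_shift = statistics.fmean(abs(v) for v in noise)

# ... but aggregate statistics survive. For independent U and V,
# Var(U + V) = Var(U) + Var(V), and Var(Uniform(-a, a)) = a^2 / 3.
mean_gap = abs(statistics.fmean(perturbed) - statistics.fmean(original))
noise_var = 10 ** 2 / 3
var_estimate = statistics.pvariance(perturbed) - noise_var
var_gap = abs(var_estimate - statistics.pvariance(original))

print(f"mean |shift| per record: {mean_abs_shift:.2f}")
print(f"gap in mean: {mean_gap:.3f}, gap in corrected variance: {var_gap:.3f}")
```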
8. Recovering approximately correct anonymous features also breaks privacy
- Somewhat inexactly recovered anonymous record values might also be sufficient.
[Figure: linking public sources to denoised records]
- Yellow pages: Susan Calvin, 121 Norwood Cr., Austin, TX-78753
- Denoised privacy preserving records:
  - Female, 44.5, 3.2 kids, 78---, married (anonymous medical record 1)
  - Female, 42.2, 2.1 kids, 78---, single (anonymous medical record 2)
- Personal website: "Hi, I am Susan, and here are pictures of me, my husband, and my 3 wonderful kids from my 43rd birthday party!"
- Internal human / automated hacker
- Broken exact record: Susan Calvin, 43, 3 kids, address, 78733, now labeled med. records!
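With inexact values, exact matching no longer works, but a nearest-match search can still single out a record. The sketch below is illustrative only: the `distance` function and its scaling of age by 5 years are arbitrary assumptions, not something taken from the paper.

```python
# Nearest-match linkage against denoised (approximately recovered) records.
denoised_records = [
    {"sex": "F", "age": 44.5, "kids": 3.2, "marital": "married",
     "record": "medical record 1"},
    {"sex": "F", "age": 42.2, "kids": 2.1, "marital": "single",
     "record": "medical record 2"},
]
public_profile = {"name": "Susan Calvin", "sex": "F", "age": 43,
                  "kids": 3, "marital": "married"}


def distance(profile, rec):
    """Hypothetical dissimilarity: categorical fields must agree exactly,
    numeric fields contribute their absolute difference."""
    if rec["sex"] != profile["sex"] or rec["marital"] != profile["marital"]:
        return float("inf")
    return abs(rec["age"] - profile["age"]) / 5.0 + abs(rec["kids"] - profile["kids"])


best = min(denoised_records, key=lambda rec: distance(public_profile, rec))
print(public_profile["name"], "->", best["record"])
```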
9. Anonymous records (with or without small perturbations) are not secure: not a recently noticed phenomenon
- 1979, Denning & Denning: "The Tracker: A Threat to Statistical Database Security"
  - Show why anonymous records are not secure.
  - Show an example of recovering the exact salary of a professor from anonymous records.
  - Present a general algorithm for an Individual Tracker.
  - A formal probabilistic model and set of conditions that make a dataset support such a tracker.
- 1984, Traub & Yemini: "The Statistical Security of a Statistical Database"
  - No free lunch: perturbations cause irrecoverable loss in model accuracy.
  - However, the holy grail of random perturbation: we can try to find a perturbation algorithm that best trades off loss of privacy against model accuracy.
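The salary example can be sketched as an individual-tracker style attack. Everything below is a toy assumption: the staff table, the size threshold `K` and the `query_sum` interface are hypothetical. The database refuses queries over very small or very large groups, yet two permitted queries differ by exactly one person.

```python
# Toy statistical database that only answers SUM queries over groups of
# permitted size. Table contents and the threshold K are illustrative.
records = [
    {"dept": "CS",   "rank": "professor", "sex": "F", "salary": 131},
    {"dept": "CS",   "rank": "professor", "sex": "M", "salary": 118},
    {"dept": "CS",   "rank": "professor", "sex": "M", "salary": 124},
    {"dept": "CS",   "rank": "lecturer",  "sex": "F", "salary": 70},
    {"dept": "Math", "rank": "professor", "sex": "F", "salary": 115},
    {"dept": "Math", "rank": "professor", "sex": "M", "salary": 121},
    {"dept": "Math", "rank": "lecturer",  "sex": "M", "salary": 68},
    {"dept": "Math", "rank": "lecturer",  "sex": "F", "salary": 72},
]
K = 2  # query sets with fewer than K or more than N - K rows are refused


def query_sum(predicate):
    rows = [r for r in records if predicate(r)]
    if not (K <= len(rows) <= len(records) - K):
        raise PermissionError("query set size not allowed")
    return sum(r["salary"] for r in rows)


def is_cs_prof(r):
    return r["dept"] == "CS" and r["rank"] == "professor"


def is_female(r):
    return r["sex"] == "F"


# Asking directly about the single female CS professor is refused.
try:
    query_sum(lambda r: is_cs_prof(r) and is_female(r))
    direct_allowed = True
except PermissionError:
    direct_allowed = False

# Tracker: target = A and B, so sum(target) = sum(A) - sum(A and not B).
recovered = query_sum(is_cs_prof) - query_sum(
    lambda r: is_cs_prof(r) and not is_female(r))
print("direct query allowed:", direct_allowed, "| recovered salary:", recovered)
```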
10. Recovering perturbed distributions: Earlier work
- Reconstructing the original distribution from perturbed ones. Setup:
  - n samples $U_1, U_2, U_3, \ldots, U_n$
  - n noise values $V_1, V_2, V_3, \ldots, V_n$, all taken from a public (known) distribution V.
  - Visible noisy data $W_1 = U_1 + V_1$, $W_2 = U_2 + V_2$, ...
  - Assumption: such noise can allow you to recover the distribution of $U_1, U_2, U_3, \ldots, U_n$, but not the individual records.
- Two well-known methods and definitions:
  - Agrawal & Srikant, interval based: Privacy(X) at confidence 0.95 $= X_2 - X_1$
  - Agrawal & Aggarwal, distributional: Privacy(X) $= 2^{h(X)}$
[Figure: density plots of f(x), with the interval endpoints X1 and X2 marked.]
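As a rough numerical illustration of the two definitions, the sketch below evaluates both for uniform and Gaussian distributions. It assumes the interval measure is the width of the central interval holding 95% of the probability mass, and that h is differential entropy in bits; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def interval_privacy_uniform(a, confidence=0.95):
    """Width of the central interval of Uniform(-a, a) holding `confidence` mass."""
    return confidence * 2 * a


def interval_privacy_gaussian(sigma, confidence=0.95):
    """Width of the central interval of N(0, sigma^2) holding `confidence` mass."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return 2 * z * sigma


def entropy_privacy_uniform(a):
    """2^h for Uniform(-a, a), where h = log2(2a)."""
    return 2 ** math.log2(2 * a)


def entropy_privacy_gaussian(sigma):
    """2^h for N(0, sigma^2), where h = 0.5 * log2(2 * pi * e * sigma^2)."""
    return 2 ** (0.5 * math.log2(2 * math.pi * math.e * sigma ** 2))


print(interval_privacy_uniform(1.0), interval_privacy_gaussian(1.0))
print(entropy_privacy_uniform(1.0), entropy_privacy_gaussian(1.0))
```

Under these assumptions the entropy-based measure of a uniform distribution is simply the length of its support, which makes the two definitions easy to compare.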
11. Interval Based Method: Agrawal & Srikant in more detail
- n samples $U_1, U_2, U_3, \ldots, U_n$
- n noise values $V_1, V_2, V_3, \ldots, V_n$, all taken from a public (known) distribution V.
- $W_1 = U_1 + V_1$, $W_2 = U_2 + V_2$, ...
- Visible noisy data: $W_1, W_2, W_3, \ldots$
- Given the noise density $f_V$, Bayes' rule gives the cumulative posterior distribution function of u in terms of the visible $w_i$, $f_V$, and the unknown desired density $f_U$.
Differentiating w.r.t. u, we get an important recursive definition.
Notation issue (in paper): $f'$ simply means an approximation of the true $f$, not the derivative of $f$!
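Written in this slide's notation, the two expressions referred to above presumably take the following form. This is a reconstruction from the standard Bayes argument, with $f_V$ the known noise density, $f_U$ the unknown density and $w_i$ the observed values, and it should be checked against the paper.

```latex
% Posterior CDF of u, averaged over the n visible observations w_i
F'_U(u) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{\int_{-\infty}^{u} f_V(w_i - z)\, f_U(z)\, dz}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}

% Differentiating with respect to u
f'_U(u) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{f_V(w_i - u)\, f_U(u)}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}
```

The definition is recursive because the unknown $f_U$ appears on the right-hand side, which is what motivates an iterative scheme.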
12. Interval Based Method: Agrawal & Srikant in more detail
Algorithm in practice:
- Seed with a uniform distribution for J = 0.
- Go from STEP J to STEP J + 1:
  - replaced integration with summation over i.i.d. samples
  - sum over discrete z intervals instead of an integral, for speed
- Converges to a local minimum? An initialization other than uniform might give a different result. Not explored by the authors.
- For large enough samples, hope to get close to the true distribution.
- Stop when $f_U^{(J+1)} - f_U^{(J)}$ becomes small.
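The iteration can be sketched with NumPy as below. This is a minimal sketch rather than the authors' implementation: the grid, the L1 stopping tolerance, the Gaussian noise and the bimodal test data are all assumptions chosen for illustration.

```python
import numpy as np


def reconstruct_density(w, noise_pdf, grid, max_iter=200, tol=1e-4):
    """Iteratively estimate the density of U on `grid` from samples w = u + v."""
    dz = grid[1] - grid[0]
    f = np.full(grid.size, 1.0 / (grid.size * dz))      # uniform seed (J = 0)
    kernel = noise_pdf(w[:, None] - grid[None, :])      # kernel[i, k] = f_V(w_i - z_k)
    for _ in range(max_iter):
        # Discrete version of the integral of f_V(w_i - z) f_U(z) dz, one per sample.
        denom = np.maximum((kernel @ f) * dz, 1e-300)
        new_f = f * (kernel / denom[:, None]).mean(axis=0)
        new_f /= new_f.sum() * dz                       # keep it a density
        converged = np.abs(new_f - f).sum() * dz < tol  # small change between steps
        f = new_f
        if converged:
            break
    return f


def gaussian_pdf(x, sigma=1.0):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


rng = np.random.default_rng(0)
u = np.concatenate([rng.normal(-2.0, 0.5, 1000), rng.normal(2.0, 0.5, 1000)])
w = u + rng.normal(0.0, 1.0, u.size)                    # visible noisy data
grid = np.linspace(-6.0, 6.0, 121)                      # step 0.1
f_est = reconstruct_density(w, gaussian_pdf, grid)

dz = grid[1] - grid[0]
print("mass near -2:", f_est[30:51].sum() * dz)
print("mass near  0:", f_est[55:66].sum() * dz)
print("mass near +2:", f_est[70:91].sum() * dz)
```

Note that only the distribution is estimated here; no individual $u_i$ is recovered by this procedure.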
13. Interval Based Method: Good results for a variety of noises
14. Revisiting an Essential Assumption in Random Perturbation
Assumption: such noise can allow you to recover the distribution of $U_1, U_2, U_3, \ldots, U_n$, but not the individual records.
- The authors of this paper challenge this assumption.
- Claim: the randomness added can be mostly visual and not real.
- Many simple forms of random perturbation are breakable.
15. Exploit predictable properties of random data to design a filter to break the perturbation encryption?
[Figure: panels "Spiral data" and "Random data"; annotation: "All eigenvalues close to 1!"]
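The contrast in the figure can be reproduced on synthetic data. In the sketch below, the 2-D spiral embedded in 10 features and the unit-variance Gaussian noise are assumptions made for illustration, not the data behind the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 10

# Pure noise: i.i.d. unit-variance entries.
noise = rng.normal(size=(m, n))
noise_eigs = np.linalg.eigvalsh(noise.T @ noise / m)

# Structured data: a 2-D spiral linearly embedded in n features.
t = rng.uniform(0.0, 4.0 * np.pi, m)
spiral = np.column_stack([t * np.cos(t), t * np.sin(t)])
data = spiral @ rng.normal(size=(2, n))
data_eigs = np.linalg.eigvalsh(np.cov(data, rowvar=False))  # ascending order

print("noise eigenvalues:", np.round(noise_eigs, 3))
print("data eigenvalues: ", np.round(data_eigs, 3))
```

The noise spectrum clusters tightly around 1, while the structured data concentrates its variance in two eigenvalues, which is the property a spectral filter can exploit.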
16. Spectral Filtering: Main Idea
Use the eigenvalue properties of noise to filter:
- U + V data
- Decomposition of the eigenvalues of noise and original data
- Recovered data
17. Decomposing eigenvalues: separating data from noise
Let $U$ and $V$ be the $m \times n$ data and noise matrices, and $U_P$ the perturbed matrix: $U_P = U + V$.
Covariance matrix of $U_P$:
$$U_P^T U_P = (U + V)^T (U + V) = U^T U + V^T U + U^T V + V^T V$$
Since signal and noise are uncorrelated in random perturbation, for a large no. of observations $V^T U \approx 0$ and $U^T V \approx 0$, therefore
$$U_P^T U_P \approx U^T U + V^T V$$
Since the above 3 matrices are correlation matrices, they are symmetric and positive semi-definite; therefore, we can perform an eigendecomposition of each.
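A quick numerical check of the "cross terms vanish" step, under the assumption of correlated Gaussian data and independent unit-variance noise (the helper name and the sizes are arbitrary): the cross term $V^T U$ shrinks relative to $U^T U$ as the number of observations m grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
mixing = rng.normal(size=(n, n))      # makes the n features correlated


def relative_cross_term(m):
    """Frobenius norm of V^T U relative to U^T U for m observations."""
    U = rng.normal(size=(m, n)) @ mixing
    V = rng.normal(size=(m, n))       # independent noise
    return np.linalg.norm(V.T @ U) / np.linalg.norm(U.T @ U)


rel_small = relative_cross_term(100)
rel_large = relative_cross_term(100_000)
print(f"m = 100:     {rel_small:.4f}")
print(f"m = 100000:  {rel_large:.4f}")
```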
18. With a bunch of algebra and theorems from matrix perturbation theory, the authors show that in the limit (lots of data):
- Wigner's law: describes the distribution of eigenvalues for normal random matrices.
- Eigenvalues of the noise component V stick in a thin range given by $\lambda_{min}$ and $\lambda_{max}$ (example on the next page) with high probability.
- Allows us to compute $\lambda_{min}$ and $\lambda_{max}$. Solution!
- Giving us the following algorithm:
  - Find a large no. of eigenvalues of the perturbed data $U_P$.
  - Separate all eigenvalues inside $[\lambda_{min}, \lambda_{max}]$ and save their row indices $I_V$.
  - Take the remaining eigen indices to get the perturbed but not noise eigenvalues coming from the true data U; save their row indices $I_U$.
  - Break the perturbed eigenvector matrix $Q_P$ into $A_U = Q_P(I_U)$ and $A_V = Q_P(I_V)$.
  - Estimate the true data as the projection of $U_P$ onto the subspace spanned by $A_U$.
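The steps above can be sketched as follows. Several things here are assumptions rather than facts from the slide: the bounds $\lambda_{min,max} = \sigma^2 (1 \mp 1/\sqrt{Q})^2$ with $Q = m/n$, the projection formula $\hat{U} = U_P A_U A_U^T$, and the simplification of keeping only eigenvalues above $\lambda_{max}$ instead of everything outside the noise band. The data is synthetic.

```python
import numpy as np


def spectral_filter(W, noise_var):
    """Estimate U from W = U + V by projecting onto non-noise eigenvectors."""
    m, n = W.shape
    mean = W.mean(axis=0)
    Wc = W - mean
    eigvals, eigvecs = np.linalg.eigh(Wc.T @ Wc / m)   # covariance of perturbed data
    Q = m / n
    lam_min = noise_var * (1 - 1 / np.sqrt(Q)) ** 2    # assumed noise bounds
    lam_max = noise_var * (1 + 1 / np.sqrt(Q)) ** 2
    signal = eigvals > lam_max                         # indices I_U (simplified rule)
    A_u = eigvecs[:, signal]
    U_hat = Wc @ A_u @ A_u.T + mean                    # projection onto span(A_U)
    return U_hat, int(signal.sum()), (lam_min, lam_max)


rng = np.random.default_rng(0)
m, n, sigma = 2000, 20, 1.0
U = rng.normal(size=(m, 3)) @ rng.normal(size=(3, n))  # correlated, rank-3 data
V = rng.normal(scale=sigma, size=(m, n))               # public noise model
W = U + V

U_hat, kept, bounds = spectral_filter(W, sigma ** 2)
mse_noisy = np.mean((W - U) ** 2)
mse_filtered = np.mean((U_hat - U) ** 2)
print("noise band:", bounds, "| eigenvectors kept:", kept)
print(f"MSE of perturbed data: {mse_noisy:.3f}, after filtering: {mse_filtered:.3f}")
```

In this setup the filtered estimate is much closer to the original records than the perturbed data is, which is the sense in which the perturbation is "broken".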
19. Exploit predictable properties of random data to design a filter to break the perturbation encryption?
[Figure: panels "Spiral data" and "Random data"; annotation: "All eigenvalues close to 1!"]
20. Results: Quality of eigenvalue recovery
Only the real eigenvalues got captured, because of the nice automatic thresholding!
21. Results: Comparison with Agrawal & Srikant's reproduction
[Figure panels: "Agrawal & Srikant (no breaking of encryption)" and "Agrawal & Srikant (estimated from broken encryption)"]
22. Discussion
- Amazing amount of experimental results and comparisons presented by the authors in the journal version.
- Extension to the situation where the form of the perturbing distribution is known, but its exact first-, second- or higher-order statistics are not: discussed but not presented.
- Comparison of performance with other obvious noise-reduction techniques from the signal processing community:
  - Moving averages and Wiener filtering.
  - PCA-based filtering.
- Pros and cons of the perturbation analysis by the authors (and in general):
  - Effect of more and more noise: rapid degradation of results.
  - Problem in dealing with inherent noise in the original data.
  - Technique fails when features are independent of each other, because it exploits the covariance matrix. Points to a major improvement possibility in encryption: perform ICA/PCA and then randomize?
  - Results suggest that more complex noise models might be harder to break. Not clear whether this improves the privacy vs. model-quality tradeoff.
  - Eigendecomposition has an inherent metric assumption?
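The independence failure mode can be illustrated with a small experiment. The sketch assumes the same noise bound $\lambda_{max} = \sigma^2 (1 + 1/\sqrt{Q})^2$, $Q = m/n$, and a simplified filter that keeps eigenvectors whose eigenvalue exceeds it; data and sizes are made up. With independent, high-variance features every eigenvalue clears the threshold, so the filter keeps all directions and removes no noise.

```python
import numpy as np


def keep_signal_directions(W, noise_var):
    """Simplified spectral filter: project onto eigenvectors above lambda_max."""
    m, n = W.shape
    mean = W.mean(axis=0)
    Wc = W - mean
    eigvals, eigvecs = np.linalg.eigh(Wc.T @ Wc / m)
    lam_max = noise_var * (1 + 1 / np.sqrt(m / n)) ** 2  # assumed noise bound
    A_u = eigvecs[:, eigvals > lam_max]
    return Wc @ A_u @ A_u.T + mean, A_u.shape[1]


rng = np.random.default_rng(0)
m, n = 5000, 10
U = rng.normal(scale=5.0, size=(m, n))   # independent features, variance 25
V = rng.normal(scale=1.0, size=(m, n))   # unit-variance noise
W = U + V

U_hat, kept = keep_signal_directions(W, noise_var=1.0)
print("eigenvectors kept:", kept, "of", n)
print("filter changed the data:", not np.allclose(U_hat, W))
```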
23. A not-so-ominous application of noise filtering: Nulling Interferometer on Terrestrial Planet Finder-I