Random Data Perturbation Techniques and Privacy Preserving Data Mining presentation


Title: Random Data Perturbation Techniques and Privacy Preserving Data Mining

1
Random Data Perturbation Techniques and Privacy
Preserving Data Mining
(Authors: H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar)
April 26, 2005

Gunjan Gupta

2
Privacy & Good Service: Often Conflicting Goals

Privacy
Customer: I don't want you to share my personal information with anyone.
Business: I don't want to share my data with a competitor.
Quantity, Cost & Quality of Service
Customer: I want you to provide me good quality of service, and at lower cost.
Paradox: lower cost often comes from being able to use or share sensitive data, which can be used or misused:
Provide better service by predicting consumer needs better, or sell information to marketers.
Optimize load sharing between competing utilities, or preempt competition.
A doctor saving a patient by knowing the patient's history, or insurance companies declining coverage to individuals with preexisting conditions.

3
Central Question
Can we use privacy-sensitive data to optimize the cost and quality of a service without compromising any privacy?
4
Short Answer: No!
5
Long Answer: Maybe. Compromise a small amount of privacy (low cost increase) to substantially improve quality and cost of service (high cost savings).
6
Why are anonymous exact records not so secure?

Example: medical insurance premium estimation based on patient history.
Predictive fields are often generic: age, sex, disease history, first two digits of zip code (not allowed in Germany), number of kids, etc.
Specifics such as record id (key), name, and address are omitted.
This protection can easily be broken by matching non-secure records against the secure anonymous records.

Yellow pages: Susan Calvin, 121 Norwood Cr., Austin, TX-78753
Anonymous privacy-preserving records:
Anonymous medical record 1: Female, 43, 3 kids, 78---, married
Anonymous medical record 2: Female, 43, 2 kids, 78---, single
Personal website: "Hi, I am Susan, and here are pictures of me, my husband, and my 3 wonderful kids from my 43rd birthday party!"
Internal human or automated hacker: Susan Calvin, 43, 3 kids, address, 78733, now linked to labeled medical records!
Result: broken exact record
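The matching step on this slide can be sketched in a few lines of Python. This is only an illustration: the field names, the `link` helper, and the way the public profile is assembled are assumptions, and the values simply mirror the slide's example.

```python
# Hypothetical anonymous records mirroring the slide (field names are illustrative).
anonymous_records = [
    {"record": 1, "sex": "F", "age": 43, "kids": 3, "zip_prefix": "78", "status": "married"},
    {"record": 2, "sex": "F", "age": 43, "kids": 2, "zip_prefix": "78", "status": "single"},
]

# Profile assembled from public sources (yellow pages plus personal website).
public_profile = {
    "name": "Susan Calvin", "sex": "F", "age": 43, "kids": 3,
    "zip": "78753", "status": "married",
}

def link(profile, records):
    """Return the anonymous records consistent with every generic field."""
    return [
        r for r in records
        if r["sex"] == profile["sex"]
        and r["age"] == profile["age"]
        and r["kids"] == profile["kids"]
        and r["status"] == profile["status"]
        and profile["zip"].startswith(r["zip_prefix"])
    ]

matches = link(public_profile, anonymous_records)
for r in matches:
    print(f'{public_profile["name"]} -> anonymous medical record {r["record"]}')
```

When the generic fields single out one record, the "anonymous" record is effectively labeled with a name.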
7
Two Approaches to Privacy Preservation

Distributed
Suitable for multi-party platforms. Share sub-models.
Unsupervised: Ensemble Clustering, Privacy-Preserving Clustering, etc.
Supervised: Meta-learners, Fourier Spectrum Decision Trees, Collective Hierarchical Clustering, and so on.
Secure-communication based: secure sum, secure scalar product.
Random Data Perturbation (our focus)
Perturb data by small amounts to protect the privacy of individual records.
Preserve the intrinsic distributions necessary for modeling.

8
Recovering approximately correct anonymous features also breaks privacy

Somewhat inexactly recovered anonymous record values might also be sufficient.

Yellow pages: Susan Calvin, 121 Norwood Cr., Austin, TX-78753
Denoised privacy-preserving records:
Anonymous medical record 1: Female, 44.5, 3.2 kids, 78---, married
Anonymous medical record 2: Female, 42.2, 2.1 kids, 78---, single
Personal website: "Hi, I am Susan, and here are pictures of me, my husband, and my 3 wonderful kids from my 43rd birthday party!"
Internal human or automated hacker: Susan Calvin, 43, 3 kids, address, 78733, now linked to labeled medical records!
Result: broken exact record
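A rough sketch of why approximate values can still be enough: match categorical fields exactly and numeric fields within a tolerance. The tolerance, the rounding rule, and all names below are assumptions made for illustration, not something the slides specify.

```python
# Hypothetical denoised records with the slide's approximate values.
denoised_records = [
    {"record": 1, "sex": "F", "age": 44.5, "kids": 3.2, "status": "married"},
    {"record": 2, "sex": "F", "age": 42.2, "kids": 2.1, "status": "single"},
]

public_profile = {"name": "Susan Calvin", "sex": "F", "age": 43, "kids": 3, "status": "married"}

AGE_TOLERANCE = 2.0  # assumed bound on the residual error in a recovered age

def fuzzy_link(profile, records, age_tol=AGE_TOLERANCE):
    """Match categorical fields exactly and numeric fields approximately."""
    return [
        r for r in records
        if r["sex"] == profile["sex"]
        and r["status"] == profile["status"]
        and round(r["kids"]) == profile["kids"]          # a count must round to the true value
        and abs(r["age"] - profile["age"]) <= age_tol    # age only needs to be close
    ]

fuzzy_matches = fuzzy_link(public_profile, denoised_records)
print([r["record"] for r in fuzzy_matches])
```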
9
Anonymous records (with or without small perturbations) are not secure: not a recently noticed phenomenon

1979, Denning & Denning: "The Tracker: A Threat to Statistical Database Security"
Shows why anonymous records are not secure.
Shows an example of recovering the exact salary of a professor from anonymous records.
Presents a general algorithm for an Individual Tracker.
Gives a formal probabilistic model and a set of conditions under which a dataset supports such a tracker.
1984, Traub & Yemini: "The Statistical Security of a Statistical Database"
No free lunch: perturbations cause irrecoverable loss in model accuracy.
However, the holy grail of random perturbation remains:

We can try to find a perturbation algorithm that gives the best trade-off between loss of privacy and model accuracy.
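The individual tracker mentioned on this slide can be illustrated with a toy statistical database that only answers sum queries whose query set is neither too small nor too large. The table, the size rule `MIN_SET`, and the helper names are hypothetical. The point is that two permitted queries can be subtracted to isolate one person.

```python
# Hypothetical salary table; all values are made up for illustration.
RECORDS = [
    {"dept": "CS", "sex": "F", "salary": 98000},  # the only woman in CS
    {"dept": "CS", "sex": "M", "salary": 87000},
    {"dept": "CS", "sex": "M", "salary": 91000},
    {"dept": "EE", "sex": "F", "salary": 83000},
    {"dept": "EE", "sex": "M", "salary": 79000},
    {"dept": "ME", "sex": "F", "salary": 88000},
    {"dept": "ME", "sex": "M", "salary": 92000},
    {"dept": "ME", "sex": "M", "salary": 85000},
]
MIN_SET = 2  # assumed rule: query sets must hold between MIN_SET and n - MIN_SET records

def query_sum(predicate, field="salary"):
    """Answer a sum query only if the query set size is in the allowed range."""
    matches = [r for r in RECORDS if predicate(r)]
    if not MIN_SET <= len(matches) <= len(RECORDS) - MIN_SET:
        raise PermissionError("query set size outside the allowed range")
    return sum(r[field] for r in matches)

def is_cs(r):
    return r["dept"] == "CS"

def is_female(r):
    return r["sex"] == "F"

# The direct query "CS and female" matches one record, so it is refused.
try:
    query_sum(lambda r: is_cs(r) and is_female(r))
    direct_allowed = True
except PermissionError:
    direct_allowed = False

# Tracker: "CS and not female" is a permitted query, and so is "CS".
total_cs = query_sum(is_cs)
tracker = query_sum(lambda r: is_cs(r) and not is_female(r))
recovered_salary = total_cs - tracker
print(direct_allowed, recovered_salary)
```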
10
Recovering Perturbed Distributions: Earlier Work

Reconstructing the original distribution from perturbed ones. Setup:
N samples U1, U2, U3, ..., Un.
N noise values V1, V2, V3, ..., Vn, all drawn from a public (known) distribution V.
Visible noisy data: W1 = U1 + V1, W2 = U2 + V2, ...
Assumption: such noise allows you to recover the distribution of U1, U2, U3, ..., Un, but not the individual records.
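Two well-known methods and definitions:
Agrawal & Srikant
Interval based: Privacy(X) at confidence 0.95 = X2 - X1
Agrawal & Aggarwal
Distributional: Privacy(X) = $2^{h(X)}$, where h(X) is the differential entropy of X

[Figure: density f(x) with the interval from X1 to X2 marked]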
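A small numerical sketch of the setup and of both privacy measures, using NumPy. The data distribution, the noise width `alpha`, and `sigma` are arbitrary choices for illustration. The interval widths use the standard facts that 95% of a uniform distribution on [-alpha, alpha] covers a width of 0.95 x 2 alpha, and 95% of a Gaussian covers about 2 x 1.96 sigma.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

u = rng.normal(loc=50.0, scale=10.0, size=n)  # hypothetical private values U_i
alpha = 20.0
v = rng.uniform(-alpha, alpha, size=n)        # public noise distribution V
w = u + v                                     # visible data W_i = U_i + V_i

# Aggregate properties survive: the mean is preserved and the variances add.
print("mean(U), mean(W):", u.mean(), w.mean())
print("var(W), var(U) + var(V):", w.var(), u.var() + alpha**2 / 3)

# Interval-based privacy at 95% confidence (width of the interval X2 - X1).
width_uniform = 0.95 * 2 * alpha
sigma = 10.0
width_gauss = 2 * 1.96 * sigma

# Distributional privacy 2^h, with h the differential entropy in bits.
privacy_uniform = 2.0 ** np.log2(2 * alpha)                            # equals 2 * alpha
privacy_gauss = 2.0 ** (0.5 * np.log2(2 * np.pi * np.e * sigma**2))    # sqrt(2*pi*e) * sigma
print(width_uniform, width_gauss, privacy_uniform, privacy_gauss)
```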
11
Interval Based Method (Agrawal & Srikant) in More Detail

N samples U1, U2, U3, ..., Un.
N noise values V1, V2, V3, ..., Vn, all drawn from a public (known) distribution V.
W1 = U1 + V1, W2 = U2 + V2, ...
Visible noisy data: W1, W2, W3, ...
Given the noise density fV, Bayes' rule lets us express the posterior cumulative distribution function of u in terms of the visible w, fV, and the unknown desired density fU.

Differentiating w.r.t. u, we get an important recursive definition of fU.
Notation issue (in the paper): f' simply means an approximation of the true f, not the derivative of f!
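For reference, the posterior CDF and the recursive definition can be written out as below. This is a reconstruction in the slide's notation that follows the formulation in Agrawal and Srikant's paper, so treat the exact symbols as an assumption: w_i are the visible values, f_V is the known noise density, and primes mark estimates rather than derivatives.

```latex
% Posterior CDF estimate of U given the visible values w_1, ..., w_n
F'_U(a) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{\int_{-\infty}^{a} f_V(w_i - z)\, f_U(z)\, dz}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}

% Differentiating with respect to a gives the density estimate
f'_U(a) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{f_V(w_i - a)\, f_U(a)}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}

% Used recursively: plug the current estimate f_U^{(J)} into the right-hand side
f_U^{(J+1)}(a) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{f_V(w_i - a)\, f_U^{(J)}(a)}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U^{(J)}(z)\, dz}
```

The definition is recursive because the unknown f_U appears on both sides.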
12
Interval Based Method (Agrawal & Srikant) in More Detail
Algorithm in practice:
Seed with a uniform distribution for J = 0.
Go from step J to step J+1 using the recursive definition, with two changes:
the integration is replaced with a summation over the i.i.d. samples;
the sum runs over discrete z intervals instead of an integral, for speed.

Does it converge to a local minimum? An initialization other than uniform might give a different result. Not explored by the authors.
For large enough samples, we hope to get close to the true distribution.
Stop when fU(J+1) - fU(J) becomes small.
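The iteration described on this slide can be sketched with NumPy as follows. This is a minimal illustration, not the authors' code: the grid, the Gaussian noise, the bimodal test data, the L1 stopping rule, and all function names are assumptions.

```python
import numpy as np

def reconstruct_density(w, noise_pdf, grid, max_iter=500, tol=1e-4):
    """Iteratively estimate the density of U on `grid` from noisy values w = u + v."""
    dz = grid[1] - grid[0]
    f = np.full(grid.size, 1.0 / (grid.size * dz))    # J = 0: uniform seed
    kernel = noise_pdf(w[:, None] - grid[None, :])    # kernel[i, k] = f_V(w_i - z_k)
    for _ in range(max_iter):
        # Denominator: sum over discrete z intervals instead of an integral.
        denom = np.maximum(kernel @ f * dz, 1e-300)
        # Numerator, averaged over the samples.
        f_new = f * (kernel / denom[:, None]).mean(axis=0)
        f_new /= f_new.sum() * dz                     # keep the estimate normalised
        change = np.abs(f_new - f).sum() * dz
        f = f_new
        if change < tol:                              # stop when the update is small
            break
    return f

rng = np.random.default_rng(0)
n = 2000
# Hypothetical bimodal private data and Gaussian noise with known sigma.
u = np.where(rng.random(n) < 0.5, rng.normal(0.0, 0.5, n), rng.normal(4.0, 0.5, n))
noise_sigma = 1.0
w = u + rng.normal(0.0, noise_sigma, n)

def gaussian_pdf(x, s=noise_sigma):
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2 * np.pi))

grid = np.linspace(-8.0, 10.0, 181)
f_hat = reconstruct_density(w, gaussian_pdf, grid)
est_mean = (grid * f_hat).sum() * (grid[1] - grid[0])
print("estimated mean:", est_mean, "true sample mean:", u.mean())
```

Changing the seed from uniform to something else is an easy way to probe the open question about sensitivity to initialization.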

13
Interval Based Method: Good Results for a Variety of Noises
14
Revisiting an Essential Assumption in Random Perturbation
Assumption: such noise allows you to recover the distribution of U1, U2, U3, ..., Un, but not the individual records.

The authors of this paper challenge this assumption.
Claim: the added randomness can be mostly visual and not real.
Many simple forms of random perturbation are breakable.

15
Exploit predictable properties of random data to design a filter that breaks the perturbation encryption?
[Figure panels: spiral data; random data, with all eigenvalues close to 1]
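The "all eigenvalues close to 1" observation is easy to reproduce numerically. The sketch below draws unit-variance Gaussian noise and looks at the eigenvalues of its sample covariance. The bounds `lam_min` and `lam_max` use the random-matrix form sigma^2 (1 -/+ 1/sqrt(Q))^2 with Q = m/n, which is assumed here to be the result the slides rely on. The matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 20            # m observations, n features (arbitrary sizes)
sigma2 = 1.0               # noise variance

V = rng.normal(0.0, np.sqrt(sigma2), size=(m, n))
eigvals = np.linalg.eigvalsh(V.T @ V / m)   # eigenvalues of the noise covariance

# Assumed theoretical range for pure-noise eigenvalues, with Q = m / n.
Q = m / n
lam_min = sigma2 * (1 - 1 / np.sqrt(Q)) ** 2
lam_max = sigma2 * (1 + 1 / np.sqrt(Q)) ** 2

print("observed range:", eigvals.min(), eigvals.max())
print("theoretical range:", lam_min, lam_max)
```

Structured data such as the spiral would instead show a few eigenvalues far outside this narrow band.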
16
Spectral Filtering, Main Idea: Use Eigenvalue Properties of Noise to Filter

[Figure: U + V data; decomposition of eigenvalues of noise and original data; recovered data]

17
Decomposing eigenvalues: separating data from noise
Let U and V be the m x n data and noise matrices, and U_P the perturbed matrix: $U_P = U + V$.
Covariance matrix of U_P: $U_P^T U_P = (U + V)^T (U + V) = U^T U + V^T U + U^T V + V^T V$.
Since signal and noise are uncorrelated in random perturbation, for a large number of observations $V^T U \approx 0$ and $U^T V \approx 0$, therefore
$U_P^T U_P \approx U^T U + V^T V$.
Since the above 3 matrices are correlation matrices, they are symmetric and positive semi-definite; therefore, we can perform an eigendecomposition of each.
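A quick numerical check of the claim that the cross terms vanish as the number of observations grows. The correlated signal below is a made-up example, and the cross term is divided by m so that it is on the scale of a covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
mixing = rng.normal(size=(2, n))   # hypothetical mixing that makes features correlated

cross_terms = []
for m in (100, 10_000, 100_000):
    U = rng.normal(size=(m, 2)) @ mixing       # signal, independent of the noise
    V = rng.normal(0.0, 1.0, size=(m, n))      # additive noise
    cross = np.abs(V.T @ U / m).max()          # largest entry of V^T U / m
    cross_terms.append(cross)
    print(m, cross)
```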
18
With a bunch of algebra and theorems from matrix perturbation theory, the authors derive a result that holds in the limit (lots of data).

Wigner's law: describes the distribution of eigenvalues for normal random matrices.
Eigenvalues of the noise component V stay in a thin range given by λmin and λmax (example on the next slide) with high probability.
This allows us to compute λmin and λmax.

Solution!

Giving us the following algorithm:
Find a large number of eigenvalues of the perturbed data P.
Separate all eigenvalues that fall between λmin and λmax and save their row indices IV.
Take the remaining eigenvalue indices, which are the perturbed-but-not-noise eigenvalues coming from the true data U, and save their row indices IU.
Break the perturbed eigenvector matrix QP into AU = QP(IU) and AV = QP(IV).
Estimate the true data as a projection.
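The steps above can be sketched with NumPy as follows. Several details are assumptions rather than the authors' exact procedure: the projection is taken to be P A_U A_U^T, the noise variance is treated as known, λmax uses the form sigma^2 (1 + 1/sqrt(m/n))^2, and, as a simplification, only eigenvalues above λmax are kept as signal. The test data is a made-up low-rank signal.

```python
import numpy as np

def spectral_filter(P, noise_var):
    """Estimate the original data from perturbed data P (m x n) by spectral filtering."""
    m, n = P.shape
    mean = P.mean(axis=0)
    Pc = P - mean                                     # centre the perturbed data
    eigvals, Q_P = np.linalg.eigh(Pc.T @ Pc / m)      # eigenpairs of the covariance
    lam_max = noise_var * (1 + 1 / np.sqrt(m / n)) ** 2   # assumed upper noise bound
    I_U = eigvals > lam_max                           # indices treated as signal
    A_U = Q_P[:, I_U]                                 # eigenvectors are columns in NumPy
    U_hat = Pc @ A_U @ A_U.T + mean                   # project onto the signal subspace
    return U_hat, int(I_U.sum())

rng = np.random.default_rng(0)
m, n = 5000, 20
# Hypothetical strongly correlated data: 2 latent factors spread over 20 features.
U = 5.0 * rng.normal(size=(m, 2)) @ rng.normal(size=(2, n))
V = rng.normal(0.0, 1.0, size=(m, n))
P = U + V

U_hat, n_signal = spectral_filter(P, noise_var=1.0)
rmse_perturbed = np.sqrt(np.mean((P - U) ** 2))
rmse_filtered = np.sqrt(np.mean((U_hat - U) ** 2))
print("signal eigenvalues kept:", n_signal)
print("RMSE before filtering:", rmse_perturbed, "after:", rmse_filtered)
```

The filter helps here because the features are strongly correlated. With independent features there is no low-dimensional signal subspace to project onto.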

19
Exploit predictable properties of random data to design a filter that breaks the perturbation encryption?
[Figure panels: spiral data; random data, with all eigenvalues close to 1]
20
Results: Quality of Eigenvalue Recovery
Only the real eigenvalues got captured, because of the nice automatic thresholding!
21
Results: Comparison with Agrawal's Reconstruction
Agrawal & Srikant (no breaking of encryption)
Agrawal & Srikant (estimated from broken encryption)
22
Discussion

An amazing amount of experimental results and comparisons is presented by the authors in the journal version.
An extension to the situation where the form of the perturbing distribution is known, but its exact first-, second-, or higher-order statistics are not, is discussed but not presented.
Comparison of performance with other obvious noise-reduction techniques from the signal processing community:
Moving averages and Wiener filtering.
PCA-based filtering.
Pros and cons of the perturbation analysis by the authors (and in general):
Effect of more and more noise: rapid degradation of results.
Problems in dealing with inherent noise in the original data.
The technique fails when features are independent of each other, because it exploits the covariance matrix.
This points to a major possible improvement in encryption: perform ICA/PCA and then randomize?
Results suggest that more complex noise models might be harder to break. It is not clear whether this improves the privacy vs. model-quality trade-off.
Does eigendecomposition have an inherent metric assumption?

23
A not-so-ominous application of noise filtering: Nulling Interferometer on Terrestrial Planet Finder-I
