Random Data Perturbation Techniques and Privacy Preserving Data Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Random Data Perturbation Techniques and Privacy Preserving Data Mining

Description:

Random Data Perturbation Techniques and Privacy Preserving Data Mining (Authors: H. Kargupta, S. Datta, Q. Wang & K. Sivakumar) April 26, 2005 Gunjan Gupta – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Random Data Perturbation Techniques and Privacy Preserving Data Mining


1
Random Data Perturbation Techniques and Privacy
Preserving Data Mining
(Authors: H. Kargupta, S. Datta, Q. Wang, and K.
Sivakumar)
April 26, 2005
  • Gunjan Gupta

2
Privacy and Good Service: Often Conflicting Goals
  • Privacy
  • Customer: I don't want you to share my personal
    information with anyone.
  • Business: I don't want to share my data with a
    competitor.
  • Quantity, Cost, and Quality of Service
  • Customer: I want you to provide me good quality
    of service, and at lower cost.
  • Paradox: lower cost often comes from being able
    to use/share sensitive data, which can be used or
    misused:
  • Provide better service by predicting consumer
    needs better, or sell information to marketers.
  • Optimize load sharing between competing utilities,
    or preempt competition.
  • A doctor saving a patient by knowing the patient's
    history, or insurance companies declining coverage
    to individuals with preexisting conditions.

3
Central Question
Can we use privacy-sensitive data to optimize the
cost and quality of a service without
compromising any privacy?
4
Short Answer: No!
5
Long Answer: Maybe. Compromise a small amount of
privacy (low cost increase) to improve quality
and cost of service (high cost savings)
substantially.
6
Why are anonymous exact records not so secure?
  • Example: medical insurance premium estimation
    based on patient history.
  • Predictive fields are often generic: age, sex,
    disease history, first two digits of zip code
    (not allowed in Germany), no. of kids, etc.
  • Specifics such as record id (key), name, and
    address are omitted.
  • This could be easily broken by matching
    non-secure records with secure anonymous records.

[Figure: linking public sources to anonymous records]
  • Yellowpages: Susan Calvin, 121 Norwood Cr.,
    Austin, TX-78753
  • Personal website: "Hi, I am Susan, and here are
    pictures of me, my husband, and my 3 wonderful
    kids from my 43rd birthday party!"
  • Anonymous privacy preserving records:
    Female, 43, 3 kids, 78---, married, anonymous
    medical record 1
    Female, 43, 2 kids, 78---, single, anonymous
    medical record 2
  • Internal human / automated hacker combines the
    sources.
  • Broken exact record: Susan Calvin, 43, 3 kids,
    Address, 78733, now labeled med. records!
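The matching step in the example above can be sketched as a join on the generic fields. This is a minimal illustration, not the method of any cited paper: the field names, the second public entry, and the helper `link` are all hypothetical, and the rows are toy data modeled on the slide.

```python
# Toy data modeled on the slide; field names and "Jane Roe" are hypothetical.
public = [
    {"name": "Susan Calvin", "sex": "F", "age": 43, "kids": 3,
     "zip2": "78", "married": True},
    {"name": "Jane Roe", "sex": "F", "age": 51, "kids": 0,
     "zip2": "78", "married": False},
]
anonymous = [
    {"record": "medical record 1", "sex": "F", "age": 43, "kids": 3,
     "zip2": "78", "married": True},
    {"record": "medical record 2", "sex": "F", "age": 43, "kids": 2,
     "zip2": "78", "married": False},
]
QUASI_IDENTIFIERS = ("sex", "age", "kids", "zip2", "married")


def link(public_rows, anonymous_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where exactly one anonymous row matches."""
    linked = []
    for person in public_rows:
        matches = [row for row in anonymous_rows
                   if all(row[k] == person[k] for k in keys)]
        if len(matches) == 1:  # a unique match re-identifies the record
            linked.append((person["name"], matches[0]["record"]))
    return linked


links = link(public, anonymous)
```

A unique match on the generic fields is enough to attach a name to an "anonymous" record, which is the point of the slide.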
7
Two approaches to Privacy Preserving
  • Distributed
  • Suitable for multi-party platforms. Share
    sub-models.
  • Unsupervised: Ensemble Clustering, Privacy
    Preserving Clustering, etc.
  • Supervised: Meta-learners, Fourier Spectrum
    Decision Trees, Collective Hierarchical
    Clustering, and so on.
  • Secure communication based: secure sum, secure
    scalar product.
  • Random Data Perturbation (our focus)
  • Perturb data by small amounts to protect privacy
    of individual records.
  • Preserve intrinsic distributions necessary for
    modeling.
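A rough sketch of the random data perturbation idea, assuming additive, zero-mean uniform noise (one common choice; the slide does not fix a noise model). The "age-like" data and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive attribute (age-like values).
u = rng.normal(loc=40.0, scale=10.0, size=50_000)

# Publicly known noise distribution: Uniform(-5, 5), variance 10^2 / 12.
half_width = 5.0
v = rng.uniform(-half_width, half_width, size=u.size)
noise_var = (2 * half_width) ** 2 / 12

w = u + v  # only the perturbed values are released

# Individual records move noticeably...
mean_abs_change = np.abs(w - u).mean()
# ...while aggregate statistics remain recoverable from w alone.
mean_shift = abs(w.mean() - u.mean())
var_estimate = w.var() - noise_var  # subtract the known noise variance
```

Each released value differs from the true one by about 2.5 on average here, yet the mean and (noise-corrected) variance of the original data can still be estimated.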

8
Recovering approximately correct anonymous
features also breaks privacy
  • Somewhat inexactly recovered anonymous record
    values might also be sufficient

[Figure: linking public sources to denoised records]
  • Yellowpages: Susan Calvin, 121 Norwood Cr.,
    Austin, TX-78753
  • Personal website: "Hi, I am Susan, and here are
    pictures of me, my husband, and my 3 wonderful
    kids from my 43rd birthday party!"
  • Denoised privacy preserving records:
    Female, 44.5, 3.2 kids, 78---, married, anonymous
    medical record 1
    Female, 42.2, 2.1 kids, 78---, single, anonymous
    medical record 2
  • Internal human / automated hacker combines the
    sources.
  • Broken exact record: Susan Calvin, 43, 3 kids,
    Address, 78733, now labeled med. records!
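A small sketch of why inexact values can still be enough: match with a tolerance instead of exact equality. The tolerance `age_tol`, the field names, and the helper `close_match` are assumptions made for illustration only.

```python
# Toy rows modeled on the slide; field names are hypothetical.
susan = {"name": "Susan Calvin", "age": 43, "kids": 3, "married": True}
denoised = [
    {"record": "medical record 1", "age": 44.5, "kids": 3.2, "married": True},
    {"record": "medical record 2", "age": 42.2, "kids": 2.1, "married": False},
]


def close_match(person, row, age_tol=2.0):
    """Approximate match: age within a tolerance, kids rounded to an integer."""
    return (abs(row["age"] - person["age"]) <= age_tol
            and round(row["kids"]) == person["kids"]
            and row["married"] == person["married"])


matches = [row["record"] for row in denoised if close_match(susan, row)]
```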
9
Anonymous records (with or without small
perturbations) are not secure: not a recently
noticed phenomenon
  • 1979, Denning & Denning: "The Tracker: A Threat to
    Statistical Database Security"
  • Show why anonymous records are not secure.
  • Show an example of recovering the exact salary
    of a professor from anonymous records.
  • Present a general algorithm for an Individual
    Tracker.
  • A formal probabilistic model and set of
    conditions that make a dataset support such a
    tracker.
  • 1984, Traub & Yemini: "The Statistical Security
    of a Statistical Database"
  • No free lunch: perturbations cause irrecoverable
    loss in model accuracy.
  • However, the holy grail of random perturbation:
    we can try to find a perturbation algorithm that
    best trades off loss of privacy against model
    accuracy.
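To make the individual-tracker idea concrete, here is a toy sketch. It assumes a database that answers only aggregate queries and refuses any query whose matching set has fewer than k or more than n - k rows. The class `StatDB`, the rows, and the salaries are hypothetical. The trick is to pad the forbidden query: if the target is the only row matching "A and B", then sum(A) minus sum(A and not B) yields the target's value using two permitted queries.

```python
class StatDB:
    """Toy statistical database that only answers aggregate queries."""

    def __init__(self, rows, k):
        self.rows = rows
        self.k = k  # allowed query-set sizes: k .. len(rows) - k

    def total(self, predicate, field):
        hits = [r for r in self.rows if predicate(r)]
        if not (self.k <= len(hits) <= len(self.rows) - self.k):
            raise PermissionError("query-set size outside allowed range")
        return sum(r[field] for r in hits)


# Hypothetical records: exactly one female professor in CS.
rows = [
    {"dept": "CS", "sex": "F", "salary": 91000},
    {"dept": "CS", "sex": "M", "salary": 80000},
    {"dept": "CS", "sex": "M", "salary": 85000},
    {"dept": "CS", "sex": "M", "salary": 78000},
    {"dept": "EE", "sex": "F", "salary": 70000},
    {"dept": "EE", "sex": "F", "salary": 72000},
    {"dept": "EE", "sex": "M", "salary": 75000},
    {"dept": "EE", "sex": "M", "salary": 69000},
]
db = StatDB(rows, k=2)


def in_cs(r):
    return r["dept"] == "CS"


def tracker(r):  # A and not B
    return r["dept"] == "CS" and r["sex"] != "F"


# The direct query isolates one person and is refused.
try:
    db.total(lambda r: in_cs(r) and r["sex"] == "F", "salary")
    direct_blocked = False
except PermissionError:
    direct_blocked = True

# Two permitted queries reveal the same value.
leaked_salary = db.total(in_cs, "salary") - db.total(tracker, "salary")
```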
10
Recovering perturbed distributions: Earlier work
  • Reconstructing the original distribution from
    perturbed ones. Setup:
  • N samples U1, U2, U3, ..., UN.
  • N noise values V1, V2, V3, ..., VN, all taken
    from a public (known) distribution V.
  • Visible noisy data: W1 = U1 + V1, W2 = U2 + V2, ...
  • Assumption: such noise can allow you to recover
    the distribution of U1, U2, U3, ..., UN, but not
    the individual records.
  • Two well-known methods and definitions:
  • Agrawal & Srikant
  • Interval based: Privacy(X) at confidence 0.95 =
    X2 - X1
  • Agrawal & Aggarwal
  • Distributional: Privacy(X) = 2^h(X), with h the
    differential entropy

[Figure: density f(x) with the interval from X1 to X2 marked]
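The two privacy definitions can be compared on simple noise distributions. The sketch below assumes closed-form expressions for Uniform(-a, a) and zero-mean Gaussian noise; the function names are made up for illustration.

```python
import math
from statistics import NormalDist


def interval_privacy_uniform(a, conf=0.95):
    """Width of an interval holding `conf` of the mass of Uniform(-a, a)."""
    return conf * 2 * a


def interval_privacy_gaussian(sigma, conf=0.95):
    """Width of the central interval holding `conf` of the mass of N(0, sigma^2)."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    return 2 * z * sigma


def entropy_privacy_uniform(a):
    """2^h for Uniform(-a, a), with h = log2(2a) bits."""
    return 2 ** math.log2(2 * a)


def entropy_privacy_gaussian(sigma):
    """2^h for N(0, sigma^2), with h = 0.5 * log2(2 * pi * e * sigma^2) bits."""
    h = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)
    return 2 ** h


interval_u = interval_privacy_uniform(1.0)
interval_g = interval_privacy_gaussian(1.0)
entropy_u = entropy_privacy_uniform(1.0)
entropy_g = entropy_privacy_gaussian(1.0)
```

For unit-scale noise, the interval measure gives about 1.9 (uniform) and 3.92 (Gaussian), while the entropy measure gives 2 and about 4.13, so the two definitions rank these cases the same way but on slightly different scales.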
11
Interval Based Method: Agrawal & Srikant in more
detail
  • N samples U1, U2, U3, ..., UN.
  • N noise values V1, V2, V3, ..., VN, all taken
    from a public (known) distribution V.
  • W1 = U1 + V1, W2 = U2 + V2, ...
  • Visible noisy data: W1, W2, W3, ...
  • Given the noise density f_V, Bayes' rule lets us
    express the cumulative posterior distribution
    function of u in terms of w (visible), f_V, and
    the unknown desired function f_U.

Differentiating w.r.t. u, we get an important
recursive definition.
Notation issue (in paper): f' simply means an
approximation of the true f, not the derivative of f!
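The slide's equations did not survive in this transcript. As a hedged reconstruction from the setup above (written from the standard Bayes-rule argument, not copied from the slide), the averaged posterior CDF and its derivative would take the following form, with the prime marking an estimate as noted above:

```latex
% Posterior CDF of U, averaged over the n visible samples w_i
F'_U(a) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{\int_{-\infty}^{a} f_V(w_i - z)\, f_U(z)\, dz}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}

% Differentiating with respect to a gives the density estimate
f'_U(a) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{f_V(w_i - a)\, f_U(a)}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}
```

The definition is recursive because the unknown f_U appears on the right-hand side.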
12
Interval Based Method: Agrawal & Srikant in more
detail
Algorithm in practice
  • Seed with a uniform distribution for J = 0.
  • Go from STEP J to STEP J+1, with the integration
    replaced by a summation over i.i.d. samples.
  • Sum over discrete z intervals instead of the
    integral, for speed.
  • Converges to a local minimum? An initialization
    other than uniform might give a different
    result. Not explored by authors.
  • For large enough samples, hope to get close to
    the true distribution.
  • Stop when f_U(J+1) - f_U(J) becomes small.
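The iteration can be sketched on a discrete grid as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Gaussian noise, a bimodal toy dataset, a fixed grid, and an L1 change threshold as the stopping rule are all choices made here.

```python
import numpy as np


def gaussian_pdf(x, sigma):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def reconstruct_density(w, noise_pdf, grid, max_iter=100, tol=1e-4):
    """Iteratively estimate the density of U on `grid` from W = U + V."""
    dz = grid[1] - grid[0]
    f = np.full(grid.size, 1.0 / (grid.size * dz))  # uniform seed, J = 0
    # kernel[i, j] = f_V(w_i - z_j) for every sample and grid point
    kernel = noise_pdf(w[:, None] - grid[None, :])
    for _ in range(max_iter):
        denom = kernel @ f * dz  # discrete stand-in for the integral over z
        f_next = f * (kernel / denom[:, None]).mean(axis=0)  # STEP J -> J+1
        converged = np.abs(f_next - f).sum() * dz < tol
        f = f_next
        if converged:
            break
    return f


rng = np.random.default_rng(0)
n = 2000
# Hypothetical bimodal original data with modes near -2 and +2.
u = np.where(rng.random(n) < 0.5, -2.0, 2.0) + rng.normal(0.0, 0.5, n)
sigma_v = 1.0
w = u + rng.normal(0.0, sigma_v, n)  # visible perturbed values

grid = np.linspace(-6.0, 6.0, 121)
f_hat = reconstruct_density(w, lambda x: gaussian_pdf(x, sigma_v), grid)
```

Only `w` and the noise density enter the reconstruction; `u` is kept here solely so the estimate can be compared against the truth.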

13
Interval Based Method: Good Results for a Variety
of Noises
14
Revisiting an Essential Assumption in Random
Perturbation
Assumption: such noise can allow you to recover
the distribution of U1, U2, U3, ..., UN, but not
the individual records.
  • The authors of this paper challenge this
    assumption.
  • Claim: the added randomness can be mostly visual
    and not real.
  • Many simple forms of random perturbations are
    breakable.

15
Exploit predictable properties of Random data to
design a filter to break the perturbation
encryption?
[Figure: panels "Spiral data" and "Random data";
annotation: "All eigen-values close to 1!"]
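A quick numerical check of the "eigenvalues close to 1" property, assuming i.i.d. unit-variance Gaussian noise and many more observations than features. The correlated comparison data is a hypothetical stand-in, not the slide's spiral dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20_000, 10  # observations x features

# Pure random data: i.i.d. unit-variance noise.
v = rng.normal(0.0, 1.0, size=(m, n))
eig_noise = np.linalg.eigvalsh(v.T @ v / m)

# Structured data: every feature follows one shared latent factor.
latent = rng.normal(size=(m, 1))
u = latent @ np.ones((1, n)) + 0.1 * rng.normal(size=(m, n))
eig_data = np.linalg.eigvalsh(u.T @ u / m)
```

The noise eigenvalues cluster tightly around 1, whereas the structured data puts almost all of its variance into a single large eigenvalue. That contrast is what a filter can exploit.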
16
Spectral Filtering: Main Idea: Use eigenvalue
properties of noise to filter
  • U + V data
  • Decomposition of eigenvalues of noise and
    original data
  • Recovered data

17
Decomposing eigenvalues: separating data from
noise
Let U and V be the m x n data and noise
matrices, and U_P the perturbed matrix: U_P = U + V.
Covariance matrix of U_P:
U_P^T U_P = (U + V)^T (U + V)
          = U^T U + V^T U + U^T V + V^T V
Since signal and noise are uncorrelated in random
perturbation, for a large no. of observations
V^T U ≈ 0 and U^T V ≈ 0, therefore
U_P^T U_P ≈ U^T U + V^T V
Since the above 3 matrices are correlation
matrices, they are symmetric and positive
semi-definite; therefore, we can perform eigen
decomposition.
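The vanishing cross terms can be checked numerically. The sketch below assumes zero-mean data and noise generated independently, and divides each product by the number of observations m so the quantities are per-sample covariances. The mixing matrix and sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50_000, 5

mix = rng.normal(size=(n, n))           # hypothetical mixing, makes features correlated
u = rng.normal(size=(m, n)) @ mix       # data
v = rng.normal(0.0, 0.5, size=(m, n))   # independent noise
p = u + v                               # perturbed matrix


def cov(a, b):
    """Per-sample cross-product matrix of two zero-mean data matrices."""
    return a.T @ b / m


cross_term = np.abs(cov(v, u)).max()
decomposition_gap = np.abs(cov(p, p) - (cov(u, u) + cov(v, v))).max()
```

Both quantities shrink toward zero as m grows, which is the "large no. of observations" condition in the derivation.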
18
With a bunch of algebra and theorems from matrix
perturbation theory, the authors show that in the
limit (lots of data):
  • Wigner's law: describes the distribution of
    eigenvalues for normal random matrices.
  • Eigenvalues for the noise component V stick in a
    thin range given by λmin and λmax (show example
    next page) with high probability.
  • Allows us to compute λmin and λmax.

Solution!
  • Giving us the following algorithm:
  • Find a large no. of eigenvalues of the perturbed
    data P.
  • Separate all eigenvalues inside λmin and λmax
    and save their row indices I_V.
  • Take the remaining eigen indices to get the
    perturbed but not noise eigens coming from the
    true data U; save their row indices I_U.
  • Break the perturbed eigenvector matrix Q_P into
    A_U = Q_P(I_U), A_V = Q_P(I_V).
  • Estimate the true data as a projection.
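The steps above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the noise is i.i.d. with known variance, the upper noise bound is taken as noise_var * (1 + sqrt(n/m))^2, only eigenvalues above that bound are treated as signal, and the projection is P A_U A_U^T. The low-rank test data is hypothetical.

```python
import numpy as np


def spectral_filter(p, noise_var):
    """Estimate the original data from perturbed data p (m rows, n columns)."""
    m, n = p.shape
    mean = p.mean(axis=0)
    centered = p - mean
    eigvals, q_p = np.linalg.eigh(centered.T @ centered / m)
    # Assumed upper edge of the noise eigenvalue range (lambda_max).
    lam_max = noise_var * (1.0 + np.sqrt(n / m)) ** 2
    a_u = q_p[:, eigvals > lam_max]  # eigenvectors attributed to the data
    return centered @ a_u @ a_u.T + mean  # projection onto the signal subspace


rng = np.random.default_rng(0)
m, n = 5000, 20
# Hypothetical correlated data: two latent factors spread over 20 features.
latent = rng.normal(size=(m, 2)) * np.array([5.0, 3.0])
u = latent @ rng.normal(size=(2, n))
noise_var = 1.0
p = u + rng.normal(0.0, np.sqrt(noise_var), size=(m, n))

u_hat = spectral_filter(p, noise_var)
err_perturbed = np.mean((p - u) ** 2)  # about noise_var
err_filtered = np.mean((u_hat - u) ** 2)
```

On strongly correlated data like this, the filtered estimate is much closer to the original records than the perturbed release, which is the sense in which the perturbation is "broken".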

19
Exploit predictable properties of Random data to
design a filter to break the perturbation
encryption?
[Figure: panels "Spiral data" and "Random data";
annotation: "All eigen-values close to 1!"]
20
Results: Quality of eigenvalue recovery
Only the real eigens got captured, because of
the nice automatic thresholding!
21
Results: Comparison with Aggarwal's reproduction
[Figure panels:]
Agrawal & Srikant (no breaking of encryption)
Agrawal & Srikant (estimated from broken
encryption)
22
Discussion
  • Amazing amount of experimental results and
    comparisons presented by the authors in the
    journal version.
  • Extension to a situation where the form of the
    perturbing distribution is known but the exact
    first, second, or higher order statistics are
    not known: discussed but not presented.
  • Comparison of performance with other obvious
    techniques for noise reduction from the signal
    processing community:
  • Moving averages and Wiener filtering.
  • PCA based filtering.
  • Pros and cons of the perturbation analysis by
    the authors (and in general):
  • Effect of more and more noise: rapid degradation
    of results.
  • Problem in dealing with inherent noise in the
    original data.
  • Technique fails when features are independent of
    each other, because it exploits the covariance
    matrix. This points to a possible major
    improvement in encryption: perform ICA/PCA and
    then randomize?
  • Results suggest that more complex noise models
    might be harder to break. Not clear if this
    improves the privacy vs. model quality tradeoff.
  • Eigen decomposition has an inherent metric
    assumption?

23
A not-so-ominous application of noise filtering:
Nulling Interferometer on Terrestrial Planet
Finder-I