Technology Papers Presentation in CS290: Privacy Preserving Data Mining



Technology Papers Presentation in CS290: Privacy Preserving Data Mining
Min Wu, Biomedical Engineering Dept. at UNC-CH, 10/31/2002

Outline
  • Data privacy vs. data mining
  • How to protect data privacy in data mining?
  • Watermarking relational databases
    (Rakesh Agrawal, VLDB 2002, IBM)
  • Privacy-preserving data mining
    (Rakesh Agrawal, SIGMOD 2000, pp. 439-450, IBM)
  • Maintaining data privacy in association rule mining
    (Shariq J. Rizvi, VLDB 2002, IIT)

Data privacy vs. Data mining
  • The primary assets of companies are their
    databases of information.
  • Privacy issues are further exacerbated now that
    the World Wide Web makes it easy for new data
    to be automatically collected and added to
    databases.
  • Data mining, with its promise to efficiently
    discover valuable, non-obvious information from
    large databases, is particularly vulnerable to
    misuse.

Data privacy vs. Data mining
  • Privacy and accuracy are typically contradictory
    in nature. Fortunately, the purpose of data
    mining is essentially to identify statistical
    trends, so cent-per-cent (100%) accuracy in the
    mining results is often not required.
  • Can we develop accurate models about
    aggregated data without access to precise
    information in individual data records?
  • The intentional errors introduced for data
    privacy (marks, data perturbation, distortion)
    must not have a significant impact on the
    usefulness of the data.

How to protect data privacy in data mining?
  • (R 2002, IBM) Insertion and detection of
    digital watermarks in relational databases.
  • (R 2000, IBM) Using randomizing functions for
    data perturbation in the case of decision-tree
    classification.
  • (S 2002, IIT) Using probabilistic data
    distortion in association rule mining.

(R 2002, IBM) Watermarking Relational Databases
  • Assumes that the marked (numeric) attributes can
    tolerate changes in some of the values.
  • The basic idea is to ensure that some bit
    positions for some of the attributes of some of
    the tuples contain specific values. Which tuples,
    attributes, bit positions, and values are used is
    determined under the control of a private key
    known only to the owner of the relation. This bit
    pattern constitutes the watermark.
  • The watermark can be detected with high
    probability only by someone who has access to
    the private key.

Insert watermark algorithm

Detect watermark algorithm

Design trade-off
  • There are four important tunable parameters:
  • i) α, the test significance level,
  • ii) γ, the gap parameter that determines the
    fraction of tuples marked,
  • iii) ν, the number of attributes in the
    relation available for marking,
  • iv) ξ, the number of least significant bits
    available for marking.
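
The insert and detect steps, and the role of the four parameters, can be sketched in Python. This is a minimal illustration under stated assumptions, not the algorithm as printed in the paper: the function names, the use of HMAC-SHA256 with separate salts for each decision, the dictionary layout of the relation (primary key mapped to a list of non-negative integer attributes), and the one-sided binomial test in the detector are all choices made for this sketch.

```python
import hashlib
import hmac
from math import comb


def keyed_hash(key: bytes, primary_key: str, salt: bytes) -> int:
    """Keyed hash of a tuple's primary key (HMAC-SHA256 is an assumption)."""
    digest = hmac.new(key, salt + primary_key.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def mark_position(key, primary_key, gamma, nu, xi):
    """Return (attribute index, bit index, bit value), or None if not marked."""
    # Roughly one tuple in gamma is selected for marking.
    if keyed_hash(key, primary_key, b"select") % gamma != 0:
        return None
    attr = keyed_hash(key, primary_key, b"attr") % nu   # which attribute
    bit = keyed_hash(key, primary_key, b"bit") % xi     # which low-order bit
    value = keyed_hash(key, primary_key, b"value") % 2  # what to store there
    return attr, bit, value


def insert_watermark(rows, key, gamma, nu, xi):
    """Return a copy of rows with the key-determined bits forced."""
    marked = {}
    for primary_key, values in rows.items():
        values = list(values)
        position = mark_position(key, primary_key, gamma, nu, xi)
        if position is not None:
            attr, bit, value = position
            values[attr] = (values[attr] & ~(1 << bit)) | (value << bit)
        marked[primary_key] = values
    return marked


def detect_watermark(rows, key, gamma, nu, xi, alpha=0.01):
    """Report a watermark if the bit matches are unlikely to be chance."""
    total = matches = 0
    for primary_key, values in rows.items():
        position = mark_position(key, primary_key, gamma, nu, xi)
        if position is None:
            continue
        attr, bit, value = position
        total += 1
        matches += ((values[attr] >> bit) & 1) == value
    # Chance of seeing at least this many matches if each bit were a coin flip.
    p_value = sum(comb(total, k) for k in range(matches, total + 1)) / 2 ** total
    return total > 0 and p_value < alpha
```

A larger γ marks fewer tuples and so perturbs less data, but leaves the detector with fewer bits to test; a smaller α makes a false claim of ownership less likely at the cost of requiring more matching bits.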

(R 2000, IBM) Privacy-Preserving Data Mining
  • The basic approach to preserving privacy is to
    let users provide a modified value for sensitive
    attributes.
  • Value-Class Membership method: the values for an
    attribute are partitioned into a set of disjoint,
    mutually exclusive classes. Instead of a true
    attribute value, the user provides the interval
    in which the value lies. Discretization is the
    method used most often for hiding individual
    values.
  • Value Distortion method: return a value Zi + r
    instead of Zi, where r is a random value drawn
    from some distribution (uniform or Gaussian).
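
Both methods are simple enough to sketch in a few lines of Python. The function names, the fixed interval width, and the parameters `alpha` (half-width of the uniform noise) and `sigma` (standard deviation of the Gaussian noise) are assumptions made for this illustration.

```python
import random


def value_class(value, lower, width):
    """Value-class membership: return the interval [start, start + width)."""
    index = int((value - lower) // width)
    start = lower + index * width
    return (start, start + width)


def distort_uniform(value, alpha, rng=random):
    """Value distortion with r drawn uniformly from [-alpha, +alpha]."""
    return value + rng.uniform(-alpha, alpha)


def distort_gaussian(value, sigma, rng=random):
    """Value distortion with r drawn from a Gaussian with mean 0."""
    return value + rng.gauss(0.0, sigma)


# An age of 37 is reported either as the interval (30, 40) or as 37 plus noise.
print(value_class(37, lower=0, width=10))
print(distort_uniform(37, alpha=15))
print(distort_gaussian(37, sigma=10))
```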

Reconstructing the original distribution
For the concept of using value distortion to
protect privacy to be useful, we need to be able
to reconstruct the original data distribution
from the randomized data.
Given a cumulative distribution F_Y and the
realizations of n random samples X_1 + Y_1,
X_2 + Y_2, ..., X_n + Y_n, estimate F_X. Use Bayes'
rule to estimate the posterior distribution
function of X_i (given that X_i + Y_i = w_i),
assuming we know the density functions f_X and
f_Y for X and Y respectively.

Reconstruct algorithm
Stop when the reconstructed distribution is
statistically the same as the original
distribution (using, say, the χ² goodness-of-fit
test).
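
The Bayesian reconstruction can be sketched as an iterative update over a histogram. This is an illustration under assumptions, not the slide's missing algorithm figure: the estimate of f_X is discretized into equal-width bins evaluated at their midpoints, the starting estimate is uniform, and the loop runs for a fixed number of iterations instead of applying a goodness-of-fit stopping test. The names `reconstruct` and `uniform_noise_density` are hypothetical.

```python
def uniform_noise_density(half_width):
    """Density f_Y of noise drawn uniformly from [-half_width, +half_width]."""
    def f_y(y):
        return 1.0 / (2 * half_width) if abs(y) <= half_width else 0.0
    return f_y


def reconstruct(w, f_y, lo, hi, bins=20, iterations=50):
    """Estimate the distribution of X from perturbed samples w_i = x_i + y_i.

    Returns (bin midpoints, probability mass per bin).
    """
    width = (hi - lo) / bins
    mids = [lo + (k + 0.5) * width for k in range(bins)]
    p = [1.0 / bins] * bins  # start from a uniform estimate of f_X
    for _ in range(iterations):
        new_p = [0.0] * bins
        for w_i in w:
            # Bayes' rule: posterior over bins for X_i given X_i + Y_i = w_i.
            joint = [f_y(w_i - a) * p_k for a, p_k in zip(mids, p)]
            total = sum(joint)
            if total == 0.0:
                continue  # sample cannot be explained by any bin; skip it
            for k in range(bins):
                new_p[k] += joint[k] / total
        norm = sum(new_p)
        p = [v / norm for v in new_p]  # average of the per-sample posteriors
    return mids, p
```

Note that only the perturbed values w_i and the noise density f_Y enter the computation; the individual x_i are never needed.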

Decision-tree classification over randomized data
Induce the decision tree using the reconstructed data.
  • Global: reconstruct the distribution for each
    attribute once at the beginning, using the
    complete perturbed training data.
  • ByClass: for each attribute, first split the
    training data by class, then reconstruct the
    distributions separately for each class.
  • Local: as in ByClass, for each attribute, split
    the training data by class and reconstruct
    distributions separately for each class. However,
    instead of doing reconstruction only once,
    reconstruction is done at each node. To avoid
    over-fitting, reconstruction is stopped once the
    number of records belonging to a node becomes
    small.
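
The difference between Global and ByClass is only in which records are handed to the reconstruction step. A minimal sketch, assuming records are dictionaries and that `reconstruct` is any callable taking a list of perturbed values (both the record layout and the function names are hypothetical):

```python
from collections import defaultdict


def reconstruct_global(records, attribute, reconstruct):
    """Global: one reconstruction per attribute over all training records."""
    return reconstruct([record[attribute] for record in records])


def reconstruct_by_class(records, attribute, class_label, reconstruct):
    """ByClass: split by class first, then reconstruct each class separately."""
    groups = defaultdict(list)
    for record in records:
        groups[record[class_label]].append(record[attribute])
    return {label: reconstruct(values) for label, values in groups.items()}
```

Local would call the same per-class routine again at every tree node, restricted to the records that reach that node.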
The experiments compare the classification accuracy
of the Global, ByClass, and Local algorithms against
each other and with respect to the following
benchmarks:
  • Original: the result of inducing the classifier
    on unperturbed training data without
    randomization.
  • Randomized: the result of inducing the
    classifier on perturbed data, but without making
    any corrections for randomization.
Goal: we want to come as close to Original in
accuracy as possible. The accuracy gain over
Randomized reflects the advantage of
reconstruction.
Experimental results
(R 2000) Conclusion
  • For the specific case of decision-tree
    classification, they found two effective
    algorithms, ByClass and Local.
  • The algorithms rely on a Bayesian procedure for
    correcting perturbed distributions.
  • They emphasize that they reconstruct
    distributions, not individual records, thus
    preserving privacy of individual records.

(S 2002) Maintaining Data Privacy in Association Rule Mining
  • How can the conflicting goals of privacy and
    accuracy be supported while mining association
    rules on so-called market-basket databases?
  • Assume a market-basket database in which each
    tuple is a fixed-length sequence of 1s and 0s.
  • Assume that the overall number of 1s is
    significantly smaller than the number of 0s.

The approach to achieving privacy is to distort the
user data before it is subjected to the mining
process. The privacy metric is: with what
probability can a given 1 or 0 in the true matrix
be reconstructed?
Distortion Procedure
Reconstruction Probability of a 1
Round trip: go from the true database to the
distorted database, and then return to guess the
contents of the true database.
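
The round trip can be made concrete with a short Python sketch. It assumes a distortion procedure in which each bit is kept with probability p and flipped with probability 1 - p, and an item whose true support (fraction of 1s) is strictly between 0 and 1; the function names are hypothetical, and the formula is the Bayes' rule computation for this assumed procedure.

```python
import random


def distort_row(bits, p, rng=random):
    """Keep each bit with probability p, flip it with probability 1 - p."""
    return [b if rng.random() < p else 1 - b for b in bits]


def reconstruction_probability_of_one(support, p):
    """Probability that a true 1 is guessed back as 1 after the round trip.

    support is the fraction of 1s for the item in the true database.
    """
    # Posterior that the true bit is 1, given the distorted bit is 1 or 0.
    given_seen_one = support * p / (support * p + (1 - support) * (1 - p))
    given_seen_zero = support * (1 - p) / (support * (1 - p) + (1 - support) * p)
    # A true 1 shows up as 1 with probability p and as 0 with probability 1 - p.
    return p * given_seen_one + (1 - p) * given_seen_zero


# With sparse data (1% support), compare a few distortion settings.
for p in (0.5, 0.9, 1.0):
    print(p, reconstruction_probability_of_one(0.01, p))
```

At p = 0.5 the distorted bit carries no information, so the reconstruction probability falls to the item's support; at p = 1 nothing is distorted and every 1 is recovered.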
Privacy measure
Define user privacy
Mining the distorted database (Estimating singleton supports)
The miner is provided with the distorted database D
and the distortion probability p. How can the
supports in the true database T be estimated?
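
For a single item, the estimate follows from the expected number of 1s after distortion. Assuming each bit is kept with probability p and flipped with probability 1 - p (with p not equal to 0.5), a column of n bits with c true 1s is expected to show p·c + (1 - p)·(n - c) ones in D, and solving for c gives the estimator below. The function name is hypothetical.

```python
def estimate_true_support(distorted_column, p):
    """Estimate the fraction of 1s in T from one item's column in D."""
    n = len(distorted_column)
    ones_seen = sum(distorted_column)
    # Invert E[ones_seen] = p * c + (1 - p) * (n - c) to solve for c.
    ones_true = (ones_seen - (1 - p) * n) / (2 * p - 1)
    return ones_true / n


# 180 ones observed among 1000 tuples at p = 0.9 suggests a true support
# of about 10%.
print(estimate_true_support([1] * 180 + [0] * 820, p=0.9))
```

The division by 2p - 1 explains why the estimate becomes noisy as p approaches 0.5: small sampling fluctuations in the observed count are amplified.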
Experimental Results
  • The experiments indicate that by a careful
    choice of distortion probability, it is possible
    to simultaneously achieve satisfactory privacy
    and accuracy.
  • There is a small window of opportunity around
    the p = 0.9 value where these dual goals can be
    met.
  • Moving away from this window towards lower
    values of p, however, results in skyrocketing
    errors, while increasing the value of p
    results in significant loss of privacy.

Happy Halloween!
  • Thank You
  • Min Wu
  • Biomedical Informatics Program
  • BME Dept. at UNC-CH