Title: Technology Papers Presentation in CS290: Privacy Preserving Data Mining
Slide 1: Technology Papers Presentation in CS290: Privacy Preserving Data Mining
Min Wu, Biomedical Engineering Dept. at UNC-CH, 10/31/2002
Slide 2: Outline
- Data privacy vs. data mining
- How to protect data privacy in data mining?
- Watermarking relational databases
  - Rakesh Agrawal, VLDB 2002 (IBM)
- Privacy-preserving data mining
  - Rakesh Agrawal, SIGMOD 2000, pp. 439-450 (IBM)
- Maintaining data privacy in association rule mining
  - Shariq J. Rizvi, VLDB 2002 (IIT)
Slide 3: Data Privacy vs. Data Mining
- The primary assets of many companies are their databases of information.
- Privacy issues are further exacerbated now that the World Wide Web makes it easy for new data to be automatically collected and added to databases.
- Data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse.
Slide 4: Data Privacy vs. Data Mining
- Privacy and accuracy are typically contradictory in nature. Fortunately, since the purpose of data mining is essentially to identify statistical trends, 100% accuracy in the mining results is often not required.
- Can we develop accurate models of aggregate data without access to the precise information in individual data records?
- The intentional errors introduced for data privacy (marks, data perturbation, distortion) must not have a significant impact on the usefulness of the data.
Slide 5: How to Protect Data Privacy in Data Mining?
- (R 2002, IBM) Insertion and detection of digital watermarks in relational databases.
- (R 2000, IBM) Using randomizing functions for data perturbation, in the case of decision-tree classification.
- (S 2002, IIT) Using probabilistic data distortion in association rule mining.
Slide 6: (R 2002, IBM) Watermarking Relational Databases
- Assumes that the marked (numeric) attributes can tolerate changes in some of their values.
- The basic idea is to ensure that certain bit positions of certain attributes of certain tuples contain specific values, determined under the control of a private key known only to the owner of the relation. This bit pattern constitutes the watermark.
- Only with access to the private key can the watermark be detected with high probability.
Slide 7: Insert Watermark Algorithm
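The algorithm figure from this slide is not reproduced here. Below is a minimal Python sketch of the insertion idea under the paper's assumptions: a keyed hash of each tuple's primary key determines whether the tuple is marked, which attribute and which least significant bit are used, and which bit value is embedded. The field names ("pk", "attr0".."attr{nu-1}") are illustrative, and deriving all choices from a single hash value is a simplification of the paper's construction.

import hashlib
import hmac

def mac(key: bytes, pk: str) -> int:
    # Keyed hash (MAC) of a tuple's primary key; it drives every marking decision.
    return int.from_bytes(hmac.new(key, pk.encode(), hashlib.sha256).digest(), "big")

def insert_watermark(rows, key, gamma, nu, xi):
    # rows: dicts with a primary key "pk" and integer attributes "attr0".."attr{nu-1}".
    # Roughly 1 in gamma tuples is marked.
    for r in rows:
        f = mac(key, r["pk"])
        if f % gamma == 0:                      # this tuple is selected for marking
            a = (f // gamma) % nu               # which attribute to mark
            b = (f // (gamma * nu)) % xi        # which least significant bit to set
            bit = (f // (gamma * nu * xi)) % 2  # the bit value to embed
            name = f"attr{a}"
            r[name] = (r[name] & ~(1 << b)) | (bit << b)
    return rows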
Slide 8: Detect Watermark Algorithm
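Correspondingly, a sketch of detection: recompute the same keyed selections, count how many of the inspected bits match the expected values, and test significance at level alpha. The binomial tail test below is one natural choice, not necessarily the paper's exact test; it reuses mac() from the insertion sketch.

from scipy.stats import binom

def detect_watermark(rows, key, gamma, nu, xi, alpha=0.01):
    total = matches = 0
    for r in rows:
        f = mac(key, r["pk"])
        if f % gamma == 0:
            a = (f // gamma) % nu
            b = (f // (gamma * nu)) % xi
            bit = (f // (gamma * nu * xi)) % 2
            total += 1
            if (r[f"attr{a}"] >> b) & 1 == bit:
                matches += 1
    # Without the private key, each inspected bit matches with probability 1/2;
    # claim the watermark is present only if the match count is improbably high.
    return total > 0 and binom.sf(matches - 1, total, 0.5) < alpha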
Slide 9: Design Trade-offs
- There are four important tunable parameters:
  - i) α, the test significance level;
  - ii) γ, the gap parameter that determines the fraction of tuples marked (roughly 1/γ);
  - iii) ν, the number of attributes in the relation available for marking;
  - iv) ξ, the number of least significant bits available for marking.
Slide 10: (R 2000, IBM) Privacy-Preserving Data Mining
- The basic approach to preserving privacy is to let users provide a modified value for sensitive attributes.
- Value-class membership method: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes. Instead of a true attribute value, the user provides the interval in which the value lies. Discretization is the method most often used for hiding individual values.
- Value distortion method: return a value x_i + r instead of x_i, where r is a random value drawn from some distribution (uniform or Gaussian). A sketch follows below.
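A minimal sketch of the value distortion step (function and parameter names are illustrative; the choice and scale of the distribution are left to the data collector):

import numpy as np

def distort(values, method="uniform", scale=0.5):
    # Return x_i + r for each sensitive value x_i.
    # "uniform" draws r from [-scale, +scale]; "gaussian" draws r from N(0, scale).
    rng = np.random.default_rng()
    values = np.asarray(values, dtype=float)
    if method == "uniform":
        r = rng.uniform(-scale, scale, size=values.shape)
    else:
        r = rng.normal(0.0, scale, size=values.shape)
    return values + r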
Slide 11: Reconstructing the Original Distribution
For the concept of using value distortion to protect privacy to be useful, we need to be able to reconstruct the original data distribution from the randomized data.

Given the cumulative distribution F_Y and the realizations of n random samples X_1 + Y_1, X_2 + Y_2, ..., X_n + Y_n, estimate F_X. Use Bayes' rule to estimate the posterior distribution function F'_X for X_i (given that X_i + Y_i = w_i), assuming we know the density functions f_X and f_Y for X and Y, respectively.
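Averaging the sample-wise posteriors gives the estimate derived in the SIGMOD 2000 paper:

F'_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{\int_{-\infty}^{a} f_Y(w_i - z)\, f_X(z)\, dz}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}

Differentiating with respect to a yields the corresponding density estimate f'_X. Since the true f_X is unknown, it is bootstrapped with the uniform density and the estimate is iterated, which leads to the algorithm on the next slide.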
Slide 12: Reconstruction Algorithm
Stop when the reconstructed distribution is statistically the same as the original distribution (using, say, the χ² goodness-of-fit test).
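A minimal sketch of the discretized iterative reconstruction, assuming Gaussian noise and stopping on the change between successive iterates rather than the χ² test (bin midpoints and parameter names are illustrative):

import numpy as np
from scipy.stats import norm

def reconstruct(w, bin_mids, noise_sigma, iters=100, tol=1e-4):
    # Estimate Pr(X in each bin) from observations w_i = x_i + y_i.
    n, m = len(w), len(bin_mids)
    px = np.full(m, 1.0 / m)                  # bootstrap with the uniform distribution
    for _ in range(iters):
        # lik[i, t] = f_Y(w_i - midpoint_t) for Gaussian noise Y
        lik = norm.pdf(np.subtract.outer(np.asarray(w), np.asarray(bin_mids)),
                       scale=noise_sigma)
        post = lik * px                        # unnormalized posterior per sample
        post /= post.sum(axis=1, keepdims=True)
        new_px = post.mean(axis=0)             # average the posteriors over all samples
        if np.abs(new_px - px).max() < tol:
            break
        px = new_px
    return px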
Slide 13: Decision-Tree Classification over Randomized Data
Induce the decision tree using the reconstructed data. (A sketch of the ByClass variant follows below.)
- Global: reconstruct the distribution for each attribute once at the beginning, using the complete perturbed training data.
- ByClass: for each attribute, first split the training data by class, then reconstruct the distributions separately for each class.
- Local: as in ByClass, for each attribute, split the training data by class and reconstruct the distributions separately for each class. However, instead of doing reconstruction only once, reconstruction is done at each node of the tree. To avoid over-fitting, reconstruction is stopped once the number of records belonging to a node becomes small.
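A schematic of the ByClass variant, reusing the reconstruct() sketch above (array layouts and names are illustrative):

import numpy as np

def byclass_distributions(perturbed, labels, bin_mids, noise_sigma):
    # Reconstruct one distribution per (class, attribute) pair before tree induction.
    out = {}
    for c in np.unique(labels):
        rows = perturbed[labels == c]
        for j in range(perturbed.shape[1]):
            out[(c, j)] = reconstruct(rows[:, j], bin_mids, noise_sigma)
    return out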
Slide 14: Experiments
Compare the classification accuracy of the Global, ByClass, and Local algorithms against each other and with respect to the following benchmarks:
- Original: the result of inducing the classifier on unperturbed training data, without randomization.
- Randomized: the result of inducing the classifier on perturbed data, without making any corrections for randomization.
Goal: come as close to Original in accuracy as possible. The accuracy gain over Randomized reflects the advantage of reconstruction.
Slide 15: Experimental Results
Slide 16: (R 2000) Conclusion
- For the specific case of decision-tree classification, they found two effective algorithms, ByClass and Local.
- The algorithms rely on a Bayesian procedure for correcting perturbed distributions.
- They emphasize that they reconstruct distributions, not individual records, thus preserving the privacy of individual records.
Slide 17: (S 2002) Maintaining Data Privacy in Association Rule Mining
- How can the conflicting goals of privacy and accuracy both be supported while mining association rules on so-called market-basket databases?
- Assume a market-basket database in which each tuple is a fixed-length sequence of 1s and 0s.
- Assume that the overall number of 1s is significantly smaller than the number of 0s.
The approach to achieving privacy is to distort the user data before it is subjected to the mining process. The privacy metric: with what probability can a given 1 or 0 in the true matrix be reconstructed?
Slide 18: Distortion Procedure
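The slide's figure is not reproduced here; a minimal sketch of the distortion under the stated assumptions is that each bit of the true matrix is kept with probability p and complemented with probability 1 - p:

import numpy as np

def distort_matrix(T, p, seed=None):
    # T: 0/1 market-basket matrix; each entry is kept with probability p
    # and flipped with probability 1 - p.
    rng = np.random.default_rng(seed)
    keep = rng.random(T.shape) < p
    return np.where(keep, T, 1 - T)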
Slide 19: Reconstruction Probability of a 1
Round trip: go from the true database to the distorted database, and then return to guess the contents of the true database.
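Restating the round trip with Bayes' rule (a sketch; s denotes the fraction of 1s in the true matrix, and the exact form should be checked against the paper): a true 1 survives distortion with probability p, and the posterior probability that a distorted 1 (or 0) originated from a true 1 weights the return trip, giving

R_1(p) = p \cdot \frac{s\,p}{s\,p + (1-s)(1-p)} + (1-p) \cdot \frac{s\,(1-p)}{s\,(1-p) + (1-s)\,p}

Because 1s are rare (s is small), a 1 is hard to reconstruct unless p is close to 1.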
Slide 20: Privacy Measure
Define user privacy.
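The slide's definition is not reproduced. One natural formalization, consistent with the preceding slide and offered here only as an assumption, is to measure the privacy of a 1 as the probability that it cannot be reconstructed:

P(p) = (1 - R_1(p)) \times 100\%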
Slide 21: Mining the Distorted Database (Estimating Singleton Supports)
The miner is provided with the distorted database D and the distortion probability p. How can the supports in the true database T be estimated?
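A minimal sketch of singleton-support estimation: for each item, the observed counts of 1s and 0s are a linear mixture of the true counts, so the true counts can be recovered by inverting the 2x2 mixing matrix (names are illustrative; p must differ from 0.5 for the matrix to be invertible):

import numpy as np

def estimate_singleton_supports(D, p):
    # D: distorted 0/1 matrix (rows = customers, columns = items).
    # Observed counts c_D = M @ c_T with M = [[p, 1-p], [1-p, p]], so c_T = M^-1 @ c_D.
    n = D.shape[0]
    M = np.array([[p, 1.0 - p], [1.0 - p, p]])
    ones_D = D.sum(axis=0)                   # observed count of 1s per item
    c_D = np.stack([ones_D, n - ones_D])     # observed [1s; 0s] per item
    c_T = np.linalg.inv(M) @ c_D             # estimated true [1s; 0s] per item
    return c_T[0] / n                        # estimated singleton supports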
Slide 22: Experimental Results
- The experiments indicate that, by a careful choice of the distortion probability, it is possible to simultaneously achieve satisfactory privacy and accuracy.
- There is a small window of opportunity around the value p = 0.9 where these dual goals can be met.
- Moving away from this window towards lower values of p results in skyrocketing errors, while increasing the value of p results in a significant loss of privacy.
Slide 23: Happy Halloween!
- Thank you
- Min Wu
- Biomedical Informatics Program
- BME Dept. at UNC-CH