1
Privacy-Preserving Databases and Data Mining

Yücel SAYGIN
ysaygin@sabanciuniv.edu
http://people.sabanciuniv.edu/ysaygin/

2
Privacy and data mining

There are two aspects of data mining when we look at it from a privacy perspective:
Being able to mine the data without seeing the actual data
Protecting the privacy of people against the misuse of data

3
How can we protect the sensitive knowledge
against data mining?

Types of sensitive knowledge that could be extracted via data mining techniques are:
Patterns (association rules, sequences)
Clusters that describe the data
Classification models for prediction

4
Association Rule Hiding

Large amounts of customer transaction data are collected in supermarket chains to find association rules in customer buying patterns.
A lot of research has been conducted on finding association rules efficiently, and tools have been developed.
Association rule hiding algorithms are deterministic with given support and confidence thresholds.
Therefore association rules are a good starting point.

5
Motivating examples

Sniffing Prozac users

6
Association Rule Hiding

Rules: Body → Head
Ex1: Diaper → Beer
Ex2: Internetworking with TCP/IP → Interconnections: bridges, routers, ...
Parameters: (support, confidence)
Minimum support and confidence thresholds are used to prune the non-significant rules.
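As a minimal sketch of how the two parameters are computed for a rule Body → Head: support is the fraction of transactions containing both sides, and confidence is that support divided by the support of the body. The basket data and the threshold values below are invented for illustration and do not come from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, body, head):
    """support(body and head together) divided by support(body)."""
    both = set(body) | set(head)
    return support(transactions, both) / support(transactions, body)


# Hypothetical toy basket data, invented for this illustration.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})

# Assumed thresholds; a rule below either one is pruned as non-significant.
min_support, min_confidence = 0.5, 0.6
is_significant = rule_support >= min_support and rule_confidence >= min_confidence
```

Here the rule diaper → beer appears in 2 of 4 baskets and in 2 of the 3 baskets that contain diaper, so it survives both assumed thresholds.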

8
Algorithms for Rule Hiding

What we try to achieve is:
Let D be the source database.
Let R be the set of significant association rules that are mined from D with certain thresholds.
Let r_i be a sensitive rule in R.
Transform D into D' so that all rules in R except r_i can still be mined from D'.
It was proven that optimal hiding of association rules with minimal side effects is NP-hard.

19
Classification model as a threat to privacy
21
Another Motivating Application

Suppose a set of attribute values is confidential and is therefore downgraded by replacing the actual values with unknown values before the data is released.
Can someone build a classification model using the rest of the attributes to predict the hidden values?

23
Mining the data without actually seeing it

Things that we need to consider are:
Data type
Data mining technique
Data distribution: centralized, or distributed (vertically or horizontally)

24
Classification on perturbed data

Reference: Rakesh Agrawal and Ramakrishnan Srikant. Privacy-Preserving Data Mining. SIGMOD, 2000, Dallas, TX.
They developed a technique for constructing a classification model on perturbed data.
The data is assumed to be stored in a centralized database.
It is outsourced to a third party for mining, therefore the confidential values need to be handled.
The following slides are based on the slides by the authors of the paper above.

25
Reconstruction Problem

Original values $x_1, x_2, \dots, x_n$ from probability distribution $X$ (unknown).
To hide these values, we use $y_1, y_2, \dots, y_n$ from probability distribution $Y$.
Given
$x_1 + y_1, x_2 + y_2, \dots, x_n + y_n$
the probability distribution of $Y$
Estimate the probability distribution of $X$.

26
Intuition (Reconstruct single point)

Use Bayes' rule for density functions

27
Intuition (Reconstruct single point)
28
Reconstructing the Distribution

Combine estimates of where point came from for
all the points
Gives estimate of original distribution.

29
Reconstruction Bootstrapping

$f_X^0$ := uniform distribution
j := 0 // iteration number
repeat
    $f_X^{j+1}(a)$ := (update of $f_X^j(a)$ by Bayes' rule)
    j := j + 1
until (stopping criterion met)
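The loop above can be sketched on a discrete grid of candidate values. The update used here averages, over all perturbed observations $w_i = x_i + y_i$, the Bayes posterior $f_Y(w_i - a) f_X^j(a) / \sum_z f_Y(w_i - z) f_X^j(z)$. This is a discretized reading of the Bayes' rule step, not a transcription of the slide, whose formula did not survive. The uniform noise, the grid, the fixed iteration count used as the stopping criterion, and the synthetic ages are all assumptions made for illustration.

```python
import random


def uniform_noise_density(width):
    """Density of noise drawn uniformly from [-width, width]."""
    def f_y(v):
        return 1.0 / (2 * width) if -width <= v <= width else 0.0
    return f_y


def reconstruct(w, f_y, grid, iterations=30):
    """Estimate the distribution of X on `grid` from perturbed values w_i = x_i + y_i."""
    f_x = [1.0 / len(grid)] * len(grid)        # f_X^0: uniform distribution
    for _ in range(iterations):                # assumed stopping criterion: fixed count
        new = [0.0] * len(grid)
        for w_i in w:
            # Bayes' rule: posterior over the grid for this single observation.
            post = [f_y(w_i - a) * p for a, p in zip(grid, f_x)]
            total = sum(post)
            if total == 0.0:
                continue                       # observation cannot be explained on this grid
            for k, value in enumerate(post):
                new[k] += value / total
        norm = sum(new)
        f_x = [v / norm for v in new]          # combine the per-point estimates
    return f_x


# Synthetic data, invented for this sketch: ages perturbed by uniform noise in [-30, 30].
random.seed(0)
ages = [random.choice([25, 30, 35]) for _ in range(500)]
perturbed = [x + random.uniform(-30, 30) for x in ages]
grid = list(range(0, 101, 5))
estimate = reconstruct(perturbed, uniform_noise_density(30), grid)
```

The result `estimate` is a probability vector over `grid`; how closely it matches the original distribution depends on the noise width, the sample size, and the number of iterations.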

30
Shown to work in experiments on large data sets.
31
Algorithms

Global Algorithm: Reconstruct each attribute once at the beginning.
By Class Algorithm: For each attribute, first split by class, then reconstruct separately for each class.
See SIGMOD 2000 paper for details.

32
Experimental Methodology

Compare accuracy against:
Original: unperturbed data without randomization.
Randomized: perturbed data but without making any corrections for randomization.
Test data not randomized.
Synthetic data benchmark.
Training set of 100,000 records, split equally between the two classes.

33
Quantifying Privacy

Add a random value between -30 and +30 to age.
If the randomized value is 60, we know with 90% confidence that age is between 33 and 87.
Interval width ⇒ amount of privacy.
Example: (interval width 54) / (range of age 100) ⇒ 54% randomization level @ 90% confidence.
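The arithmetic behind this example can be checked with a short sketch. It assumes the noise is drawn uniformly from the stated range, so that a 90% confidence interval covers 90% of the noise range; the function name and the range of age (100) are taken as given for illustration.

```python
def privacy_interval(randomized_value, noise_half_width, confidence_percent):
    """Interval containing the original value with the given confidence,
    assuming noise drawn uniformly from [-noise_half_width, +noise_half_width]."""
    half = noise_half_width * confidence_percent / 100
    return randomized_value - half, randomized_value + half


low, high = privacy_interval(60, 30, 90)
interval_width = high - low

# Randomization level: interval width relative to the range of the attribute.
age_range = 100
randomization_level = interval_width / age_range
```

With noise in [-30, +30] and 90% confidence, the half-width is 27, which gives the interval from 33 to 87 and a width of 54.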

34
Privacy Preserving Distributed Data Mining

Consider the case where data is distributed horizontally or vertically across multiple sites.
Each site is autonomous and does not want to share its actual data.
Let's consider the following scenario:
There are multiple hospitals that have their own local databases, and they would like to participate in a scientific study that will analyze the results of treatments for different patients.
The privacy concern here is that a hospital would not like to share a piece of knowledge unless the other site also has it, to protect the privacy of itself and its operation.
Another scenario:
Two bookstores would like to learn what books are sold together so that they can make some offers to their customers (Amazon actually does that).

35
Case study: Association rules

How do we mine association rules from distributed sources while preserving the privacy of the data owners?
The confidential information in this case is:
The data itself
The fact that a local site supports a rule with a certain confidence and a certain support (no company wants to lose its competitive advantage, and no company would like to reveal anything if it will not benefit from the release of the data)
Privacy-preserving distributed association rule mining methods use distributed rule mining techniques.

36
Distributed rule mining

We know how rules are mined from centralized
databases
The distributed scenario is similar
Consider that we have only two sites S1 and S2,
which have databases D1 (with 3 transactions) and
D2 (with 5 transactions)

37
Distributed rule mining

We would like to mine the databases as if they were parts of a single centralized database of 8 transactions.
In order to do this, we need to calculate the local supports.
For example, the local support of A in D1 is 100%.
The local support of the itemset {A,B,C} in D1 is 66%, and the local support of {A,B,C} in D2 is 40%.

38
Distributed rule mining

Assume that the minimum support threshold is 50%; then {A,B,C} is frequent in D1, but it is not frequent in D2.
However, when we assume that the databases are combined, the support of {A,B,C} in D1 ∪ D2 is 50%, which means that an itemset can be locally frequent in one database, not frequent in another database, and still be frequent globally.
In order for an itemset to be frequent globally, it must be frequent in at least one database.
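The numbers in this example can be reproduced with a small sketch. The per-site counts (2 of 3 transactions in D1, 2 of 5 in D2) are inferred from the quoted local supports and database sizes; the variable names are hypothetical.

```python
from fractions import Fraction

# (transactions containing {A,B,C}, database size) per site; counts are
# inferred from the quoted local supports and the database sizes 3 and 5.
local_counts = {"D1": (2, 3), "D2": (2, 5)}
min_support = Fraction(1, 2)

local_support = {site: Fraction(c, n) for site, (c, n) in local_counts.items()}
global_support = Fraction(
    sum(c for c, _ in local_counts.values()),
    sum(n for _, n in local_counts.values()),
)

locally_frequent = {site: s >= min_support for site, s in local_support.items()}
globally_frequent = global_support >= min_support
```

The global support is a size-weighted average of the local supports, so it can never exceed the largest local support. That is why a globally frequent itemset has to be frequent at one site at least.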

39
Distributed rule mining

The algorithm is based on Apriori, which prunes the rules by looking at the support.
Apriori also uses the fact that an itemset is frequent only if all its subsets are frequent.
Therefore only frequent itemsets should be used to generate larger frequent itemsets.

40
Distributed rule mining

The local sites will find their frequent
itemsets.
They will broadcast the frequent itemsets to each
other
Individual sites will count the frequencies of
the itemsets in their local database
They will broadcast the result to every site
Every site can now find globally frequent itemsets

41
Distributed rule mining

Example: 50% minimum support threshold
We start from the singletons and calculate the frequencies of the items.
In D1, A (freq 3), B (freq 2), C (freq 3) are frequent; in D2, A (freq 4), B (freq 3), C (freq 3) are frequent.
The sites broadcast the results to each other, and each site updates the counts of A, B, C by adding the local counts.

42
Distributed rule mining

Example: 50% minimum support threshold
Each site will eliminate the items that are not globally frequent. In this case all of A, B, C are globally frequent.
Now, using the frequent items, each site will generate candidates of size 2, which are {A,B}, {A,C}, {B,C}.
The same steps are then applied to these candidates.
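One round of this (non-private) distributed scheme can be sketched as follows, using the item frequencies and database sizes from the example. The site names, dictionary layout, and function name are assumptions made for illustration.

```python
from itertools import combinations

# Database sizes and local item frequencies from the example.
site_sizes = {"S1": 3, "S2": 5}
local_item_counts = {
    "S1": {"A": 3, "B": 2, "C": 3},
    "S2": {"A": 4, "B": 3, "C": 3},
}
min_support = 0.5


def global_counts(local_counts):
    """Add up the counts every site has broadcast."""
    totals = {}
    for counts in local_counts.values():
        for item, count in counts.items():
            totals[item] = totals.get(item, 0) + count
    return totals


total_transactions = sum(site_sizes.values())
totals = global_counts(local_item_counts)

# Keep only the globally frequent items.
frequent_items = sorted(
    item for item, count in totals.items()
    if count / total_transactions >= min_support
)

# Apriori step: build size-2 candidates from frequent items only.
candidates = [set(pair) for pair in combinations(frequent_items, 2)]
```

The same counting and pruning would then be repeated for the size-2 candidates, and so on for larger itemsets.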

43
Now we would like to do the same thing but
preserve the privacy of the individual sites

The basic notions we need for that are:
Commutative encryption
Secure multi-party computation
An encryption is commutative if the following two properties hold for any given feasible encryption keys $K_1, K_2, \dots, K_n$, any message $M$, and any two permutations $i, j$ of the key indices:
$E_{K_{i_1}}(\dots E_{K_{i_n}}(M) \dots) = E_{K_{j_1}}(\dots E_{K_{j_n}}(M) \dots)$
For different messages $M_1$ and $M_2$, the probability of a collision (both encrypting to the same value) is very low.
RSA is a well-known encryption technique that is commutative when all parties share the same modulus.

44
A simple application of commutative encryption

Assume that person A has salary S1, and person B has salary S2.
How can they find out whether their salaries are equal to each other (without revealing their salaries)?
Assume that A and B have their own encryption keys, say K1 and K2. And we go from there!
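One way the comparison could proceed is sketched below. The slide does not spell out the steps, so this is an assumed protocol: each person encrypts their own salary, the two exchange ciphertexts, each encrypts the other's ciphertext again, and the doubly encrypted values are compared. The cipher is modular exponentiation, $E_K(M) = M^K \bmod P$, chosen here only because it is commutative; the modulus and helper names are assumptions, and the parameters are toy-sized.

```python
import math
import secrets

P = 2**127 - 1  # a known Mersenne prime; toy-sized modulus, for illustration only


def make_key():
    """Pick an exponent coprime to P - 1 so that encryption is a permutation."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def encrypt(key, message):
    """E_K(M) = M^K mod P; commutative because (M^a)^b = (M^b)^a mod P."""
    return pow(message, key, P)


def salaries_equal(s1, s2):
    """Compare two salaries (integers in 1..P-1) without exchanging them in the clear."""
    k1, k2 = make_key(), make_key()     # private keys of A and B
    a_to_b = encrypt(k1, s1)            # A sends E_K1(S1) to B
    b_to_a = encrypt(k2, s2)            # B sends E_K2(S2) to A
    double_a = encrypt(k1, b_to_a)      # A computes E_K1(E_K2(S2))
    double_b = encrypt(k2, a_to_b)      # B computes E_K2(E_K1(S1))
    return double_a == double_b         # equal exactly when S1 == S2


same = salaries_equal(52000, 52000)
different = salaries_equal(52000, 61000)
```

Because both exponents are coprime to P - 1, double encryption is a one-to-one mapping, so the two final values match only when the salaries match.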

45
Distributed PP Association Rule Mining

For distributed association rule mining, each site needs to distribute its locally frequent itemsets to the rest of the sites.
Instead of circulating the actual itemsets, the encrypted versions are circulated.
Example:
S1 contains A, S2 contains B, S3 contains A. Each of them has its own key: K1, K2, K3.
At the end of step 1, all sites will have items encrypted by all sites.
The encrypted items are then passed to a common site to eliminate the duplicates and to start decryption. This way they will not know who has sent which item.
Decryption can now start, and after every site has finished decrypting, they will have the actual items.
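The example with S1, S2, and S3 can be simulated in a single process as a rough sketch. It assumes a modular-exponentiation cipher as the commutative encryption (the slides do not fix one), a toy modulus, and a reversible byte encoding of item names; all helper names are hypothetical. In a real run each key would stay at its own site and ciphertexts would travel between sites.

```python
import math
import secrets

P = 2**127 - 1  # a known Mersenne prime; toy-sized modulus, for illustration only


def make_key():
    """Exponent coprime to P - 1, so that a decryption exponent exists."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def encode(item):
    """Map a short item name to an integer in 1..P-1."""
    value = int.from_bytes(item.encode("utf-8"), "big")
    if not 0 < value < P:
        raise ValueError("item too long for this toy modulus")
    return value


def decode(value):
    return value.to_bytes((value.bit_length() + 7) // 8, "big").decode("utf-8")


site_items = {"S1": ["A"], "S2": ["B"], "S3": ["A"]}
names = list(site_items)
keys = {name: make_key() for name in names}

# Step 1: every item is encrypted by all sites, starting with its owner.
fully_encrypted = []
for index, owner in enumerate(names):
    for item in site_items[owner]:
        cipher = encode(item)
        for name in names[index:] + names[:index]:
            cipher = pow(cipher, keys[name], P)
        fully_encrypted.append(cipher)

# Step 2: a common site removes duplicates without seeing the items.
unique = set(fully_encrypted)

# Step 3: every site removes its own layer of encryption.
decrypted = []
for cipher in unique:
    for name in names:
        cipher = pow(cipher, pow(keys[name], -1, P - 1), P)
    decrypted.append(decode(cipher))

union = sorted(decrypted)
```

The two copies of A are encrypted in different key orders but end up as the same ciphertext, which is what lets the common site drop the duplicate. The modular inverse via `pow(k, -1, m)` needs Python 3.8 or later.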

46
Distributed PP Association Rule Mining

Now we need to see if the global support of an item is larger than the threshold.
But we do not want to reveal the supports, since the support of an item is assumed to be confidential.
A secure multi-party computation technique is utilized for this.
Assume that there are three sites, each of them has {A,B,C}, its frequency is 5 in S1 (out of 100 transactions), 6 in S2 (out of 200), and 20 in S3 (out of 300), and the minimum support is 5%.
S1 selects a random number, say 17.
S1 adds the difference 5 - 5% x 100 = 0 to 17 and sends the result (17) to S2.
S2 adds 6 - 5% x 200 = -4 to 17 and sends the result (13) to S3.
S3 adds 20 - 5% x 300 = 5 to 13 and sends the result (18) back to S1.
18 > the chosen random number (17), so {A,B,C} is globally frequent.
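The three steps can be replayed with a short sketch. It uses a database size of 200 for S2, as the arithmetic in the steps implies, and a minimum support of 5%. The function name is hypothetical, and the small plain integer used as the random mask simply mirrors the slide; it is a simplification, not a secure choice.

```python
from fractions import Fraction


def masked_support_check(sites, min_support, mask):
    """sites: list of (count, db_size) in ring order; the first site chose `mask`.

    Each site adds its local excess support (count - min_support * db_size)
    to the running value and forwards it, so no site sees another's count.
    """
    running = Fraction(mask)
    forwarded = []
    for count, db_size in sites:
        running += count - min_support * db_size
        forwarded.append(running)
    # The first site compares the final value with its own random mask.
    return running >= mask, forwarded


sites = [(5, 100), (6, 200), (20, 300)]
is_frequent, forwarded = masked_support_check(sites, Fraction(5, 100), mask=17)
# forwarded is [17, 13, 18]; 18 >= 17, so {A,B,C} is globally frequent
```

The final value exceeds the mask by exactly the total excess support (0 - 4 + 5 = 1), which is non-negative precisely when the combined support reaches the threshold.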

47
Distributed PP Association Rule Mining

This technique assumes a semi-honest model, where each party follows the rules of the protocol using its correct input, but is free to later use what it sees during execution of the protocol to compromise security.
The cost of encryption is the key issue, since it is heavily used in this method.
