Privacy-Preserving Databases and Data Mining


1
Privacy-Preserving Databases and Data Mining
  • Yücel SAYGIN
  • ysaygin@sabanciuniv.edu
  • http://people.sabanciuniv.edu/ysaygin/

2
Privacy and data mining
  • There are two aspects of data mining when we look
    at it from a privacy perspective
  • Being able to mine the data without seeing the
    actual data
  • Protecting the privacy of people against the
    misuse of data

3
How can we protect the sensitive knowledge
against data mining?
  • Types of sensitive knowledge that could be
    extracted via data mining techniques are
  • Patterns (Association rules, sequences)
  • Clusters that describe the data
  • Classification models for prediction

4
Association Rule Hiding
  • Large amounts of customer transaction data are
    collected in supermarket chains to find
    association rules in customer buying patterns
  • A lot of research has been conducted on finding
    association rules efficiently, and tools have
    been developed.
  • Association rule hiding algorithms are
    deterministic with given support and confidence
    thresholds
  • Therefore association rules are a good starting
    point.

5
Motivating examples
  • Sniffing Prozac users

6
Association Rule Hiding
  • Rules: Body → Head
  • Ex1: Diaper → Beer
  • Ex2: Internetworking with TCP/IP →
    Interconnections: bridges, routers, ...
  • Parameters: (support, confidence)
  • Minimum support and confidence thresholds are
    used to prune the non-significant rules
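
To make the two parameters concrete, here is a minimal Python sketch of how support and confidence could be computed for a rule Body → Head. The function names and the toy baskets are assumptions made for illustration; they are not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, body, head):
    """support(body ∪ head) / support(body); assumes the body occurs at least once."""
    return support(transactions, set(body) | set(head)) / support(transactions, body)


# Hypothetical toy baskets, invented for this example only.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

rule_support = support(transactions, {"diaper", "beer"})          # 2 of 4 baskets
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})  # 2 of 3 diaper baskets
```

A rule is kept only if both values reach the chosen minimum thresholds.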

8
Algorithms for Rule Hiding
  • What we try to achieve is
  • Let D be the source database
  • Let R be the set of significant association rules
    that are mined from D with certain thresholds
  • Let ri be a sensitive rule in R
  • Transform D into D' so that all rules in R except
    ri can still be mined from D'
  • It was proven that optimal hiding of association
    rules with minimal side effects is NP-Hard

19
Classification model as a threat to privacy
21
Another Motivating Application
  • Suppose a set of confidential attribute values
    is downgraded before release by replacing the
    actual values with unknown values.
  • Can someone build a classification model using
    the rest of the attributes to predict the hidden
    value?

23
Mining the data without actually seeing it
  • Things that we need to consider are
  • Data type
  • Data mining technique
  • Data distribution
  • Centralized
  • Distributed (vertically or horizontally)

24
Classification on perturbed data
  • Reference: Rakesh Agrawal and Ramakrishnan
    Srikant. Privacy-Preserving Data Mining.
    SIGMOD, 2000, Dallas, TX.
  • They developed a technique for constructing a
    classification model on perturbed data.
  • The data is assumed to be stored in a centralized
    database
  • And it is outsourced to a third party for mining,
    therefore the confidential values need to be
    handled
  • The following slides are based on the slides by
    the authors of the paper above

25
Reconstruction Problem
  • Original values x1, x2, ..., xn
  • from probability distribution X (unknown)
  • To hide these values, we use y1, y2, ..., yn
  • from probability distribution Y
  • Given
  • x1+y1, x2+y2, ..., xn+yn
  • the probability distribution of Y
  • Estimate the probability distribution of X.

26
Intuition (Reconstruct single point)
  • Use Bayes' rule for density functions

27
Intuition (Reconstruct single point)
28
Reconstructing the Distribution
  • Combine estimates of where point came from for
    all the points
  • Gives estimate of original distribution.

29
Reconstruction Bootstrapping
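  • $f_X^{0}$ := Uniform distribution
  • j := 0 // Iteration number
  • repeat
  • $f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^{j}(z)\, dz}$
    (Bayes' rule)
  • j := j+1
  • until (stopping criterion met)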
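
The iteration can be sketched in Python on a discretized domain. This is only an illustration under stated assumptions: the density of X is approximated by probabilities on a fixed list of bin midpoints, the stopping criterion is replaced by a fixed number of iterations, and the ages, noise range, and function names are hypothetical.

```python
import random


def reconstruct(perturbed, noise_pdf, bins, iterations=50):
    """Estimate P(X = bin) from perturbed values w_i = x_i + y_i.

    `noise_pdf(d)` is the known density of Y; `bins` are midpoints that
    stand in for the continuous domain of X (a simplifying assumption).
    """
    k = len(bins)
    fx = [1.0 / k] * k                      # f_X^0: uniform distribution
    for _ in range(iterations):             # fixed count replaces the stopping test
        new = [0.0] * k
        for w in perturbed:
            # Bayes' rule: posterior weight of each bin given this perturbed value
            post = [noise_pdf(w - a) * fx[j] for j, a in enumerate(bins)]
            total = sum(post)
            if total == 0:
                continue                    # value not explainable by any bin
            for j in range(k):
                new[j] += post[j] / total
        norm = sum(new)
        fx = [v / norm for v in new]
    return fx


random.seed(0)
# Hypothetical data: 150 people aged 25 and 50 people aged 75.
true_ages = [25] * 150 + [75] * 50
perturbed = [x + random.uniform(-10, 10) for x in true_ages]

bins = [5 + 10 * i for i in range(10)]      # midpoints 5, 15, ..., 95


def uniform_noise_pdf(d):
    """Density of Y, uniform on [-10, 10]."""
    return 1 / 20 if abs(d) <= 10 else 0.0


estimate = reconstruct(perturbed, uniform_noise_pdf, bins)
```

With this toy input the estimated mass concentrates on the bins at 25 and 75 in roughly the original 3:1 proportion, even though no individual age is observed directly.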

30
Shown to work in experiments on large data sets.
31
Algorithms
  • Global Algorithm
  • Reconstruct for each attribute once at the
    beginning
  • By Class Algorithm
  • For each attribute, first split by class, then
    reconstruct separately for each class.
  • See SIGMOD 2000 paper for details.

32
Experimental Methodology
  • Compare accuracy against
  • Original unperturbed data without randomization.
  • Randomized perturbed data but without making any
    corrections for randomization.
  • Test data not randomized.
  • Synthetic data benchmark.
  • Training set of 100,000 records, split equally
    between the two classes.

33
Quantifying Privacy
  • Add a random value between -30 and 30 to age.
  • If randomized value is 60
  • know with 90% confidence that age is between 33
    and 87.
  • Interval width ⇒ amount of privacy.
  • Example: (Interval Width: 54) / (Range of Age:
    100) ⇒ 54% randomization level @ 90% confidence
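
The arithmetic behind this example can be checked with a short Python sketch. It assumes the noise is uniform on [-30, 30] and that the confidence interval is taken as the central part of the possible range around the randomized value; the function name is hypothetical.

```python
def privacy_interval(randomized_value, noise_half_width, confidence):
    """Central interval that contains the true value with the given confidence,
    assuming noise drawn uniformly from [-noise_half_width, +noise_half_width]."""
    full_width = 2 * noise_half_width       # all possible true values span this width
    width = confidence * full_width         # central share of that span
    low = randomized_value - width / 2
    high = randomized_value + width / 2
    return low, high, width


low, high, width = privacy_interval(60, 30, 0.90)
age_range = 100
randomization_level = width / age_range     # interval width relative to the range of age
```

Here the possible true ages span 30 to 90 (width 60), and the central 90% of that span has width 54, which matches the interval from 33 to 87 on the slide.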

34
Privacy Preserving Distributed Data Mining
  • Consider the case where data is distributed
    horizontally or vertically to multiple sites.
  • Each site is autonomous and does not want to
    share their actual data
  • Let's consider the following scenario
  • There are multiple hospitals that have their own
    local database,
  • and they would like to participate in a
    scientific study that will analyze the results of
    treatments for different patients
  • The privacy concern here is that a hospital
    would not like to share the knowledge unless the
    other site also has it, to protect the privacy of
    itself and its operation
  • Another scenario
  • Two bookstores would like to learn what books are
    sold together so that they make some offers to
    their companies (Amazon does that actually)

35
Case study Association rules
  • How do we mine association rules from distributed
    sources while preserving the privacy of the data
    owners?
  • The confidential information in this case is
  • The data itself
  • The fact that a local site supports a rule with
    certain confidence and certain support (no
    company wants to lose competitive advantage, and
    would not like to reveal anything if it will not
    benefit from the release of the data)
  • Privacy preserving distributed association rule
    mining methods use distributed rule mining
    techniques

36
Distributed rule mining
  • We know how rules are mined from centralized
    databases
  • The distributed scenario is similar
  • Consider that we have only two sites S1 and S2,
    which have databases D1 (with 3 transactions) and
    D2 (with 5 transactions)

37
Distributed rule mining
  • We would like to mine the databases as if they
    are parts of a single centralized database of 8
    transactions
  • In order to do this, we need to calculate the
    local supports
  • For example the local support of A in D1 is 100%
  • The local support of the itemset {A,B,C} in D1 is
    66%, and the local support of {A,B,C} in D2 is
    40%.

38
Distributed rule mining
  • Assume that the minimum support threshold is 50%;
    then {A,B,C} is frequent in D1, but it is not
    frequent in D2.
  • However, when we assume that the databases are
    combined, the support of {A,B,C} in D1 ∪ D2
    is 50%,
  • which means that an itemset could be locally
    frequent in one database, but not frequent in
    another database, and still be frequent globally
  • In order for an itemset to be frequent globally,
    it should be frequent in at least one database
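
The local and global supports in this example can be reproduced with a few lines of Python. The counts used here (2 of 3 transactions in D1 and 2 of 5 in D2) are inferred from the percentages on the slides, and the function names are hypothetical.

```python
# (number of transactions containing {A,B,C}, total transactions) per site;
# the counts are inferred from the 66% and 40% local supports.
sites = {"D1": (2, 3), "D2": (2, 5)}
min_support = 0.5


def local_support(site):
    count, size = sites[site]
    return count / size


def global_support():
    total_count = sum(count for count, _ in sites.values())
    total_size = sum(size for _, size in sites.values())
    return total_count / total_size


locally_frequent = {name: local_support(name) >= min_support for name in sites}
globally_frequent = global_support() >= min_support
```

The itemset passes the threshold in D1 only, yet the combined support is 4 of 8 transactions, so it is also globally frequent.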

39
Distributed rule mining
  • The algorithm is based on apriori which prunes
    the rules by looking at the support
  • Apriori also uses the fact that an itemset is
    frequent only if all its subsets are frequent
  • Therefore only frequent itemsets should be used
    to generate larger frequent itemsets

40
Distributed rule mining
  • The local sites will find their frequent
    itemsets.
  • They will broadcast the frequent itemsets to each
    other
  • Individual sites will count the frequencies of
    the itemsets in their local database
  • They will broadcast the result to every site
  • Every site can now find globally frequent itemsets

41
Distributed rule mining
  • Ex: 50% min supp threshold
  • We will start from singletons and calculate
    the frequencies of items
  • In D1, A (freq 3), B (freq 2), C (freq 3) are
    frequent; in D2, A (freq 4), B (freq 3), C (freq
    3) are frequent
  • They will broadcast the results to each other and
    each site will update the counts of A, B, C by
    adding the local counts

42
Distributed rule mining
  • Ex: 50% min supp threshold
  • Each site will eliminate the items that are not
    globally frequent. In this case all of A, B, C
    are globally frequent.
  • Now using the frequent items, each site will
    generate candidates of size 2 which are {A,B},
    {A,C}, {B,C}
  • And the same steps will be applied
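
One round of this count exchange can be sketched in Python using the singleton counts given above. The sketch simulates the broadcast by merging dictionaries in a single process; the function name and data layout are assumptions, and no privacy protection is applied at this stage.

```python
from itertools import combinations


def globally_frequent(local_counts, sizes, min_support):
    """Add up the local counts each site broadcasts and keep the itemsets
    whose combined support reaches the threshold."""
    total_size = sum(sizes)
    merged = {}
    for counts in local_counts:
        for item, count in counts.items():
            merged[item] = merged.get(item, 0) + count
    return {item: c for item, c in merged.items() if c / total_size >= min_support}


# Local singleton counts from the example: D1 has 3 transactions, D2 has 5.
d1_counts = {"A": 3, "B": 2, "C": 3}
d2_counts = {"A": 4, "B": 3, "C": 3}

frequent_items = globally_frequent([d1_counts, d2_counts], sizes=[3, 5], min_support=0.5)

# Apriori step: build size-2 candidates only from globally frequent items.
candidates = list(combinations(sorted(frequent_items), 2))
```

The same function could then be applied to the local counts of the size-2 candidates, and so on for larger itemsets.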

43
Now we would like to do the same thing but
preserve the privacy of the individual sites
  • The basic notions we need for that are
  • Commutative encryption
  • And Secure multi-party computation
  • An encryption is commutative if the following two
    conditions hold for any given feasible encryption
    keys K1, K2, ..., Kn, any message M, and any
    permutations i, j
  • $E_{K_{i_1}}(\ldots E_{K_{i_n}}(M) \ldots) = E_{K_{j_1}}(\ldots E_{K_{j_n}}(M) \ldots)$
  • For different M1 and M2, the probability of
    collision is very low
  • RSA is a famous commutative encryption technique
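
The commutativity property can be illustrated with a toy Python sketch. Note that it does not use RSA: it uses exponentiation modulo a shared prime (in the style of the Pohlig-Hellman cipher), chosen here because it is short to write down. The modulus, key generation, and function names are assumptions for illustration, and `random` is not a cryptographically secure source of keys.

```python
import math
import random

P = 2**127 - 1  # a known Mersenne prime, used as the shared modulus (toy choice)


def make_key():
    """Pick an exponent that is invertible modulo P - 1 so decryption exists."""
    while True:
        k = random.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k


def encrypt(key, m):
    return pow(m, key, P)


def decrypt(key, c):
    # pow(key, -1, n) computes the modular inverse (Python 3.8+)
    return pow(c, pow(key, -1, P - 1), P)


k1, k2 = make_key(), make_key()
message = 123456789                      # must lie in 1 .. P-1

c12 = encrypt(k1, encrypt(k2, message))  # K2 applied first, then K1
c21 = encrypt(k2, encrypt(k1, message))  # K1 applied first, then K2
recovered = decrypt(k1, decrypt(k2, c12))
```

Both encryption orders yield the same ciphertext, and the layers can be removed in any order.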

44
A simple application of commutative encryption
  • Assume that person A has salary S1, and person B
    has salary S2.
  • How can they know whether their salaries are
    equal to each other? (without revealing their
    salaries)
  • Assume that A, and B have their own encryption
    keys, say K1, and K2. And we go from there!
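
The slide leaves the remaining steps open. One plausible completion, sketched below in Python, is that each person encrypts their own salary, sends it to the other, and the other adds a second layer of encryption; the doubly encrypted values are equal exactly when the salaries are equal. The cipher is the same toy modular-exponentiation scheme assumed for illustration (not RSA), and all names are hypothetical.

```python
import math
import random

P = 2**127 - 1  # shared prime modulus (toy choice, an assumption)


def make_key():
    """Exponent coprime to P - 1, so that encryption is a one-to-one mapping."""
    while True:
        k = random.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k


def encrypt(key, m):
    return pow(m, key, P)


def salaries_equal(s1, s2):
    """Simulates both parties in one process; salaries must lie in 1 .. P-1."""
    k1, k2 = make_key(), make_key()      # A's key K1 and B's key K2, kept private

    a_to_b = encrypt(k1, s1)             # A sends E_K1(S1) to B
    b_to_a = encrypt(k2, s2)             # B sends E_K2(S2) to A

    double_s1 = encrypt(k2, a_to_b)      # B computes E_K2(E_K1(S1))
    double_s2 = encrypt(k1, b_to_a)      # A computes E_K1(E_K2(S2))

    # By commutativity the two values match exactly when S1 == S2.
    return double_s1 == double_s2


same = salaries_equal(5000, 5000)
different = salaries_equal(5000, 5200)
```

Neither party sees the other's salary in the clear, only values encrypted under a key they do not hold.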

45
Distributed PP Association Rule Mining
  • For distributed association rule mining, each
    site needs to distribute its locally frequent
    itemsets to the rest of the sites
  • Instead of circulating the actual itemsets, the
    encrypted versions are circulated
  • Example
  • S1 contains A, S2 contains B, S3 contains A. Each
    of them has its own key: K1, K2, K3.
  • At the end of step 1, all sites will have
    items encrypted by all sites.
  • The encrypted items are then passed to a common
    site to eliminate the duplicates and to start
    decryption. This way they will not know who has
    sent which item.
  • Decryption can now start, and after everybody has
    finished decrypting, they will have the
    actual items.

46
Distributed PP Association Rule Mining
  • Now we need to see if the global support of an
    item is larger than the threshold.
  • We do not want to reveal the supports, since the
    support of an item is assumed to be confidential.
  • A secure multi-party computation technique is
    utilized for this
  • Assume that there are three sites, and each of
    them has {A,B,C}; its frequency in S1 is 5 (out
    of 100 transactions), in S2 is 6 (out of 200),
    and in S3 is 20 (out of 300), and the minimum
    support is 5%.
  • S1 selects a random number, say 17
  • S1 adds the difference 5 - 5% x 100 to 17 and sends
    the result (17) to S2
  • S2 adds 6 - 5% x 200 to 17 and sends the result
    (13) to S3.
  • S3 adds 20 - 5% x 300 to 13 and sends the result
    (18) back to S1
  • 18 > the chosen random number (17), so {A,B,C} is
    globally frequent.
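
The round of additions in this example can be traced with a small Python sketch. It simulates the three sites in one process and follows only the arithmetic described above; any further masking a real protocol would need (for instance modular arithmetic or a secure final comparison) is not described on the slide and is omitted. The function name is hypothetical.

```python
from fractions import Fraction


def secure_frequency_check(sites, min_support, random_start):
    """Each site adds (local count - min_support * local size) to the running
    value it receives and passes the result on; the first site compares the
    final value with its own random starting number."""
    running = random_start
    trace = []
    for count, size in sites:
        running += count - min_support * size   # excess support at this site
        trace.append(running)                   # value forwarded to the next site
    return running >= random_start, trace


# (frequency of {A,B,C}, number of transactions) for S1, S2, S3
sites = [(5, 100), (6, 200), (20, 300)]
is_frequent, trace = secure_frequency_check(sites, Fraction(5, 100), random_start=17)
```

The forwarded values are 17, 13, and 18, so no site sees another site's raw count, and the final comparison against 17 shows that the combined excess support is non-negative.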

47
Distributed PP Association Rule Mining
  • This technique assumes a semi-honest model
  • Where each party follows the rules of the
    protocol using its correct input, but it is free
    to later use what it sees during execution of the
    protocol to compromise security.
  • Cost of encryption is the key issue since it is
    heavily used in this method.