1
Privacy Preserving Mining of Association Rules
  • Alexandre Evfimievski
  • Ramakrishnan Srikant
  • Rakesh Agrawal
  • Johannes Gehrke

IBM Almaden Research Center; Cornell University
3
Data Mining and Privacy
  • The primary task in data mining: development of
    models about aggregated data.
  • Can we develop accurate models without access to
    precise information in individual data records?
  • Answer: yes, by randomization.
  • R. Agrawal, R. Srikant, "Privacy Preserving Data
    Mining", SIGMOD 2000
  • for numerical attributes, classification
  • How about association rules?

6
Randomization Overview
[Diagram: Alice (J.S. Bach, painting, nasa.gov, ...), Bob (B. Spears, baseball, cnn.com, ...), and Chris (B. Marley, camping, linux.org, ...) send their items to the Recommendation Service, which derives Associations and Recommendations.]
7
Randomization Overview
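[Diagram: each client randomizes its items before sending them. Alice's J.S. Bach, painting, nasa.gov, ... arrives as Metallica, painting, nasa.gov, ...; Bob's B. Spears, baseball, cnn.com, ... arrives as B. Spears, soccer, bbc.co.uk, ...; Chris's B. Marley, camping, linux.org, ... arrives as B. Marley, camping, microsoft.com, ... The Recommendation Service applies Support Recovery before deriving Associations and Recommendations.]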
9
Associations Recap
  • A transaction t is a set of items (e.g. books)
  • All transactions form a set T of transactions
  • Any itemset A has support s in T if
    s = supp(A) = #{t ∈ T : A ⊆ t} / |T|
  • Itemset A is frequent if s ≥ s_min
  • If A ⊆ B, then supp(A) ≥ supp(B).
  • Example:
  • 20% of transactions contain beer,
  • 5% of transactions contain beer and diapers
  • Then the confidence of beer ⇒ diapers is 5/20 =
    0.25 = 25%.
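
The support and confidence definitions above can be checked on a toy dataset. This is a minimal sketch, not code from the talk: the function names and the 20-transaction dataset are assumptions chosen to reproduce the beer and diapers percentages.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """supp(antecedent ∪ consequent) / supp(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Hypothetical data: 20 transactions, 4 with beer (20%),
# 1 of which also has diapers (5%).
transactions = (
    [frozenset({"beer", "diapers"})]
    + [frozenset({"beer"})] * 3
    + [frozenset({"milk"})] * 16
)

print(support({"beer"}, transactions))
print(support({"beer", "diapers"}, transactions))
print(confidence({"beer"}, {"diapers"}, transactions))
```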

10
The Problem
  • How to randomize transactions so that
  • we can find frequent itemsets
  • while preserving privacy at transaction level?

11
Talk Outline
  • Introduction
  • Privacy Breaches
  • Our Solution
  • Experiments
  • Conclusion

12
Uniform Randomization
  • Given a transaction,
  • keep item with 20% probability,
  • replace with a new random item with 80%
    probability.
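
A minimal sketch of uniform randomization as described on this slide. The function name, the `all_items` catalog, and the choice to draw replacements uniformly from the whole catalog (so a replacement may coincide with an existing item) are assumptions; the slide does not specify them.

```python
import random


def uniform_randomize(transaction, all_items, p_keep=0.2, rng=random):
    """Keep each item with probability p_keep, else swap in a random item.

    Assumption: the replacement is drawn uniformly from `all_items`
    (a sequence), so it may repeat an item that is already present.
    """
    randomized = set()
    for item in sorted(transaction):
        if rng.random() < p_keep:
            randomized.add(item)                    # keep the true item
        else:
            randomized.add(rng.choice(all_items))   # replace it
    return randomized


# Hypothetical catalog of 10,000 items and one small transaction.
catalog = [f"item{n}" for n in range(10_000)]
original = {"item1", "item2", "item3"}
print(uniform_randomize(original, catalog, rng=random.Random(42)))
```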

16
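Example {x, y, z}
10 M transactions of size 10 with 10 K items
Uniform randomization: how many randomized transactions have {x, y, z}?
Original transactions | Chance of {x, y, z} after randomization | Share of all transactions (count) | Share among randomized transactions with {x, y, z}
1% have {x, y, z} | 0.2^3 | 0.008% (800 trans.) | 97.8%
5% have {x, y}, {x, z}, or {y, z} only | 0.2^2 · 8/10,000 | 0.00016% (16 trans.) | 1.9%
94% have one or zero items of {x, y, z} | at most 0.2 · (9/10,000)^2 | less than 0.00002% (2 trans.) | 0.3%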
17
Example {x, y, z}
  • Given nothing, we have only 1% probability that
    {x, y, z} occurs in the original transaction
  • Given {x, y, z} in the randomized transaction,
    we have about 98% certainty of {x, y, z} in the
    original one.
  • This is what we call a privacy breach.
  • Uniform randomization preserves privacy "on
    average", but not "in the worst case".
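
The arithmetic behind the "about 98%" figure can be replayed in a few lines. This is a sketch using the numbers from the example (10 M transactions, 10 K items, keep probability 0.2); the variable names are assumptions, and the last group uses the slide's upper bound rather than an exact probability.

```python
n_transactions = 10_000_000
n_items = 10_000
p_keep = 0.2

# group name -> (share of original transactions,
#                chance that the randomized transaction contains {x, y, z})
groups = {
    "all three": (0.01, p_keep ** 3),
    "exactly two": (0.05, p_keep ** 2 * (8 / n_items)),
    "one or zero": (0.94, p_keep * (9 / n_items) ** 2),  # upper bound
}

# Expected number of randomized transactions with {x, y, z} per group.
counts = {name: share * pr * n_transactions
          for name, (share, pr) in groups.items()}
total = sum(counts.values())

# Posterior: given {x, y, z} in t', which group did the original t come from?
posterior = {name: c / total for name, c in counts.items()}

for name in groups:
    print(f"{name}: {counts[name]:.1f} transactions, {posterior[name]:.1%}")
```

Almost all randomized transactions that show {x, y, z} come from originals that really contained it, which is the breach the slide describes.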

18
Privacy Breaches
  • Suppose
  • t is an original transaction
  • t' is the corresponding randomized transaction
  • A is a (frequent) itemset.
  • Definition: Itemset A causes a privacy breach
    of level ρ (e.g. 50%) if, for some item z ∈ A,
    Pr[z ∈ t | A ⊆ t'] ≥ ρ
  • Assumption: no external information besides t'.

19
Talk Outline
  • Introduction
  • Privacy Breaches
  • Our Solution
  • Experiments
  • Conclusion

21
Our Solution
"Where does a wise man hide a leaf? In the
forest. But what does he do if there is no
forest? He grows a forest to hide it
in." (G.K. Chesterton)
  • Insert many false items into each transaction
  • Hide true itemsets among false ones
  • Can we still find frequent itemsets while having
    sufficient privacy?

25
Definition of cut-and-paste
  • Given transaction t of size m, construct t'
  • Choose a number j between 0 and K_m
    (cutoff)
  • Include j items of t into t'
  • Each other item is included into t' with
    probability p_m .
  • The choice of K_m and p_m is based on the
    desired level of privacy.

t  = a, b, c, u, v, w, x, y, z
j  = 4
t' = b, v, x, z, œ, å, ß, ... (inserted items)
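
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: j is drawn uniformly from 0..K_m and capped at the transaction size, the j kept items are sampled uniformly without replacement, and the names `cut_and_paste`, `all_items`, `k_m`, `p_m` are hypothetical.

```python
import random


def cut_and_paste(transaction, all_items, k_m, p_m, rng=random):
    """Randomize one transaction with a cut-and-paste style operator."""
    items = sorted(transaction)
    # Assumption: cutoff j is uniform on 0..k_m, capped at |t|.
    j = min(rng.randint(0, k_m), len(items))
    randomized = set(rng.sample(items, j))       # include j items of t
    # Every other item (the rest of t and all items outside t)
    # enters t' independently with probability p_m.
    for item in all_items:
        if item not in randomized and rng.random() < p_m:
            randomized.add(item)
    return randomized


# Hypothetical 26-item catalog and the transaction from the slide.
catalog = list("abcdefghijklmnopqrstuvwxyz")
t = set("abcuvwxyz")
print(sorted(cut_and_paste(t, catalog, k_m=4, p_m=0.2, rng=random.Random(7))))
```

Because the inserted items are drawn from the whole catalog, true items that survive the cut are mixed with false ones.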
26
Partial Supports
  • To recover original support of an itemset, we
    need randomized supports of its subsets.
  • Given an itemset A of size k and transaction
    size m,
  • A vector of partial supports of A is
    s = (s_0, s_1, ..., s_k), where
    s_l = #{t ∈ T : #(t ∩ A) = l} / |T|
  • Here s_k is the same as the support of A.
  • Randomized partial supports are denoted by
    s' = (s'_0, s'_1, ..., s'_k).
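
A partial-support vector can be computed by counting, for each transaction, how many items of A it contains. This is a minimal sketch with an assumed function name and toy data.

```python
def partial_supports(itemset, transactions):
    """Return [s_0, ..., s_k]: s_l is the fraction of transactions
    that contain exactly l items of `itemset` (k = len(itemset))."""
    itemset = frozenset(itemset)
    counts = [0] * (len(itemset) + 1)
    for t in transactions:
        counts[len(itemset & t)] += 1
    return [c / len(transactions) for c in counts]


# Hypothetical data: four transactions, itemset A = {x, y}.
data = [{"x", "y"}, {"x"}, {"z"}, {"x", "y", "w"}]
print(partial_supports({"x", "y"}, data))  # last entry is supp(A)
```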

27
Transition Matrix
  • Let k = |A|, m = |t|.
  • Transition matrix P = P(k, m) connects
    randomized partial supports with original ones:
    E[s'] = P · s, where
    P_{l'l} = Pr[#(t' ∩ A) = l' | #(t ∩ A) = l]
  • Randomized supports are distributed as a sum of
    multinomial distributions.

29
The Unbiased Estimators
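  • Given randomized partial supports s', we can
    estimate original partial supports:
    s_est = Q · s', where Q = P^(-1)
  • Covariance matrix for this estimator:
    Cov(s_est) = (1/|T|) · Σ_{l=0..k} s_l · Q D[l] Q^T,
    where D[l]_{ij} = P_{il} δ_{ij} − P_{il} P_{jl}
  • To estimate it, substitute s_l with (s_est)_l .
  • Special case: estimators for support and its
    variance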
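
To make the estimator concrete, the sketch below builds a transition matrix for a deliberately simple randomization and inverts it. The operator is an assumption chosen for illustration, not cut-and-paste: each item of t is kept with probability `p_keep` and each item outside t is inserted with probability `p_add`, so the matrix does not depend on the transaction size m. All names are hypothetical.

```python
from math import comb


def transition_matrix(k, p_keep, p_add):
    """P[l2][l] = Pr[#(t' ∩ A) = l2 | #(t ∩ A) = l] for |A| = k."""
    P = [[0.0] * (k + 1) for _ in range(k + 1)]
    for l in range(k + 1):                      # items of A present in t
        for kept in range(l + 1):               # ...that survive
            pr_kept = comb(l, kept) * p_keep**kept * (1 - p_keep)**(l - kept)
            for added in range(k - l + 1):      # absent items of A inserted
                pr_added = (comb(k - l, added) * p_add**added
                            * (1 - p_add)**(k - l - added))
                P[kept + added][l] += pr_kept * pr_added
    return P


def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * c for a, c in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


k = 2
P = transition_matrix(k, p_keep=0.8, p_add=0.1)
s_true = [0.7, 0.2, 0.1]                        # toy original partial supports
# Expected randomized partial supports: E[s'] = P · s
s_rand = [sum(P[i][l] * s_true[l] for l in range(k + 1)) for i in range(k + 1)]
s_est = solve(P, s_rand)                        # s_est = P^(-1) · s'
print(s_est)
```

With real data, s' would be counted from the randomized transactions, so s_est would only approximate s; here the expected value is fed in to show that the inversion recovers the original vector.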

31
Class of Randomizations
  • Our analysis works for any randomization that
    satisfies two properties
  • A per-transaction randomization applies the same
    procedure to each transaction, using no
    information about other transactions
  • An item-invariant randomization does not depend
    on any ordering or naming of items.
  • Both uniform and cut-and-paste randomizations
    satisfy these two properties.

32
Apriori
  • Let k = 1, candidate sets = all 1-itemsets.
  • Repeat:
  • Count support for all candidate sets
  • Output the candidate sets with support ≥ s_min
  • New candidate sets = all (k + 1)-itemsets s.t.
    all their k-subsets are candidate sets with
    support ≥ s_min
  • Let k = k + 1
  • Stop when there are no more candidate sets.
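
The loop above can be written compactly for small in-memory data. This is a sketch of plain Apriori on non-randomized transactions; the function name and toy baskets are assumptions, and no attempt is made at the counting optimizations a real implementation would use.

```python
from itertools import combinations


def apriori(transactions, s_min):
    """Return {itemset: support} for all itemsets with support >= s_min."""
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count support and keep the frequent candidates of size k.
        level = {c: s for c in candidates if (s := supp(c)) >= s_min}
        frequent.update(level)
        # New candidates: (k+1)-itemsets whose k-subsets are all frequent.
        unions = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = [c for c in unions
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k))]
        k += 1
    return frequent


baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
for itemset, s in apriori(baskets, s_min=0.5).items():
    print(sorted(itemset), s)
```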

33
The Modified Apriori
  • Let k = 1, candidate sets = all 1-itemsets.
  • Repeat:
  • Estimate support and variance (σ²) for all
    candidate sets
  • Output the candidate sets with support ≥ s_min
  • New candidate sets = all (k + 1)-itemsets s.t.
    all their k-subsets are candidate sets with
    support ≥ s_min − σ
  • Let k = k + 1
  • Stop when there are no more candidate sets, or
    the estimator's precision becomes unsatisfactory.
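
A sketch of how the modified loop differs from plain Apriori. The support estimator is passed in as a hypothetical callback `estimate(itemset) -> (support, sigma)`; the `max_sigma` stopping rule is an assumed reading of "precision becomes unsatisfactory", and the toy estimates are invented for illustration.

```python
from itertools import combinations


def modified_apriori(items, estimate, s_min, max_sigma=float("inf")):
    """Apriori over estimated supports.

    Output uses the threshold s_min; candidate generation uses the
    looser threshold s_min - sigma so borderline itemsets are not lost.
    """
    candidates = [frozenset([i]) for i in sorted(items)]
    output = {}
    k = 1
    while candidates:
        stats = {c: estimate(c) for c in candidates}
        if any(sigma > max_sigma for _, sigma in stats.values()):
            break  # assumption: stop once any estimate is too imprecise
        output.update({c: s for c, (s, _) in stats.items() if s >= s_min})
        kept = {c for c, (s, sigma) in stats.items() if s >= s_min - sigma}
        unions = {a | b for a in kept for b in kept if len(a | b) == k + 1}
        candidates = [c for c in unions
                      if all(frozenset(sub) in kept
                             for sub in combinations(c, k))]
        k += 1
    return output


# Hypothetical estimates: {b} is slightly under-estimated (0.18 < 0.2),
# yet {a, b} is still reached because 0.18 >= s_min - sigma.
toy = {frozenset("a"): 0.5, frozenset("b"): 0.18,
       frozenset("c"): 0.10, frozenset("ab"): 0.21}


def estimate(itemset):
    return toy.get(itemset, 0.0), 0.05


print(modified_apriori({"a", "b", "c"}, estimate, s_min=0.2))
```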

35
Privacy Breach Analysis
  • How many added items are enough to protect
    privacy?
  • Have to satisfy Pr[z ∈ t | A ⊆ t'] < ρ (⇔ no
    privacy breaches)
  • Select parameters so that it holds for all
    itemsets.
  • Use formula (
    )
  • Parameters are to be selected in advance!
  • Construct a privacy-challenging test: an itemset
    all of whose subsets have maximum possible support.
  • Enough to know maximal support of an itemset for
    each size.

36
Graceful Tradeoff
  • Want more precision or more privacy?
  • Adjust privacy breach level
  • A small relaxation of privacy restrictions
    results in a small increase in precision of
    estimators.

37
Talk Outline
  • Introduction
  • Privacy Breaches
  • Our Solution
  • Experiments
  • Support recovery vs. parameters
  • Real-life data
  • Conclusion

38
Lowest Discoverable Support
  • LDS is the support that, when predicted, is 4σ away from
    zero.
  • Roughly, LDS is proportional to 1/√|T|

|t| = 5, ρ = 50%
39
LDS vs. Breach Level
|t| = 5, |T| = 5 M
  • Reminder: breach level is the limit on Pr[z ∈
    t | A ⊆ t']

40
Talk Outline
  • Introduction
  • Privacy Breaches
  • Our Solution
  • Experiments
  • Support recovery vs. parameters
  • Real-life data
  • Conclusion

41
Real datasets soccer, mailorder
  • Soccer is the clickstream log of WorldCup98 web
    site, split into sessions of HTML requests.
  • 11 K items (HTMLs), 6.5 M transactions
  • Available at http://www.acm.org/sigcomm/ITA/
  • Mailorder is a purchase dataset from a certain
    on-line store
  • Products are replaced with their categories
  • 96 items (categories), 2.9 M transactions
  • A small fraction of transactions are discarded as
    too long:
  • longer than 10 (for soccer) or 7 (for mailorder)

42
Modified Apriori on Real Data
Breach level = 50%. Inserted 20-50 items to
each transaction.
Soccer: s_min = 0.2%, σ ≈ 0.07% for 3-itemsets
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 266 | 254 | 12 | 31
2 | 217 | 195 | 22 | 45
3 | 48 | 43 | 5 | 26
Mailorder: s_min = 0.2%, σ ≈ 0.05% for 3-itemsets
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 65 | 65 | 0 | 0
2 | 228 | 212 | 16 | 28
3 | 22 | 18 | 4 | 5
43
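False Drops and False Positives
Soccer
False drops: predicted support (%), when true support ≥ 0.2%
Size | < 0.1 | 0.1-0.15 | 0.15-0.2 | ≥ 0.2
1 | 0 | 2 | 10 | 254
2 | 0 | 5 | 17 | 195
3 | 0 | 1 | 4 | 43
False positives: true support (%), when predicted support ≥ 0.2%
Size | < 0.1 | 0.1-0.15 | 0.15-0.2 | ≥ 0.2
1 | 0 | 7 | 24 | 254
2 | 7 | 10 | 28 | 195
3 | 5 | 13 | 8 | 43
Mailorder
False drops: predicted support (%), when true support ≥ 0.2%
Size | < 0.1 | 0.1-0.15 | 0.15-0.2 | ≥ 0.2
1 | 0 | 0 | 0 | 65
2 | 0 | 1 | 15 | 212
3 | 0 | 1 | 3 | 18
False positives: true support (%), when predicted support ≥ 0.2%
Size | < 0.1 | 0.1-0.15 | 0.15-0.2 | ≥ 0.2
1 | 0 | 0 | 0 | 65
2 | 0 | 0 | 28 | 212
3 | 1 | 2 | 2 | 18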
44
Actual Privacy Breaches
  • Verified actual privacy breach levels
  • The breach probabilities are counted in the
    datasets for frequent and near-frequent itemsets.
  • If maximum supports were estimated correctly,
    even worst-case breach levels fluctuated around
    50%
  • At most 53.2% for soccer,
  • At most 55.4% for mailorder.

45
Talk Outline
  • Introduction
  • Privacy Breaches
  • Our Solution
  • Experiments
  • Conclusion

46
Summary
  • Privacy breaches: identified the problem and provided
    a solution for controlling breaches
  • Derived estimators of support and variance for a
    class of randomization operators
  • Algorithm for discovering associations in
    randomized data
  • Validated on real-life datasets
  • Can find associations while preserving privacy at
    the level of individual transactions

47
Future Work
  • Control of more general privacy breaches
  • What about other properties of transactions, for
    example item z ? t breach caused by A ? t ?
    ?
  • What about external information?
  • Theoretical limits of discoverability for a given
    privacy breach level
  • How to compute theoretical limits?
  • How to attain them by an algorithm?

48
Thank You!
49
BACK-UPS
50
Our Solution Example
  • Old set-up:
  • Given 10,000 items, 10 M transactions of size 10
  • 100,000 transactions (1%) contain A = {x, y, z}
  • In addition to uniform randomization with p =
    80%, insert 500 new random items to each
    transaction.
  • 800 transactions contain {x, y, z} before and
    after
  • Roughly (10 M) · (500 / 10,000)^3 = 1250
    transactions contain none before and full {x, y,
    z} after.
  • Presence of {x, y, z} in a randomized transaction
    now says little about the original transaction.

51
Privacy Breach Analysis
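  • GIVEN: itemset A, and item z ∈ A
  • WANTED: Pr[z ∈ t | A ⊆ t']
  • Assume that partial supports are probabilities:
    s_l = Pr[#(t ∩ A) = l]
  • Define s_l^+ = Pr[#(t ∩ A) = l, z ∈ t]
  • Then we have
    Pr[z ∈ t | A ⊆ t'] = Σ_l s_l^+ · P_{kl} / Σ_l s_l · P_{kl},
    where P_{kl} = Pr[#(t' ∩ A) = k | #(t ∩ A) = l]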

52
Limiting Privacy Breaches
  • We want to make sure that always
    Pr[z ∈ t | A ⊆ t'] < ρ
  • But we do not know supports in advance.
  • Solution: For each itemset size k, give
    privacy-challenging test values to the partial supports.
  • The test is an itemset whose subsets have maximum
    supports
  • We need to estimate maximum support values prior
    to randomization

53
LDS vs. Transaction Size
ρ = 50%, |T| = 5 M
  • Transactions that are too long cannot be used for
    prediction

54
Related Work
  • R. Agrawal, R. Srikant, "Privacy Preserving Data
    Mining", SIGMOD 2000
  • Each client has a numerical attribute x_i
  • Client i sends x_i + y_i , where y_i is a random
    offset with known distribution
  • Server reconstructs the distribution of original
    attributes (EM algorithm)
  • The distribution is then used for classification
  • Numerical attributes only

55
Related Work
  • Y. Lindell and B. Pinkas, "Privacy Preserving Data
    Mining", Crypto 2000
  • J. Vaidya and C. Clifton, "Privacy Preserving
    Association Rule Mining in Vertically Partitioned
    Data"

56
Privacy Concern
  • Popular press
  • The End of Privacy, The Death of Privacy
  • Government directives
  • European directive on privacy protection (Oct 98)
  • Canadian Personal Information Protection Act (Jan
    2001)
  • Surveys of Web users
  • 17% fundamentalists, 56% pragmatic majority, 27%
    marginally concerned (April 99)
  • 82% said having privacy would matter (July 99)