Title: Privacy Preserving Mining of Association Rules
1Privacy Preserving Mining of Association Rules
- Alexandre Evfimievski,
- Ramakrishnan Srikant,
- Rakesh Agrawal,
- Johannes Gehrke
IBM Almaden Research Center; Cornell University
3Data Mining and Privacy
- The primary task in data mining: development of models about aggregated data.
- Can we develop accurate models without access to precise information in individual data records?
- Answer: yes, by randomization.
- R. Agrawal, R. Srikant: Privacy Preserving Data Mining, SIGMOD 2000
- for numerical attributes, classification
- How about association rules?
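7Randomization Overview
[Figure: each client's transaction is randomized before it reaches the Recommendation Service, which then performs Support Recovery.]
J.S. Bach, painting, nasa.gov, ... -> Metallica, painting, nasa.gov, ...
B. Spears, baseball, cnn.com, ... -> B. Spears, soccer, bbc.co.uk, ...
B. Marley, camping, linux.org, ... -> B. Marley, camping, microsoft.com, ...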
9Associations Recap
- A transaction $t$ is a set of items (e.g. books)
- All transactions form a set $T$ of transactions
- Any itemset $A$ has support $s$ in $T$ if $s = |\{t \in T : A \subseteq t\}| / |T|$
- Itemset $A$ is frequent if $s \ge s_{\min}$
- If $A \subseteq B$, then $\mathrm{supp}(A) \ge \mathrm{supp}(B)$.
- Example:
- 20% transactions contain beer,
- 5% transactions contain beer and diapers
- Then confidence of "beer $\Rightarrow$ diapers" is 5/20 = 0.25 = 25%.
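The recap above can be checked on a toy dataset. The following is a minimal sketch, assuming transactions are stored as Python sets; the function names and the made-up data (100 transactions, 20 with beer, 5 of those also with diapers) are illustrative only.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """supp(antecedent and consequent together) / supp(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)


# Made-up data: 5 transactions with beer and diapers, 15 with beer only,
# 80 with neither.
transactions = (
    [frozenset({"beer", "diapers"})] * 5
    + [frozenset({"beer"})] * 15
    + [frozenset({"milk"})] * 80
)

print(support({"beer"}, transactions))
print(confidence({"beer"}, {"diapers"}, transactions))
```

On this data the support of beer is 20% and the confidence of the rule is about 25%, matching the example on the slide.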
10The Problem
- How to randomize transactions so that
- we can find frequent itemsets
- while preserving privacy at transaction level?
11Talk Outline
- Introduction
- Privacy Breaches
- Our Solution
- Experiments
- Conclusion
12Uniform Randomization
- Given a transaction,
- keep item with 20% probability,
- replace with a new random item with 80% probability.
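A minimal sketch of this operator, under assumptions the slide does not spell out: the replacement is drawn uniformly from the whole item universe, and colliding items simply collapse because the result is a set. The function name and parameters are hypothetical.

```python
import random


def uniform_randomize(transaction, universe, keep_prob=0.2, rng=random):
    """Keep each item with probability `keep_prob`; otherwise replace it
    with an item drawn uniformly from `universe` (a list)."""
    return {
        item if rng.random() < keep_prob else rng.choice(universe)
        for item in transaction
    }


universe = [f"item{i}" for i in range(10_000)]
original = {"item1", "item2", "item3"}
print(uniform_randomize(original, universe, rng=random.Random(0)))
```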
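16Example {x, y, z}
10 M transactions of size 10 with 10 K items
Uniform randomization: how many randomized transactions have {x, y, z}?
- 1% have {x, y, z}: $0.2^3$ each, 0.008% of all transactions (800 trans.), 97.8% of the randomized transactions with {x, y, z}
- 5% have {x, y}, {x, z}, or {y, z} only: $0.2^2 \cdot 8/10{,}000$ each, 0.00016% (16 trans.), 1.9%
- 94% have one or zero items of {x, y, z}: at most $0.2 \cdot (9/10{,}000)^2$ each, less than 0.00002% (2 transactions), 0.3%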
17Example {x, y, z}
- Given nothing, we have only 1% probability that {x, y, z} occurs in the original transaction
- Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one.
- This is what we call a privacy breach.
- Uniform randomization preserves privacy "on average," but not "in the worst case."
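The "about 98%" figure can be reproduced with a few lines of arithmetic. This is a rough sketch that reuses the per-group factors quoted in the example (0.2 keep probability, 8/10,000 and 9/10,000 for accidental insertions); the variable names are illustrative.

```python
N = 10_000_000   # transactions
ITEMS = 10_000   # distinct items
KEEP = 0.2       # probability that an item survives randomization

# Expected number of randomized transactions that contain {x, y, z},
# split by how many of x, y, z the original transaction had.
from_all_three = 0.01 * N * KEEP**3               # all three items kept
from_two = 0.05 * N * KEEP**2 * (8 / ITEMS)       # two kept, third inserted
from_fewer = 0.94 * N * KEEP * (9 / ITEMS) ** 2   # upper bound for the rest

# Probability that the original had {x, y, z}, given the randomized one does.
posterior = from_all_three / (from_all_three + from_two + from_fewer)
print(round(from_all_three), round(from_two), posterior)
```

The prior is 1%, while the posterior comes out close to 98%, which is the gap the talk calls a privacy breach.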
18Privacy Breaches
- Suppose:
- $t$ is an original transaction
- $t'$ is the corresponding randomized transaction
- $A$ is a (frequent) itemset.
- Definition: Itemset $A$ causes a privacy breach of level $\rho$ (e.g. 50%) if, for some item $z \in A$, $\Pr[z \in t \mid A \subseteq t'] \ge \rho$.
- Assumption: no external information besides $t'$.
21Our Solution
"Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest? He grows a forest to hide it in." (G.K. Chesterton)
- Insert many false items into each transaction
- Hide true itemsets among false ones
- Can we still find frequent itemsets while having sufficient privacy?
25Definition of cut-and-paste
- Given transaction $t$ of size $m$, construct $t'$:
- Choose a number $j$ between 0 and $K_m$ (cutoff)
- Include $j$ items of $t$ into $t'$
- Each other item is included into $t'$ with probability $p_m$.
- The choice of $K_m$ and $p_m$ is based on the desired level of privacy.
Example: $t$ = {a, b, c, u, v, w, x, y, z}, $j = 4$, $t'$ = {b, v, x, z, œ, å, ß, ...}
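The steps above can be sketched in a few lines. Several details are assumptions, not taken from the slide: j is drawn uniformly from 0..K, it is capped at the transaction size, the j kept items are a uniform sample without replacement, and "each other item" ranges over the whole universe, including the items of t that were not kept. The function and parameter names are hypothetical.

```python
import random


def cut_and_paste(t, universe, K, p, rng=random):
    """Randomize transaction `t` (a set) over the item `universe`.

    Assumed details: j uniform on 0..K, capped at len(t); kept items
    sampled uniformly without replacement; every other item of the
    universe included independently with probability p.
    """
    t = set(t)
    j = min(rng.randint(0, K), len(t))
    kept = set(rng.sample(sorted(t), j))   # the "cut" part
    pasted = {x for x in universe if x not in kept and rng.random() < p}
    return kept | pasted


universe = list("abcdefghijklmnopqrstuvwxyz")
t = {"a", "b", "c", "u", "v", "w", "x", "y", "z"}
print(sorted(cut_and_paste(t, universe, K=4, p=0.3, rng=random.Random(7))))
```

Larger p hides the kept items among more false ones, at the cost of noisier support estimates.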
26Partial Supports
- To recover original support of an itemset, we need randomized supports of its subsets.
- Given an itemset $A$ of size $k$ and transaction size $m$,
- A vector of partial supports of $A$ is $s = (s_0, s_1, \dots, s_k)$, where $s_l = |\{t \in T : |t \cap A| = l\}| / |T|$.
- Here $s_k$ is the same as the support of $A$.
- Randomized partial supports are denoted by $s' = (s'_0, s'_1, \dots, s'_k)$.
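As an illustration, partial supports can be computed by counting, for each transaction, how many items of A it contains. This is a minimal sketch on made-up data; it assumes the l-th entry is the fraction of transactions containing exactly l items of A.

```python
def partial_supports(itemset, transactions):
    """Return [s_0, ..., s_k], where s_l is the fraction of transactions
    that contain exactly l items of `itemset`."""
    a = frozenset(itemset)
    counts = [0] * (len(a) + 1)
    for t in transactions:
        counts[len(a & t)] += 1
    return [c / len(transactions) for c in counts]


transactions = [{"x", "y"}, {"x"}, {"x", "y", "z"}, {"w"}]
print(partial_supports({"x", "y"}, transactions))
```

The last entry is the ordinary support of the itemset; here two of the four transactions contain both x and y.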
27Transition Matrix
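- Let $k = |A|$, $m = |t|$.
- Transition matrix $P = P(k, m)$ connects randomized partial supports with original ones: $\mathbf{E}\, s' = P \cdot s$, where $P_{l'l} = \Pr[\,|t' \cap A| = l' \mid |t \cap A| = l\,]$.
- Randomized supports are distributed as a sum of multinomial distributions.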
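To make the idea of a transition matrix concrete, here is a sketch for a deliberately simple toy randomization, not the cut-and-paste operator: each item of A that is in t survives with probability `keep`, and each item of A missing from t is inserted with probability `ins`, all independently. Under that assumption the entries are products of binomial probabilities; the function name and parameters are hypothetical.

```python
from math import comb


def transition_matrix(k, keep, ins):
    """P[l2][l] = Pr[|t' & A| = l2 given |t & A| = l] for the toy operator:
    present items of A survive with probability `keep`, absent items of A
    are inserted with probability `ins`."""
    def binom(n, i, q):
        # Binomial probability of i successes out of n, 0 outside the range.
        return comb(n, i) * q**i * (1 - q) ** (n - i) if 0 <= i <= n else 0.0

    return [
        [
            sum(binom(l, i, keep) * binom(k - l, l2 - i, ins) for i in range(l + 1))
            for l in range(k + 1)
        ]
        for l2 in range(k + 1)
    ]


P = transition_matrix(2, keep=0.5, ins=0.1)
for row in P:
    print([round(x, 4) for x in row])
```

Each column is a probability distribution over the randomized count, so every column sums to 1.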
29The Unbiased Estimators
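- Given randomized partial supports, we can estimate original partial supports: $s_{est} = Q \cdot s'$, where $Q = P^{-1}$.
- Covariance matrix for this estimator: $\mathrm{Cov}\, s_{est} = \frac{1}{|T|} \sum_{l=0}^{k} s_l \, Q\, D[l]\, Q^{T}$, where $D[l]_{ij} = P_{il}\,\delta_{ij} - P_{il} P_{jl}$.
- To estimate it, substitute $s_l$ with $(s_{est})_l$.
- Special case: estimators for support and its variance.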
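One way to read "estimate original partial supports" is to invert the linear relation between the expected randomized partial supports and the original ones. The sketch below assumes that relation has the form E s' = P s with an invertible P, and uses a made-up 2x2 matrix for a single item (k = 1); the small solver is a generic Gauss-Jordan routine, not anything specific to the talk.

```python
def solve(P, b):
    """Solve P x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(P)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Hypothetical transition matrix for k = 1; each column sums to 1.
P = [[0.9, 0.5],
     [0.1, 0.5]]
s_true = [0.7, 0.3]
# Expected randomized partial supports under the assumed relation.
s_rand = [sum(P[i][l] * s_true[l] for l in range(2)) for i in range(2)]
s_est = solve(P, s_rand)
print(s_est)
```

With exact expected values the original vector is recovered; with real randomized counts the result is only an estimate whose spread shrinks as the number of transactions grows.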
31Class of Randomizations
- Our analysis works for any randomization that satisfies two properties:
- A per-transaction randomization applies the same procedure to each transaction, using no information about other transactions
- An item-invariant randomization does not depend on any ordering or naming of items.
- Both uniform and cut-and-paste randomizations satisfy these two properties.
32Apriori
- Let $k = 1$, candidate sets = all 1-itemsets.
- Repeat:
- Count support for all candidate sets
- Output the candidate sets with support $\ge s_{\min}$
- New candidate sets = all $(k + 1)$-itemsets s.t. all their $k$-subsets are candidate sets with support $\ge s_{\min}$
- Let $k = k + 1$
- Stop when there are no more candidate sets.
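The loop above can be written compactly. This is a minimal, unoptimized sketch that assumes transactions are Python sets and generates candidates by brute force over the items that still appear in frequent sets, rather than by the usual join step.

```python
from itertools import combinations


def apriori(transactions, smin):
    """Return {itemset: support} for all itemsets with support >= smin."""
    n = len(transactions)
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while candidates:
        # Count support for all candidate sets and keep the frequent ones.
        level = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= smin:
                level[c] = s
        frequent.update(level)
        # New candidates: (k+1)-itemsets whose k-subsets are all frequent.
        items = sorted({i for c in level for i in c})
        candidates = {
            frozenset(c)
            for c in combinations(items, k + 1)
            if all(frozenset(sub) in level for sub in combinations(c, k))
        }
        k += 1
    return frequent


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori(transactions, smin=0.5)
print(sorted("".join(sorted(s)) for s in result))
```

In this toy run {b, c} falls below the threshold, so {a, b, c} is never generated as a candidate.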
33The Modified Apriori
- Let $k = 1$, candidate sets = all 1-itemsets.
- Repeat:
- Estimate support and variance ($\sigma^2$) for all candidate sets
- Output the candidate sets with support $\ge s_{\min}$
- New candidate sets = all $(k + 1)$-itemsets s.t. all their $k$-subsets are candidate sets with support $\ge s_{\min} - \sigma$
- Let $k = k + 1$
- Stop when there are no more candidate sets, or the estimator's precision becomes unsatisfactory.
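The difference from plain Apriori is that two thresholds are applied to the estimated supports: one for reporting and a more lenient one for extending candidates. A small sketch of that split follows, with a hypothetical helper name and made-up estimates; how the estimates and their standard deviations are obtained is outside this snippet.

```python
def split_candidates(estimates, smin):
    """`estimates` maps itemset -> (estimated support, sigma).

    Returns (output, extendable): itemsets reported as frequent, and the
    larger group allowed to generate candidates of the next size."""
    output = {a for a, (s, sigma) in estimates.items() if s >= smin}
    extendable = {a for a, (s, sigma) in estimates.items() if s >= smin - sigma}
    return output, extendable


# Made-up estimates, in percent.
estimates = {"x": (0.25, 0.02), "y": (0.19, 0.02), "z": (0.10, 0.02)}
output, extendable = split_candidates(estimates, smin=0.2)
print(sorted(output), sorted(extendable))
```

Keeping itemsets that fall within one standard deviation below the threshold reduces the chance that an estimation error on a small itemset prunes a truly frequent larger one.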
35Privacy Breach Analysis
- How many added items are enough to protect privacy?
- Have to satisfy $\Pr[z \in t \mid A \subseteq t'] < \rho$ ($\Leftrightarrow$ no privacy breaches)
- Select parameters so that it holds for all itemsets.
- Use formula ( )
- Parameters are to be selected in advance!
- Construct a privacy-challenging test: an itemset all of whose subsets have maximum possible support.
- Enough to know maximal support of an itemset for each size.
36Graceful Tradeoff
- Want more precision or more privacy?
- Adjust privacy breach level
- A small relaxation of privacy restrictions
results in a small increase in precision of
37Talk Outline
- Introduction
- Privacy Breaches
- Our Solution
- Experiments
- Support recovery vs. parameters
- Real-life data
- Conclusion
38Lowest Discoverable Support
- LDS is the support s.t., when predicted, it is $4\sigma$ away from zero.
- Roughly, LDS is proportional to $1/\sqrt{|T|}$
$|t| = 5$, $\rho = 50\%$
39LDS vs. Breach Level
$|t| = 5$, $|T| = 5$ M
- Reminder: breach level is the limit on $\Pr[z \in t \mid A \subseteq t']$
41Real datasets: soccer, mailorder
- Soccer is the clickstream log of WorldCup98 web site, split into sessions of HTML requests.
- 11 K items (HTMLs), 6.5 M transactions
- Available at http://www.acm.org/sigcomm/ITA/
- Mailorder is a purchase dataset from a certain on-line store
- Products are replaced with their categories
- 96 items (categories), 2.9 M transactions
- A small fraction of transactions are discarded as too long.
- longer than 10 (for soccer) or 7 (for mailorder)
42Modified Apriori on Real Data
Breach level = 50%. Inserted 20-50% items to each transaction.

Soccer: $s_{\min} = 0.2\%$, $\sigma \approx 0.07\%$ for 3-itemsets
Itemset Size   True Itemsets   True Positives   False Drops   False Positives
1              266             254              12            31
2              217             195              22            45
3              48              43               5             26

Mailorder: $s_{\min} = 0.2\%$, $\sigma \approx 0.05\%$ for 3-itemsets
Itemset Size   True Itemsets   True Positives   False Drops   False Positives
1              65              65               0             0
2              228             212              16            28
3              22              18               4             5
43False Drops, False Positives
All supports are in %.

Soccer, false drops: predicted support, when true support >= 0.2
Size   < 0.1   0.1-0.15   0.15-0.2   >= 0.2
1      0       2          10         254
2      0       5          17         195
3      0       1          4          43

Soccer, false positives: true support, when predicted support >= 0.2
Size   < 0.1   0.1-0.15   0.15-0.2   >= 0.2
1      0       7          24         254
2      7       10         28         195
3      5       13         8          43

Mailorder, false drops: predicted support, when true support >= 0.2
Size   < 0.1   0.1-0.15   0.15-0.2   >= 0.2
1      0       0          0          65
2      0       1          15         212
3      0       1          3          18

Mailorder, false positives: true support, when predicted support >= 0.2
Size   < 0.1   0.1-0.15   0.15-0.2   >= 0.2
1      0       0          0          65
2      0       0          28         212
3      1       2          2          18
44Actual Privacy Breaches
- Verified actual privacy breach levels
- The breach probabilities are counted in the datasets for frequent and near-frequent itemsets.
- If maximum supports were estimated correctly, even worst-case breach levels fluctuated around 50%
- At most 53.2% for soccer,
- At most 55.4% for mailorder.
46Conclusion
- Privacy breaches: identified the problem and provided a solution for controlling breaches
- Derived estimators of support and variance for a class of randomization operators
- Algorithm for discovering associations in randomized data
- Validated on real-life datasets
- Can find associations while preserving privacy at the level of individual transactions
47Future Work
- Control of more general privacy breaches
- What about other properties of transactions, for example "item $z \notin t$" (breach caused by "$A \subseteq t'$")?
- What about external information?
- Theoretical limits of discoverability for a given privacy breach level
- How to compute theoretical limits?
- How to attain them by an algorithm?
48Thank You!
50Our Solution: Example
- Old set-up:
- Given 10,000 items, 10 M transactions of size 10
- 100,000 transactions (1%) contain A = {x, y, z}
- In addition to uniform randomization with p = 80%, insert 500 new random items to each transaction.
- 800 transactions contain {x, y, z} before and after
- Roughly (10 M) $\cdot$ (500 / 10,000)$^3$ = 1250 transactions contain none before and full {x, y, z} after.
- Presence of {x, y, z} in a randomized transaction now says little about the original transaction.
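The arithmetic behind this example is short enough to check directly. The sketch below only compares the two groups named on the slide and ignores transactions that originally held one or two of the items, so the resulting share is a rough upper figure rather than an exact breach probability.

```python
N = 10_000_000     # transactions
ITEMS = 10_000     # distinct items
INSERTED = 500     # false items added to each transaction

true_hits = 800                             # {x, y, z} before and after (from the slide)
false_hits = N * (INSERTED / ITEMS) ** 3    # none before, all three inserted

# Share of randomized transactions with {x, y, z} that really had it,
# counting only these two groups.
share = true_hits / (true_hits + false_hits)
print(round(false_hits), share)
```

Under this simplification fewer than half of the randomized transactions showing {x, y, z} truly contained it, compared with about 98% under uniform randomization alone.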
51Privacy Breach Analysis
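- Given itemset $A$ and item $z \in A$
- Assume that partial supports are probabilities: $s_l = \Pr[\,|t \cap A| = l\,]$
- Define $s_l^{+} = \Pr[\,|t \cap A| = l,\ z \in t\,]$
- Then we have $\Pr[z \in t \mid A \subseteq t'] = \sum_{l=0}^{k} s_l^{+} P_{kl} \,/\, \sum_{l=0}^{k} s_l P_{kl}$, where $P_{kl} = \Pr[A \subseteq t' \mid |t \cap A| = l]$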
52Limiting Privacy Breaches
- We want to make sure that always $\Pr[z \in t \mid A \subseteq t'] < \rho$
- But we do not know supports in advance.
- Solution: For each itemset size $k$, give privacy-challenging test values to the partial supports.
- It is an itemset whose subsets have maximum supports
- We need to estimate maximum support values prior to randomization
53LDS vs. Transaction Size
$\rho = 50\%$, $|T| = 5$ M
- Too long transactions cannot be used for
54Related Work
- R. Agrawal, R. Srikant: Privacy Preserving Data Mining, SIGMOD 2000
- Each client has a numerical attribute $x_i$
- Client $i$ sends $x_i + y_i$, where $y_i$ is a random offset with known distribution
- Server reconstructs the distribution of original attributes (EM algorithm)
- The distribution is then used for classification
- Numerical attributes only
55Related Work
- Y. Lindell and B. Pinkas: Privacy Preserving Data Mining, Crypto 2000
- J. Vaidya and C. Clifton: Privacy Preserving Association Rule Mining in Vertically Partitioned
56Privacy Concern
- Popular press:
- The End of Privacy, The Death of Privacy
- Government directives:
- European directive on privacy protection (Oct 98)
- Canadian Personal Information Protection Act (Jan 2001)
- Surveys of Web users:
- 17% fundamentalists, 56% pragmatic majority, 27% marginally concerned (April 99)
- 82% said having privacy would matter (July 99)