Title: SECURED OUTSOURCING OF FREQUENT ITEMSET MINING
1SECURED OUTSOURCING OF FREQUENT ITEMSET MINING
- Hana Chih-Hua Tai
- Institute of Information Science, Academia Sinica
2OUTLINE
- Preliminary: Frequent Itemset Mining
- Motivation
- Privacy Model: K-Support Anonymity
- Algorithm
- Performance Studies
- Conclusion
4FREQUENT ITEMSET MINING (FIM)
- Discover what happened frequently
- Frequent itemset mining (FIM)
When the threshold is set to 3 (60%), {wine} and {cigar} are frequent.
When the threshold is set to 2 (40%), {wine}, {cigar}, {tea}, {beer},
{wine, cigar}, and {wine, beer} are frequent.
Trans. ID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
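The thresholds on this slide can be checked with a small brute-force miner. This is only a sketch for the five example transactions: the function names are illustrative, and a real miner (for example Apriori or FP-growth) would prune candidates instead of enumerating all of them.

```python
from itertools import combinations

# The five example transactions from the slide.
TRANSACTIONS = [
    {"wine"},
    {"cigar", "wine"},
    {"cigar", "tea"},
    {"beer", "cigar", "wine"},
    {"beer", "tea", "wine"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_sup):
    """Brute-force FIM: test every candidate itemset against min_sup."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            sup = support(set(candidate), transactions)
            if sup >= min_sup:
                result[candidate] = sup
    return result

print(frequent_itemsets(TRANSACTIONS, 3))
print(frequent_itemsets(TRANSACTIONS, 2))
```

With a threshold of 3 only the single items wine and cigar survive; with a threshold of 2 the pairs {beer, wine} and {cigar, wine} appear as well.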
6THE NEEDS OF OUTSOURCING FIM
- Data owners who lack expertise in FIM and/or
computing resources need to outsource the mining
tasks to a professional third party.
Data Owner
Privacy?!
Mining Services Provider (Cloud Computing)
8THE RISKS OF OUTSOURCING FIM
- An encryption/decryption method is believed to be a
possible solution.
How can the encryption and decryption be achieved?
Mining Services Provider (Cloud Computing)
10THE RISKS OF OUTSOURCING FIM
Original transactions:
Trans. ID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
After Encrypt:
Trans. ID Items
1 a
2 a, c
3 c, d
4 a, b, c
5 a, b, d
- Top frequency attack
- Wine is the most frequent item => a is wine
- Approximate support attack
- The support of cigar is about 55% to 60% => c is cigar
The support information about the frequent
itemsets can be utilized to effectively reveal
the raw data as well as the sensitive information
from the anonymized transactions. T.
Mielikainen. Privacy problems with anonymized
transaction databases. In Proc. of Discovery
Science, 2004.
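Both attacks can be sketched in a few lines against the substituted table above. The function names are illustrative, and the attacker's background knowledge (that wine is the top item, and that cigar's support lies between 55% and 60%, which is how the slide's figure is read here) is passed in as plain arguments.

```python
from collections import Counter

# The encrypted example transactions from the slide.
ENCRYPTED = [{"a"}, {"a", "c"}, {"c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]

def item_supports(transactions):
    """Relative support of every item that occurs in the transactions."""
    counts = Counter(item for t in transactions for item in t)
    return {item: c / len(transactions) for item, c in counts.items()}

def top_frequency_attack(transactions, most_frequent_real_item):
    """Guess that the most frequent cipher item is the most frequent real item."""
    supports = item_supports(transactions)
    return {most_frequent_real_item: max(supports, key=supports.get)}

def approximate_support_attack(transactions, real_item, low, high):
    """List the cipher items whose support falls in the known range."""
    supports = item_supports(transactions)
    return {real_item: sorted(c for c, s in supports.items() if low <= s <= high)}

print(top_frequency_attack(ENCRYPTED, "wine"))
print(approximate_support_attack(ENCRYPTED, "cigar", 0.55, 0.60))
```

Because every cipher item here has a distinct or nearly distinct support, each guess narrows to a single candidate, which is exactly the weakness the slide points out.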
13RELATED WORKS
- Encrypt each real item by a one-to-many mapping function.
- Wong, W. K., Cheung, D. W., Hung, E., Kao, B., Mamoulis, N. Security in Outsourcing of Association Rule Mining. In Proc. of VLDB, 2007.
- However, it does not try to anonymize the support information.
- It has recently been cracked.
- Molloy, I., Li, N., Li, T. On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining. In Proc. of ICDM, 2009.
15K-SUPPORT ANONYMITY ANONYMIZATION
- For every sensitive item, there are at least k-1 other items of the same support.
- The probability of an item being correctly re-identified is limited to 1/k, even when the precise support information is known.
- Given a transactional database T, encrypt T into E(T) such that
- There exists a decryption function D such that MiningResult(T, min_sup) = D(MiningResult(E(T), min_sup)), for any minimum support min_sup.
- E(T) is k-support anonymous.
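The first condition can be checked mechanically. The sketch below is an illustration, not the authors' code: it counts only items that occur in the transactions themselves (it ignores taxonomy nodes), and the function names are hypothetical.

```python
from collections import Counter

def support_groups(transactions):
    """Group the items by their support count."""
    counts = Counter(item for t in transactions for item in t)
    groups = {}
    for item, sup in counts.items():
        groups.setdefault(sup, set()).add(item)
    return groups

def is_k_support_anonymous(transactions, sensitive_items, k):
    """True if every sensitive item shares its support with at least k-1 other items."""
    for group in support_groups(transactions).values():
        if group & sensitive_items and len(group) < k:
            return False
    return True

# The plainly substituted example table (a = wine, b = beer, c = cigar, d = tea).
SUBSTITUTED = [{"a"}, {"a", "c"}, {"c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]

print(is_k_support_anonymous(SUBSTITUTED, {"a", "b", "c", "d"}, 2))
print(is_k_support_anonymous(SUBSTITUTED, {"b", "d"}, 2))
```

The first call fails because a and c have unique supports; the second succeeds because b and d both have support 2.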
16SOLUTION 1: A NAÏVE APPROACH
- For each set of real items of the same support,
add enough fake items randomly into transactions
to make the fake items as frequent as the real ones.
Trans. ID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
After encryption with fake items:
Items
a, e, g, h, i
a, c, e, f, h, i
c, d, e, f, g
a, b, c, f, h
a, b, d, e, f, g
For k = 3, 16 additional items are required:
- 4 x 2 = 8 (e, f) for wine
- 3 x 2 = 6 (g, h) for cigar
- 2 x 1 = 2 (i) for beer and tea
The storage overhead could be too large when k
is large.
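The overhead count can be reproduced with a short sketch of the naïve approach. The fake item names and the seeded random generator are assumptions made for illustration; the slide only says that fake items are added randomly.

```python
import random
from collections import Counter

TRANSACTIONS = [
    {"wine"},
    {"cigar", "wine"},
    {"cigar", "tea"},
    {"beer", "cigar", "wine"},
    {"beer", "tea", "wine"},
]

def naive_anonymize(transactions, k, seed=0):
    """Add fake items until every support value is shared by at least k items.

    Returns the padded transactions and the number of item occurrences added.
    """
    rng = random.Random(seed)
    result = [set(t) for t in transactions]
    counts = Counter(item for t in transactions for item in t)
    group_sizes = Counter(counts.values())  # support value -> number of real items
    added = 0
    fake_id = 0
    for sup, size in sorted(group_sizes.items(), reverse=True):
        for _ in range(max(0, k - size)):
            fake = f"fake{fake_id}"  # hypothetical naming scheme
            fake_id += 1
            for tid in rng.sample(range(len(result)), sup):
                result[tid].add(fake)
            added += sup
    return result, added

padded, overhead = naive_anonymize(TRANSACTIONS, k=3)
print(overhead)
```

The overhead is (k - group size) x support summed over the support groups, so it grows linearly with k for every group, which is the storage problem noted on the slide.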
18GENERALIZED FIM
- Discover all frequent itemsets across concept
levels, given a taxonomy indicating the
hierarchical concepts between items
Trans. ID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
When the threshold is set to 3 (60%), {wine}, {cigar},
{alcoholic}, {beverage}, and {all prod.} are
frequent. {beverage, cigar} is also frequent.
1. The support of a parent node comes from the
supports of its child nodes.
2. Only leaf nodes need to appear in the transactions.
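A minimal sketch of generalized mining on this example follows. The taxonomy figure did not survive in the transcript, so the `PARENT` map below is an assumption inferred from the item names (wine and beer under alcoholic, alcoholic and tea under beverage, beverage and cigar under all prod.). Itemsets that pair an item with its own ancestor are skipped, as is usual in generalized mining.

```python
from itertools import combinations

TRANSACTIONS = [
    {"wine"},
    {"cigar", "wine"},
    {"cigar", "tea"},
    {"beer", "cigar", "wine"},
    {"beer", "tea", "wine"},
]

# Assumed taxonomy (child -> parent); not taken verbatim from the slide.
PARENT = {
    "wine": "alcoholic", "beer": "alcoholic",
    "alcoholic": "beverage", "tea": "beverage",
    "beverage": "all prod.", "cigar": "all prod.",
}

def ancestors(item):
    """All nodes on the path from `item` up to the root."""
    result = set()
    while item in PARENT:
        item = PARENT[item]
        result.add(item)
    return result

def extend(transaction):
    """Add every ancestor, so a parent's support follows from its leaves."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return extended

def generalized_frequent_itemsets(transactions, min_sup):
    extended = [extend(t) for t in transactions]
    items = sorted(set().union(*extended))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Skip itemsets that contain an item together with its ancestor.
            if any(a in ancestors(b) or b in ancestors(a)
                   for a, b in combinations(candidate, 2)):
                continue
            sup = sum(1 for t in extended if set(candidate) <= t)
            if sup >= min_sup:
                result[candidate] = sup
    return result

print(generalized_frequent_itemsets(TRANSACTIONS, 3))
```

Only leaf items are stored in `TRANSACTIONS`; the internal nodes appear solely through `extend`, which mirrors the two notes on the slide.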
21ANONYMIZATION OVERVIEW
- For storage efficiency, we suggest converting FIM to GFIM.
[Figure: workflow between Data Owner and Third Party]
- Data Owner: Encrypt Transaction Data (Pseudo Taxonomy Generation in the Encryption)
- Sent to Third Party: Encrypted Transaction Data, Pseudo Taxonomy
- Third Party: Generalized Frequent Itemset Mining
- Returned to Data Owner: Frequent Itemsets
- Data Owner: Decrypt Frequent Itemsets
22ANONYMIZATION STORAGE EFFICIENCY
- In GFIM, items can be at multiple levels of a
taxonomy, and only the items at the leaf level need to
appear in the database.
Trans. ID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
Encrypt with k = 3
Trans. ID Items
1 c, d, g
2 b, d, g
3 b, h
4 a, b, c
5 a, c, d, h
4 additional items required
Small storage overhead compared to the naïve
method.
24ANONYMIZATION EASY DECRYPTION
- The real frequent itemsets can be obtained by
filtering out patterns containing any fake item
in 1 scan of the returned results.
min_sup = 2
Original transactions:
Trans. ID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
Results: {beer}, {cigar}, {wine}, {tea},
{beer, wine}, {cigar, wine}
Encrypted transactions:
Trans. ID Items
1 c, d, g
2 b, d, g
3 b, h
4 a, b, c
5 a, c, d, h
Results: {a}, {b}, {c}, {d}, {e}, {f}, {g}, {h}, {i}, {j}, {k},
{a, c}, {a, f}, {b, f}, {c, e}, ...
The data owner can obtain the real results in 1
scan of the returned itemsets.
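The one-scan decryption amounts to a filter plus a rename. In the sketch below the mapping from encrypted to real items is an assumption: it is inferred by matching supports in the two example tables and the two result lists (a = beer, b = cigar, h = tea, f = wine), since the slide does not state it. Everything else in the returned results is treated as fake.

```python
# Assumed mapping of encrypted real items to their plaintext names.
REAL_ITEMS = {"a": "beer", "b": "cigar", "f": "wine", "h": "tea"}

# The returned itemsets listed on the slide (the list is truncated there).
RETURNED = [
    {"a"}, {"b"}, {"c"}, {"d"}, {"e"}, {"f"}, {"g"}, {"h"}, {"i"}, {"j"}, {"k"},
    {"a", "c"}, {"a", "f"}, {"b", "f"}, {"c", "e"},
]

def decrypt(returned_itemsets, real_items):
    """Keep itemsets made only of real items and translate them, in one pass."""
    real_results = []
    for itemset in returned_itemsets:
        if all(item in real_items for item in itemset):
            real_results.append(frozenset(real_items[item] for item in itemset))
    return real_results

for itemset in decrypt(RETURNED, REAL_ITEMS):
    print(sorted(itemset))
```

The data owner only needs the small mapping table to do this, so the cost is linear in the size of the returned result.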
26ANONYMIZATION ENCRYPTION
The problem is how to build the taxonomy and
encrypt T for k-support anonymity.
Trans. ID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
Trans. ID Items
1 c, d, g
2 b, d, g
3 b, h
4 a, b, c
5 a, c, d, h
Encrypt with k = 3
27ANONYMIZATION ENCRYPTION
- 1. Generalization of the Mining Task
- To generate a pseudo taxonomy that can
- (a) conserve the correct and complete mining results,
- (b) facilitate k-support anonymization.
- 2. Anonymization with Taxonomy Tree
- To encrypt T for k-support anonymity with the
help of the constructed taxonomy tree.
281 GENERALIZATION OF THE MINING TASK
- Build a k-bud tree of T
- All real items are at the leaf level
- The number of nodes in the following three categories is equal
to or greater than k
- Let x_M denote the most frequent real item in T
- A_> = {v | sup(v) > sup(x_M) and v is a leaf},
- A_= = {v | sup(v) = sup(x_M)}, and
- A_< = {v | sup(v) < sup(x_M) < sup(u), where u is
the parent node of v}.
Trans. ID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
[Figure: 3-bud tree]
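The three categories can be computed directly from a taxonomy and the transactions. The tree below is hypothetical (the slide's figure is not recoverable from the transcript), and reading the size condition as |A_>| + |A_=| + |A_<| >= k is an assumption.

```python
TRANSACTIONS = {
    1: {"wine"},
    2: {"cigar", "wine"},
    3: {"cigar", "tea"},
    4: {"beer", "cigar", "wine"},
    5: {"beer", "tea", "wine"},
}

# Hypothetical pseudo taxonomy (child -> parent), NOT the tree drawn on the slide.
PARENT = {"wine": "n1", "tea": "n1", "beer": "n2", "cigar": "n2",
          "n1": "root", "n2": "root"}

def leaves_under(node):
    """Leaf items in the subtree rooted at `node`."""
    children = [c for c, p in PARENT.items() if p == node]
    if not children:
        return {node}
    return set().union(*(leaves_under(c) for c in children))

def sup(node):
    """A node is supported by every transaction containing one of its leaves."""
    leaves = leaves_under(node)
    return sum(1 for items in TRANSACTIONS.values() if items & leaves)

def bud_categories(real_items):
    target = max(sup(x) for x in real_items)  # sup(x_M)
    nodes = set(PARENT) | set(PARENT.values())
    internal = set(PARENT.values())
    greater = {v for v in nodes if v not in internal and sup(v) > target}
    equal = {v for v in nodes if sup(v) == target}
    less = {v for v in nodes
            if v in PARENT and sup(v) < target < sup(PARENT[v])}
    return greater, equal, less

a_gt, a_eq, a_lt = bud_categories({"wine", "cigar", "tea", "beer"})
print(a_gt, a_eq, a_lt)
```

For this tree the categories contain three nodes in total (wine and n2 with support 4, and tea, which sits below a node of support 5), so under the stated reading it would qualify as a 3-bud tree.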
291 GENERALIZATION OF THE MINING TASK
Trans. ID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
[Figure: 3 groups: beer and cigar; wine; tea]
301 GENERALIZATION OF THE MINING TASK
[Figure: 3 subtrees, node labels 4, 2, 4, 2, 3, leaves (beer), (wine), (cigar), (tea)]
321 GENERALIZATION OF THE MINING TASK
[Figure: 3-bud tree]
332 ANONYMIZATION WITH TAXONOMY TREE
- Alter the k-bud tree and modify T simultaneously
to achieve k-support anonymity
- Insertion
- Split
- Increase
342 ANONYMIZATION WITH TAXONOMY TREE
- Alter the k-bud tree and modify T simultaneously
to achieve k-support anonymity
- Insertion (Ex.)
- Split
- Increase
sup(v) < target-sup < sup(u)
p: the node with the target support
q: randomly select sup(p) - sup(v) transactions from T(u) - T(v)
T(x) is the set of transactions containing
the item x.
sup(u) and sup(v) should not be changed.
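A minimal sketch of the Insertion operation, under the reading that the new node p gets two children, the existing node v and a new fake leaf q. The choice of v = tea and u = the root in the usage lines is an assumption made for illustration, and the seeded generator stands in for "randomly select".

```python
import random

def insertion(T, t_u, t_v, target_sup, new_leaf, seed=0):
    """Create T(p) with |T(p)| = target_sup by adding a fake leaf q.

    T maps transaction IDs to item sets; t_u and t_v are the ID sets T(u), T(v).
    The fake leaf is written into target_sup - sup(v) transactions of T(u) - T(v),
    so neither sup(u) nor sup(v) changes. Returns T(p) = T(v) | T(q).
    """
    if not (t_v <= t_u and len(t_v) < target_sup < len(t_u)):
        raise ValueError("need T(v) within T(u) and sup(v) < target-sup < sup(u)")
    rng = random.Random(seed)
    t_q = set(rng.sample(sorted(t_u - t_v), target_sup - len(t_v)))
    for tid in t_q:
        T[tid].add(new_leaf)
    return t_v | t_q

T = {1: {"wine"}, 2: {"cigar", "wine"}, 3: {"cigar", "tea"},
     4: {"beer", "cigar", "wine"}, 5: {"beer", "tea", "wine"}}
t_u = set(T)                                              # root: all transactions
t_v = {tid for tid, items in T.items() if "tea" in items}  # T(tea) = {3, 5}
t_p = insertion(T, t_u, t_v, target_sup=4, new_leaf="p1")
print(len(t_p))
```

The new internal node ends up with the same support as wine (4), while the fake leaf p1 itself has support 2.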
352 ANONYMIZATION WITH TAXONOMY TREE
Insertion (for wine):
TID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
After insertion:
TID Items
1 wine, p1
2 cigar, wine, p1
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
[Figure: 3-bud tree after insertion, with nodes y, x, p1]
362 ANONYMIZATION WITH TAXONOMY TREE
- Alter the k-bud tree and modify T simultaneously
to achieve k-support anonymity
- Insertion
- Split (Ex.)
- Increase
target-sup < sup(v)
p: randomly select target-sup transactions from T(v)
q: chosen so that T(p) ∪ T(q) = T(v)
T(x) is the set of transactions containing the item x.
sup(v) should not be changed.
The Split operation can raise leaf nodes up to
internal nodes!
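A minimal sketch of the Split operation. Taking T(q) = T(v) - T(p) is the simplest choice that keeps T(p) ∪ T(q) = T(v) and is an assumption here; the leaf names p2 and p3 follow the worked example on the neighbouring slides.

```python
import random

def split(T, item_v, target_sup, leaf_p, leaf_q, seed=0):
    """Replace leaf v by two new leaves p and q; v becomes their parent.

    T maps transaction IDs to item sets. sup(p) = target_sup and
    T(p) | T(q) = T(v), so sup(v) is unchanged. Returns (T(p), T(q)).
    """
    t_v = {tid for tid, items in T.items() if item_v in items}
    if not target_sup < len(t_v):
        raise ValueError("need target-sup < sup(v)")
    rng = random.Random(seed)
    t_p = set(rng.sample(sorted(t_v), target_sup))
    t_q = t_v - t_p
    for tid in t_v:
        T[tid].discard(item_v)  # v is now internal; only leaves are stored
        T[tid].add(leaf_p if tid in t_p else leaf_q)
    return t_p, t_q

T = {1: {"wine"}, 2: {"cigar", "wine"}, 3: {"cigar", "tea"},
     4: {"beer", "cigar", "wine"}, 5: {"beer", "tea", "wine"}}
t_p, t_q = split(T, "wine", target_sup=3, leaf_p="p2", leaf_q="p3")
print(len(t_p), len(t_q))
```

After the call, wine no longer appears in any transaction: it has become an internal node whose support of 4 is still recoverable from its two children.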
372 ANONYMIZATION WITH TAXONOMY TREE
After insertion (for wine):
TID Items
1 wine, p1
2 cigar, wine, p1
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
After split (for cigar):
TID Items
1 p1, p2
2 cigar, p1, p3
3 cigar, tea
4 beer, cigar, p2
5 beer, tea, p2
[Figure: 3-bud tree after insertion and split, with nodes y, x, p1, p2, p3]
382 ANONYMIZATION WITH TAXONOMY TREE
- Alter the k-bud tree and modify T simultaneously
to achieve k-support anonymity
- Insertion
- Split
- Increase (Ex.)
sup(v) < target-sup
randomly select target-sup - sup(v) transactions
from T(u) - T(v)
The support of a node that already belongs to an anonymous
group should not be changed. So, the Increase
operation is applicable only to a node that does
not belong to any anonymous group!
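A minimal sketch of the Increase operation, run on the state the worked example reaches after the split. Taking u to be the former wine node, with T(u) = {1, 2, 4, 5}, is a reading of the example rather than something the slide states.

```python
import random

def increase(T, leaf_v, t_u, target_sup, seed=0):
    """Raise sup(v) to target_sup by adding v to transactions of T(u) - T(v).

    T maps transaction IDs to item sets; t_u is the ID set T(u) of v's parent.
    Only transactions already in T(u) are used, so sup(u) stays the same.
    Returns the new T(v).
    """
    t_v = {tid for tid, items in T.items() if leaf_v in items}
    if not (t_v <= t_u and len(t_v) < target_sup <= len(t_u)):
        raise ValueError("need T(v) within T(u) and sup(v) < target-sup <= sup(u)")
    rng = random.Random(seed)
    extra = rng.sample(sorted(t_u - t_v), target_sup - len(t_v))
    for tid in extra:
        T[tid].add(leaf_v)
    return t_v | set(extra)

# State after insertion and split, as in the worked example.
T = {1: {"p1", "p2"}, 2: {"cigar", "p1", "p3"}, 3: {"cigar", "tea"},
     4: {"beer", "cigar", "p2"}, 5: {"beer", "tea", "p2"}}
t_u = {1, 2, 4, 5}  # assumed T(u): the transactions of the former wine node
new_t_v = increase(T, "p3", t_u, target_sup=3)
print(len(new_t_v))
```

Because the operation changes sup(v), it is only safe on a node such as p3 that has not yet been placed in an anonymous group.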
392 ANONYMIZATION WITH TAXONOMY TREE
After split (for cigar):
TID Items
1 p1, p2
2 cigar, p1, p3
3 cigar, tea
4 beer, cigar, p2
5 beer, tea, p2
After increase (for cigar):
TID Items
1 p1, p2, p3
2 cigar, p1, p3
3 cigar, tea
4 beer, cigar, p2
5 beer, tea, p2, p3
[Figure: 3-bud tree after insertion, split, and increase, with nodes y, x, p1, p2, p3]
402 ANONYMIZATION WITH TAXONOMY TREE
3-support anonymity
Original:
TID Items
1 wine
2 cigar, wine
3 cigar, tea
4 beer, cigar, wine
5 beer, tea, wine
After insertion (for wine), split (for cigar), and increase (for cigar):
TID Items
1 p1, p2, p3
2 cigar, p1, p3
3 cigar, tea
4 beer, cigar, p2
5 beer, tea, p2, p3
Encrypted:
TID Items
1 c, d, g
2 b, d, g
3 b, h
4 a, b, c
5 a, c, d, h
[Figure: 3-bud tree with nodes y, x, p1, p2, p3]
42PERFORMANCE STUDIES
- Data sets
- Retail dataset
- 88162 transactions with 2117 different items
- T10I1kD100k dataset
- 100k transactions with 1000 different items
- Security
- Against precise item support attacks
- Against precise itemset support attacks
- Storage overhead
- Execution efficiency
43SECURITY
- Against precise item support attacks
- Item accuracy: The ratio of items being re-identified
- DB accuracy: The avg. ratio of items in a
transaction being re-identified
[Figure: (a) Retail dataset, (b) T10I1kD100k dataset]
44SECURITY
- Against precise itemset support attacks
- Item accuracy: The ratio of items being re-identified
- DB accuracy: The avg. ratio of items in a
transaction being re-identified
[Figure: (a) Retail dataset, (b) T10I1kD100k dataset]
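The two measures can be sketched as follows. The exact denominators used in the experiments are not given on the slides, so this version simply divides by the number of encrypted items and by the transaction length; the mappings in the usage lines are a made-up attack outcome on the small substituted example, not experimental data.

```python
def item_accuracy(true_mapping, guessed_mapping):
    """Fraction of encrypted items whose real identity was guessed correctly."""
    correct = sum(1 for c, real in true_mapping.items()
                  if guessed_mapping.get(c) == real)
    return correct / len(true_mapping)

def db_accuracy(encrypted_transactions, true_mapping, guessed_mapping):
    """Average, over transactions, of the fraction of items re-identified."""
    ratios = []
    for t in encrypted_transactions:
        correct = sum(1 for c in t if guessed_mapping.get(c) == true_mapping[c])
        ratios.append(correct / len(t))
    return sum(ratios) / len(ratios)

ENCRYPTED = [{"a"}, {"a", "c"}, {"c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]
TRUE = {"a": "wine", "b": "beer", "c": "cigar", "d": "tea"}
GUESSED = {"a": "wine", "c": "cigar"}  # hypothetical attack outcome

print(item_accuracy(TRUE, GUESSED))
print(db_accuracy(ENCRYPTED, TRUE, GUESSED))
```

DB accuracy weights frequent items more heavily than item accuracy does, which is why re-identifying only the two most frequent items already exposes most of each transaction in this toy case.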
45STORAGE OVERHEAD
EXECUTION EFFICIENCY
46SUMMARY
- We proposed k-support anonymity to enhance the
privacy protection in outsourcing of frequent
itemset mining (FIM).
- For storage efficiency, we transformed FIM to
GFIM, and proposed a taxonomy-based anonymization
algorithm.
- Our method allows the data owner to obtain the
real frequent itemsets in 1 scan of the returned
results.
- Experimental results on both real and synthetic
data sets showed that our method can achieve very
good privacy protection with moderate storage
overhead.
47Q & A