Fault-tolerant Frequent Patterns Mining

Provided by: Hsu
1
Fault-tolerant Frequent Patterns Mining
2
Introduction
  • Traditional association rule mining
  • Extracts only exactly matching patterns

Head cold
Symptoms: coughing, runny nose, headache, sore throat, fever, palpitations,
vomiting
Treatment: Vit-C, ...
Fever over 38°C, sore throat, headache
3
Introduction
  • Fault-tolerant mining
  • Allows limited inexactness

Head cold
Symptoms: coughing, runny nose, headache, sore throat, fever, palpitations,
vomiting
Treatment: Vit-C, ...
Fever over 38°C, sore throat, headache
4
Introduction
  • Previous work
  • [YFB01] Discovers groups of similar transactions that share most
    items.
  • Focuses on transactions, not items
  • Sparse pattern problem

tid   item: 1  2  3  4  5  6
010         1  1  1  1  0  0
020         1  1  1  1  0  0
030         1  1  1  1  0  0
040         1  1  1  1  0  1
050         0  0  0  0  1  0
060         0  0  0  0  1  0
δ = 0.8, min_sup = 4
5
Introduction
  • Previous work
  • [PTH01, WL02] Mine patterns that tolerate up to δ mismatched items
    within the pattern.
  • Unfair pattern problem: a fixed number of items is tolerated no
    matter how long the itemset is

6
Problem description
  • Proportional fault-tolerant pattern mining
  • Finding patterns X such that the items in each sub-pattern of X with
    length ⌈|X|·δ⌉ frequently occur together.
  • For example
  • X = {a, b, c, d}, δ = 0.75
  • X is an FT-pattern => {a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}
    frequently occur

7
Problem description-definition
  • A transaction t FT-contains pattern X iff t contains x, where x is a
    sub-pattern of X and |x| ≥ ⌈|X|·δ⌉ (δ is the fault-tolerance parameter)
  • supFT(X): the number of transactions that FT-contain X.
  • supitemB(X)(x): the number of transactions that contain item x among
    the transactions that FT-contain X.

tid   items
010   c d e
020   b d e
030   a d e
040   a b c
050   a b c
X = abcd, δ = 0.75
supFT(X) = |{040, 050}| = 2
supitemB(X)(a) = |{040, 050}| = 2
supitemB(X)(d) = 0
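
The two support measures can be checked against the table above with a short Python sketch. It assumes that a transaction FT-contains X when it holds at least ⌈|X|·δ⌉ items of X (an assumption consistent with the δ = 0.75 example); the function names are hypothetical and not taken from the slides.

```python
import math
from fractions import Fraction

def required_length(pattern, delta):
    # Assumed rule: at least ceil(|X| * delta) items of X must be present.
    # Fraction avoids floating-point surprises in the ceiling.
    return math.ceil(len(pattern) * Fraction(delta))

def ft_contains(transaction, pattern, delta):
    # True if the transaction holds enough items of the pattern.
    return len(set(transaction) & set(pattern)) >= required_length(pattern, delta)

def sup_ft(tdb, pattern, delta):
    # Number of transactions that FT-contain the pattern.
    return sum(ft_contains(t, pattern, delta) for t in tdb.values())

def sup_item(tdb, pattern, delta, item):
    # Number of transactions that contain `item` among those FT-containing the pattern.
    return sum(item in set(t) and ft_contains(t, pattern, delta) for t in tdb.values())

# The five transactions from the slide.
tdb = {"010": "cde", "020": "bde", "030": "ade", "040": "abc", "050": "abc"}

print(sup_ft(tdb, "abcd", "0.75"))         # 2
print(sup_item(tdb, "abcd", "0.75", "a"))  # 2
print(sup_item(tdb, "abcd", "0.75", "d"))  # 0
```

Item d has item support 0 because neither 040 nor 050, the only transactions that FT-contain abcd, holds d.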
8
Problem description-definition
  • A pattern X is an FT-pattern iff
  • 1. supFT(X) ≥ min_supFT
  • 2. For each item x in X, supitemB(X)(x) ≥ min_supitem

9
Problem description-observation
10
Problem description-challenge
  • The sets of patterns separated by a gap (a pattern length at which
    the allowed number of faults increases) are independent, i.e., the
    anti-monotone property does not hold across the gap.

δ = 0.6, min_supFT = 5, min_supitem = 2
fault(3) = 1, fault(4) = 1, fault(5) = 2

Candidates, listed as pattern, supFT, (item supports):
C5: abcde 5, (3, 2, 3, 3, 3)
C4: abcd 2, ----; abce 2, ----; abde 2, ----; acde 2, ----; bcde 2, ----
C3: abc 2, ---; ade 3, ---; abd 4, ---; bcd 4, ---; abe 4, ---;
    bce 4, ---; acd 4, ---; bde 3, ---; ace 4, ---; cde 3, ---

TDB:
c d e
b d e
a d e
a b c
a b c
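
The fault counts and supFT values in this example can be reproduced with the sketch below. It assumes fault(n) = n - ⌈n·δ⌉, which matches the listed values fault(3) = 1, fault(4) = 1 and fault(5) = 2 but is not stated explicitly in the slides; `gaps` is a hypothetical helper that lists the lengths at which the fault count increases.

```python
import math
from fractions import Fraction

def fault(n, delta):
    # Assumed definition: number of items a length-n pattern may miss.
    return n - math.ceil(n * Fraction(delta))

def gaps(max_len, delta):
    # Lengths n at which one more fault is allowed than at length n - 1.
    return [n for n in range(2, max_len + 1)
            if fault(n, delta) > fault(n - 1, delta)]

def sup_ft(tdb, pattern, delta):
    # A transaction FT-contains the pattern if it misses at most fault(|pattern|) items.
    need = len(pattern) - fault(len(pattern), delta)
    return sum(len(set(t) & set(pattern)) >= need for t in tdb)

tdb = ["cde", "bde", "ade", "abc", "abc"]

print([fault(n, "0.6") for n in (3, 4, 5)])  # [1, 1, 2]
print(gaps(5, "0.6"))                        # [3, 5]
print(sup_ft(tdb, "abcd", "0.6"))            # 2
print(sup_ft(tdb, "abcde", "0.6"))           # 5
```

Every length-4 candidate falls below min_supFT = 5, yet abcde reaches 5 because it sits on the other side of the gap and tolerates one more fault. This is why plain Apriori pruning cannot be applied across the gap.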
11
Approaches
  • Lemma 1 (Extended Fault-tolerant Apriori): If X is not an
    FT-pattern, then none of its supersets with the same number of
    faults will be an FT-pattern.

12
Approaches
  • Lemma 2: Given a pattern X and the set of its sub-patterns
    set(Xsubpattern), where |P| = |X| - 1 for every pattern P in
    set(Xsubpattern). Moreover, let fault(|X|) - 1 = fault(|X| - 1)
    (i.e., X and the considered subsets are separated by the gap). If
    X is not a frequent FT-pattern, then one of the following two
    cases applies:
  • Case 1: if supFT(X) < min_supFT, then no pattern P in
    set(Xsubpattern) can be an FT-pattern.
  • Case 2: else if supitemB(X)(xj) < min_supitem, where xj denotes an
    item contained in X, then no pattern in set(Xsubpattern) that
    contains item xj can be an FT-pattern.

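Pattern lattice, each pattern listed with its allowed number of faults:
abcde: 2
abcd: 1, abce: 1, abde: 1, acde: 1, bcde: 1
abc: 1, ade: 1, abd: 1, bcd: 1, abe: 1, bce: 1, acd: 1, bde: 1, ace: 1,
cde: 1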
13
FT-LevelWise
14
Observation
  • Let δ = 0.5; the pattern {milk, bread, pencil, eraser}, which seems
    meaningless, would then be mined
  • It is hard to understand the relationships between items by
    observing the TDB directly
  • Items that never appear in the same transaction may still be
    combined into an FT-pattern

15
FT-association graph
  • FT-association graph
  • The transactions in the TDB are scanned one by one
  • When an item x is scanned for the first time, a node is constructed
    for x and the field that records the support count of x is set to 1
  • Otherwise, 1 is added to the support count of x
  • The first time an item y appears with x in the same transaction, an
    edge e_xy is constructed between x and y
  • Every time item y appears with x, the edge weight w_xy is increased
    by 1
  • Item y is called a neighbor of x
  • The nodes that are not frequent are pruned
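
The construction steps above can be sketched in Python as follows. The function name and the dictionary-based representation are hypothetical, the sixth transaction ("af") is made up to show pruning, and the pruning threshold is assumed to be min_supitem.

```python
from collections import Counter
from itertools import combinations

def build_ft_graph(tdb, min_sup_item):
    node_count = Counter()   # support count per item (node)
    edge_weight = Counter()  # co-occurrence count per item pair (edge)
    for transaction in tdb:
        items = sorted(set(transaction))
        node_count.update(items)                    # create the node or add 1
        edge_weight.update(combinations(items, 2))  # create the edge or add 1
    # Prune infrequent nodes together with their edges.
    frequent = {x for x, c in node_count.items() if c >= min_sup_item}
    nodes = {x: c for x, c in node_count.items() if x in frequent}
    edges = {(x, y): w for (x, y), w in edge_weight.items()
             if x in frequent and y in frequent}
    return nodes, edges

tdb = ["cde", "bde", "ade", "abc", "abc", "af"]
nodes, edges = build_ft_graph(tdb, min_sup_item=2)
print(nodes)              # item f (support 1) has been pruned
print(edges[("d", "e")])  # 3
```

Edges are keyed by the alphabetically ordered item pair, so each undirected edge is stored once.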

16
Example
17
Property
  • Lemma 2.1: If the distance between item y and item x in the
    FT-association graph is greater than 2, then a pattern P that
    contains both x and y cannot be a frequent FT-pattern.
  • or
  • The transactions that contain y will never FT-contain P
    => supitemB(P)(y) = 0
18
Property
  • Lemma 2.2: If P is a frequent FT-pattern, then for each item x of P
    there must exist (No Transcript) items in P that are neighbors of x
    in the FT-association graph.
  • max_sup(P) = min{ w_xy | x, y ∈ P }
  • Lemma 2.3: Given a pattern P, the upper bound of supFT(P), denoted
    max_supFT(P), is equal to (No Transcript).
  • If max_supFT(P) < min_supFT, then P cannot be a frequent
    FT-pattern.

19
Proportional FT Frequent pattern Mining
  • Data preprocessing
  • Candidate generation and pruning
  • Checking candidates

20
Data Preprocessing
  • In order to avoid scanning the whole database
    when checking candidates, the original database
    is transformed into a bitmap
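
As an illustration of this step, the sketch below turns a list of transactions into a 0/1 matrix with one row per transaction and one column per item. The slides do not specify the layout, so this orientation and the function name are assumptions.

```python
def to_bitmap(tdb):
    # Columns: all items in sorted order; rows: one per transaction.
    items = sorted({item for transaction in tdb for item in transaction})
    rows = [[1 if item in transaction else 0 for item in items]
            for transaction in tdb]
    return items, rows

tdb = ["cde", "bde", "ade", "abc", "abc"]
items, rows = to_bitmap(tdb)
print(items)  # ['a', 'b', 'c', 'd', 'e']
for row in rows:
    print(row)
```

After this transformation a candidate only needs the columns of its own items, which is what makes the later candidate check cheaper than a full database scan.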

21
Data Preprocessing
  • Construct the FT-association graph
  • The support counts of each item and of each co-appearing item pair
    are calculated while the FT-association graph is constructed
  • Prune the items whose supports are less than min_supitem from both
    the bitmap and the FT-association graph

22
Candidate generation and pruning
  • The data structure of the FT-association graph

23
Checking candidates
  • Extract bitmap(P) for a candidate P
  • Calculate supFT of P and supitemB(P)(i) of each item i of P
  • Let candidate P = abcde; bitmap(P) is shown below
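
The check can be sketched on a row-per-transaction bitmap as follows. The threshold ⌈|P|·δ⌉ is an assumed reading of the proportional fault-tolerance rule, the function name is hypothetical, and the candidate abd is chosen here only as a small example.

```python
import math
from fractions import Fraction

def check_candidate(items, rows, pattern, delta):
    # bitmap(P): only the columns that belong to the candidate's items.
    cols = [items.index(i) for i in pattern]
    need = math.ceil(len(pattern) * Fraction(delta))
    # Rows (transactions) that FT-contain the candidate.
    hits = [row for row in rows if sum(row[c] for c in cols) >= need]
    sup_ft = len(hits)
    # Item support: occurrences of each item within the FT-containing rows.
    sup_item = {i: sum(row[c] for row in hits) for i, c in zip(pattern, cols)}
    return sup_ft, sup_item

items = ["a", "b", "c", "d", "e"]
rows = [
    [0, 0, 1, 1, 1],  # c d e
    [0, 1, 0, 1, 1],  # b d e
    [1, 0, 0, 1, 1],  # a d e
    [1, 1, 1, 0, 0],  # a b c
    [1, 1, 1, 0, 0],  # a b c
]
print(check_candidate(items, rows, "abd", "0.6"))
# (4, {'a': 3, 'b': 3, 'd': 2})
```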

24
Proportional FT-pattern Mining, PFM
25
Fixed FT-pattern Mining, FFM
  • The FT parameter δ is redefined as a fixed number
  • Patterns of different lengths tolerate the same number of faults, δ
  • MinPattern is set to (No Transcript)
  • MaxPattern is no longer necessary because the FT-Apriori heuristic
    is adopted
  • For an item x of FT-pattern P, x must have (No Transcript)
    neighbors

26
Conclusions
  • The presented framework can be used to mine both proportional and
    fixed FT-patterns.
  • The proposed lemmas filter out impossible candidates efficiently
  • Instead of scanning the whole database for the candidates, as
    traditional approaches do, our method checks only a small part of
    the bitmap transformed from the original database

27
Privacy Preserving Data Mining
28
Introduction
  • Why privacy preserving data mining?
  • Data privacy vs. information privacy
  • Data privacy
  • Information privacy

29
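Data Privacy: Randomization Approach Overview
[Figure: original records (Age | Salary | ...), e.g. 50 | 40K | ... and
30 | 70K | ..., each pass through a Randomizer, which outputs randomized
records such as 65 | 20K | ... and 25 | 60K | ... The distributions of Age
and Salary are reconstructed from the randomized records and passed to the
data mining algorithms, which produce the model.]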
30
Reconstruction Problem
  • Original values x1, x2, ..., xn
  • drawn from probability distribution X (unknown)
  • To hide these values, we use y1, y2, ..., yn
  • drawn from probability distribution Y
  • Given
  • x1+y1, x2+y2, ..., xn+yn
  • the probability distribution of Y
  • Estimate the probability distribution of X.

31
Reconstruction Bootstrapping
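  • f_X^{0} := uniform distribution
  • j := 0 // iteration number
  • repeat
  • f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y((x_i + y_i) - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y((x_i + y_i) - z)\, f_X^{j}(z)\, dz}
    (Bayes' rule)
  • j := j + 1
  • until (stopping criterion met)
  • Converges to the maximum likelihood estimate
    (D. Agrawal and C. C. Aggarwal, PODS 2001).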
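
The iteration can be sketched in Python for a discretized setting. This is an illustration under stated assumptions, not the authors' implementation: X takes values on a small finite grid (`support`), the noise distribution is a known probability mass function (`noise_pmf`), the integral becomes a sum, and a fixed number of iterations stands in for the unspecified stopping criterion. All names and the sample data are made up.

```python
def reconstruct(w, support, noise_pmf, iterations=100):
    """Estimate the distribution of X from perturbed values w_i = x_i + y_i."""
    f = {a: 1.0 / len(support) for a in support}  # f_X^0: uniform
    for _ in range(iterations):                   # assumed stopping rule
        new_f = {a: 0.0 for a in support}
        for wi in w:
            # Denominator of Bayes' rule: probability of observing wi
            # under the current estimate.
            denom = sum(noise_pmf.get(wi - z, 0.0) * f[z] for z in support)
            if denom == 0.0:
                continue  # wi cannot be produced from this support; skip it
            for a in support:
                # Posterior probability that x_i = a given wi.
                new_f[a] += noise_pmf.get(wi - a, 0.0) * f[a] / denom
        total = sum(new_f.values())
        # Average the posteriors over the observations that were used.
        f = {a: v / total for a, v in new_f.items()}
    return f

noise = {-1: 0.25, 0: 0.5, 1: 0.25}
observed = [0, 1, 1, 2, 2, 2, 3, 1]
estimate = reconstruct(observed, [0, 1, 2], noise)
print(estimate)
```

With no noise at all (`noise_pmf = {0: 1.0}`) the estimate reduces to the empirical distribution of the observed values, which is a quick sanity check of the update rule.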

32
Information privacy (1)
  • Oliveira and Zaïane proposed 6 algorithms
  • Step 1: Find sensitive transactions
  • Step 2: Choose victim items
  • Step 3: Compute how many sensitive transactions should be changed
  • Step 4: Select victim transactions
  • SWA of Oliveira and Zaïane
  • Almost the same as the others, but each step is applied to K
    transactions, where K is the window size
  • Has the best performance

33
Related Work (cont.)
  • Published pattern sets
  • Forward-Inference Attack

34
Preliminary
  • Represent the TDB as a binary matrix
  • P: the set of frequent patterns
  • PH: frequent patterns with security policies
  • ~PH: frequent patterns without security policies
  • PH ∪ ~PH = P
  • Pair-Subset
  • e.g., if {1, 2, 3} is frequent, then {{1, 2}, {1, 3}, {2, 3}} is the
    Pair-Subset of {1, 2, 3}, and {1, 2} is a pair-subpattern of
    {1, 2, 3}
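
The Pair-Subset of a pattern is simply the set of its 2-item combinations, as the following small sketch shows (the function name is hypothetical):

```python
from itertools import combinations

def pair_subsets(pattern):
    # All 2-item sub-patterns of the given pattern, in sorted order.
    return list(combinations(sorted(pattern), 2))

print(pair_subsets({1, 2, 3}))  # [(1, 2), (1, 3), (2, 3)]
```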

35
Problem Definition
  • Transform D into D' such that PH are hidden, ~PH can still be mined
    in D', and the Forward-Inference Attack is avoided
  • Kernel idea
  • D is multiplied by a sanitization matrix S
  • The problem is thereby reduced to how to define S

36
Matrix Multiplication
  • If Dij = 0, D'ij is set to 0 directly
  • Otherwise, let v = Σk Dik·Skj
  • If v ≥ 1, D'ij is set to 1
  • If v ≤ 0, D'ij is set to 0
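
A minimal sketch of this modified multiplication follows. The comparison operators are missing from the transcript, so the thresholds used here (a row-by-column sum of at least 1 gives 1, a sum of 0 or less gives 0) are an assumption, as are the function name, the example data, and the use of -1 entries in S.

```python
def sanitize(D, S):
    n_items = len(S)
    D_new = []
    for row in D:
        new_row = []
        for j in range(n_items):
            if row[j] == 0:
                new_row.append(0)  # a 0 in D always stays 0
                continue
            # Ordinary matrix product entry: row of D times column j of S.
            v = sum(row[k] * S[k][j] for k in range(n_items))
            new_row.append(1 if v >= 1 else 0)  # assumed thresholds
        D_new.append(new_row)
    return D_new

D = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
# Identity matrix plus S[0][1] = -1, which weakens item 1 wherever item 0 occurs.
S = [[1, -1, 0],
     [0,  1, 0],
     [0,  0, 1]]
print(sanitize(D, S))  # [[1, 0, 0], [1, 0, 1], [0, 1, 1]]
```

Only the first transaction changes: it is the only one containing both item 0 and item 1, so the pair {0, 1} no longer appears in the sanitized data.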

37
Matrix Observation
  • Setting of -1
  • If Sij is set to -1, then for each row t in which Dti and Dtj are
    both equal to 1, D'tj will become 0

38
Matrix Observation (cont.)
  • Setting of 1
  • Setting Sij to 1 keeps the relation between item i and item j by
    enhancing the strength of item j

39
Sanitization Process
40
Marked-Set Generation
  • 1. Put the patterns of length 2 in PH into Marked-Set directly
  • 2. for all remaining P in PH do
    • if (P has no Pair-Subsets included in Marked-Set)
      • Generate k groups, k = # of all Pair-Subsets of P
      • The class label of each group is one of the Pair-Subsets of P
      • P is stored in each group
  • 3. Merge the groups with the same class label

41
Marked-Set Generation (cont.)
  • 4. for all P in ~PH do
    • Generate all of their Pair-Subsets
    • Count the frequencies of all Pair-Subsets
  • 5. for all groups do
    • If the class label of the group ≠ any Pair-Subset generated in
      Step 4
      • The frequency of the group = 0
    • If the class label of the group = one Pair-Subset generated in
      Step 4
      • The frequency of the group = the frequency of the Pair-Subset

42
Marked-Set Generation (cont.)
  • 6. Sort the groups by frequency in increasing order
  • 7. for (i = 1 to number of groups - 1)
    • for (j = i + 1 to number of groups)
      • Compare the groups pairwise: Gi, Gj
      • for all overlaps in Gi ∩ Gj do
        • if the size of Gi ≠ the size of Gj
          • Remove the overlap from the smaller group
        • else
          • if the frequencies of Gi and Gj differ
            • Remove the overlap from the larger one
          • else
            • Remove the overlap from a group chosen randomly
  • 8. for all groups do
    • if the number of patterns stored in the group > 0
      • Put the class label into Marked-Set

43
An overall example
44
(No Transcript)
45
(No Transcript)
46
Sanitization Matrix Setting
  • 1. Sii = 1
  • 2. for all {i, j} in Marked-Set do
    • if (# of i in PH < # of j in PH)
      • Sji = -1
    • else if (# of i in PH > # of j in PH)
      • Sij = -1
    • else
      • if (# of i in Marked-Set > # of j in Marked-Set)
        • Sji = -1
      • else if (# of i in Marked-Set < # of j in Marked-Set)
        • Sij = -1
      • else
        • Sji = -1 or Sij = -1, chosen randomly

47
Sanitization Matrix Setting (cont.)
  • 3. for all {i, j} in (large2 - Marked-Set) do
    • Set Sij = 1, Sji = 1
  • 4. Sij = 0, otherwise
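
The four rules can be put together as in the sketch below. It assumes that the entries set for Marked-Set pairs are -1 (the signs are lost in the transcript, but -1 is the value that hides a pair) and that the two occurrence counts are supplied by the caller as plain dictionaries, since the transcript does not say how they are computed. The function and parameter names are hypothetical.

```python
import random

def build_sanitization_matrix(n_items, marked_set, keep_pairs,
                              count_in_ph, count_in_marked, rng):
    # Rule 1 and rule 4: ones on the diagonal, zeros elsewhere.
    S = [[1 if r == c else 0 for c in range(n_items)] for r in range(n_items)]
    # Rule 2: one entry per Marked-Set pair is set to -1 (assumed sign).
    for i, j in marked_set:
        if count_in_ph[i] < count_in_ph[j]:
            S[j][i] = -1
        elif count_in_ph[i] > count_in_ph[j]:
            S[i][j] = -1
        elif count_in_marked[i] > count_in_marked[j]:
            S[j][i] = -1
        elif count_in_marked[i] < count_in_marked[j]:
            S[i][j] = -1
        elif rng.random() < 0.5:  # full tie: pick a direction at random
            S[j][i] = -1
        else:
            S[i][j] = -1
    # Rule 3: pairs in large2 - Marked-Set are reinforced in both directions.
    for i, j in keep_pairs:
        S[i][j] = 1
        S[j][i] = 1
    return S

S = build_sanitization_matrix(
    n_items=3,
    marked_set=[(0, 1)],
    keep_pairs=[(1, 2)],
    count_in_ph={0: 1, 1: 2, 2: 2},
    count_in_marked={0: 1, 1: 1, 2: 0},
    rng=random.Random(0),
)
print(S)  # [[1, 0, 0], [-1, 1, 1], [0, 1, 1]]
```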

48
(No Transcript)
49
Probability Policies
  • Distortion probability ρ
  • Used when there is only one 1 in column j
  • D'ij then has probability ρj of being set to 1 and probability
    1 - ρj of being set to 0

50
Probability Policies (cont.)
  • Lemma 1: Given a minimum support s and a level of confidence c, let
    {i, j} be a pattern in Marked-Set, let nij be the support count of
    {i, j}, and let ρ be the distortion probability of column j.
    W.L.O.G., we assume that Sij = -1. If ρ satisfies
  • (No Transcript) and (No Transcript)
  • where |D| is the number of transactions in D, we can say that we
    are c confident that {i, j} isn't frequent in D'

51
Probability Policies (cont.)
  • Conformity probability μ
  • Used when column j of S contains at least two 1s; works if
    (No Transcript), and at least one 1 in column j is multiplied by a
    1 in D
  • D'ij is set to 1 with probability μ and to 0 with probability 1 - μ

52
Probability Policies (cont.)
  • Lemma 2: Given a minimum support s and a level of confidence c, let
    {i, j} be a pattern in Marked-Set, let {k, j} be a pattern that
    belongs to large2 - Marked-Set, and let nikj be the support count
    of {i, k, j}. W.L.O.G., we assume that Sij = -1. μ is the
    conformity probability of column j. If μ is set according to the
    following rule, (No Transcript)
  • we can say that we are c confident that {i, j} isn't frequent in
    D'.

53
Conclusion
  • A probability-based approach to the sensitive knowledge problem is
    proposed
  • Under some conditions, the miss cost and the dissimilarity are
    slightly higher than those of SWA, but overall the approach
    performs better than SWA and does not suffer from the
    Forward-Inference Attack

54
Reference
  • [LCC04] Guanling Lee, Chien-Yu Chang, and Arbee L. P. Chen. Hiding
    sensitive patterns in association rules mining. In Proc. of the
    28th Annual International Computer Software and Applications
    Conference (COMPSAC 2004).
  • [OZ02] S. R. M. Oliveira and O. R. Zaïane. Privacy Preserving
    Frequent Itemset Mining. In Proc. of the IEEE ICDM Workshop on
    Privacy, Security, and Data Mining, Japan, December 2002.
  • [OZ03a] S. R. M. Oliveira and O. R. Zaïane. Algorithms for
    Balancing Privacy and Knowledge Discovery in Association Rule
    Mining. In Proc. of the 7th International Database Engineering and
    Applications Symposium (IDEAS03), Hong Kong, China, July 2003.
  • [OZ03b] S. R. M. Oliveira and O. R. Zaïane. Protecting Sensitive
    Knowledge By Data Sanitization. In Proc. of the 3rd IEEE
    International Conference on Data Mining (ICDM03).
  • [OZS04] S. R. M. Oliveira, O. R. Zaïane, and Yücel Saygin. Secure
    Association Rule Sharing. In Proc. of the 8th Pacific-Asia
    Conference on Knowledge Discovery and Data Mining (PAKDD-04), 2004.
  • [VAE04] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin,
    and E. Dasseni. Association rule hiding. IEEE Transactions on
    Knowledge and Data Engineering, Vol. 16, No. 4, April 2004.