Research issues on association rule mining

1
Research issues on association rule mining
  • Loo Kin Kong
  • 26th February, 2003

2
Plan
  • Recent trends on data mining
  • Association rule interestingness
  • Association rule mining on data streams
  • Research directions
  • Conclusion

3
Association rules
  • First proposed in [Agrawal et al. 94]
  • Given a database D of transactions, which
    contains only binary attributes
  • For an itemset x, the support of x is defined
    as supp(x) = the fraction of transactions in D containing x
  • An association rule is in the form I ⇒ J, where
  • I ∩ J = ∅
  • supp(I ∪ J) ≥ minsupp (the minimum support threshold)
  • supp(I ∪ J) / supp(I) ≥ minconf (the minimum confidence threshold)
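
The definitions above can be made concrete with a small Python sketch. The toy database, the function names, and the threshold values are illustrative assumptions and do not come from the slides; items are single characters purely for brevity.

```python
from typing import FrozenSet, Iterable, List

# Toy database, made up for illustration: each transaction is a set of items.
TOY_DB: List[FrozenSet[str]] = [frozenset(t) for t in ("abc", "ab", "ac", "b", "abc")]


def count(itemset: Iterable[str], database: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in database if items <= t)


def supp(itemset: Iterable[str], database: List[FrozenSet[str]]) -> float:
    """Support: the fraction of transactions containing `itemset`."""
    return count(itemset, database) / len(database)


def conf(lhs: Iterable[str], rhs: Iterable[str], database: List[FrozenSet[str]]) -> float:
    """Confidence supp(I ∪ J) / supp(I), computed from raw counts."""
    return count(frozenset(lhs) | frozenset(rhs), database) / count(lhs, database)


def is_association_rule(lhs, rhs, database, min_supp: float, min_conf: float) -> bool:
    """Check the three conditions: disjoint sides, enough support, enough confidence."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    if lhs & rhs or count(lhs, database) == 0:
        return False
    return supp(lhs | rhs, database) >= min_supp and conf(lhs, rhs, database) >= min_conf
```

With this toy data, 3 of the 5 transactions contain both a and b, and 4 contain a, so a ⇒ b has support 3/5 and confidence 3/4.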

4
Recent trends on association rule mining
  • Association rule interestingness
  • Association rule mining on data streams
  • Privacy preservation [Rizvi et al. 02]
  • New data structures to improve the efficiency of
    finding frequent itemsets [Relue et al. 01]

5
Association rule interestingness overview
  • Problem with association rule mining
  • Too many rules mined
  • Mined rules may include redundant or trivial
    rules
  • Subjective approaches aim at
  • Minimizing human effort involved
  • Objective approaches aim at
  • Filtering out uninteresting rules based on some
    predefined interestingness measure

6
Subjective approaches
  • Rule templates [Klemettinen et al. 94]
  • A rule template specifies which attributes may
    occur in the LHS and RHS of a rule
  • e.g., any rule in the form (any number of
    conditions) ⇒ is uninteresting
  • By elimination [Sahar 99]
  • For a rule r = A ⇒ B, r′ = a ⇒ b is an ancestor
    rule if a ∈ A and b ∈ B. r′ is said to cover r.
  • An ancestor rule can be classified as one of the
    following
  • True-Not-Interesting (TNI)
  • Not-True-Interesting (NTI)
  • Not-True-Not-Interesting (NTNI)
  • True-Interesting (TI)
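
As a minimal sketch of the ancestor-rule idea, the Python below enumerates the single-attribute rules a ⇒ b that cover a given rule A ⇒ B. It assumes that a and b are single attributes drawn from A and B respectively; the function names and the (lhs, rhs) tuple representation are hypothetical.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple


def ancestor_rules(lhs: Iterable[str], rhs: Iterable[str]) -> List[Tuple[str, str]]:
    """All single-attribute rules (a, b) with a taken from lhs and b from rhs."""
    return list(product(sorted(set(lhs)), sorted(set(rhs))))


def covers(ancestor: Tuple[str, str], rule: Tuple[Set[str], Set[str]]) -> bool:
    """True if the ancestor rule a => b covers the rule A => B."""
    (a, b), (lhs, rhs) = ancestor, rule
    return a in lhs and b in rhs
```

Classifying one ancestor rule (for example as not interesting) then lets a user dismiss every rule it covers in one step, which is where the saving in human effort would come from.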

7
Objective approaches
  • Statistical / problem-specific measures
  • Entropy gain, lift, ...
  • Pruning redundant rules by the maximum entropy
    principle [Jaroszewicz 02]

8
Probability
  • A finite probability space is a pair (S, P), in
    which
  • S is a finite non-empty set
  • P is a mapping P: S → [0,1], satisfying
    Σ_{s∈S} P(s) = 1
  • Each s ∈ S is called an event
  • P(s), also denoted by p_s, is the probability of
    the event s
  • The self-information of s is defined as
    I(s) = −log P(s)

9
Entropy
  • A partition U is a collection of mutually
    exclusive elements whose union equals S
  • Each element contains one or more events
  • The measure of uncertainty that any event of a
    partition U would occur is called the entropy of
    the partitioning U:
    H(U) = −p1 log p1 − p2 log p2 − ... − pN log pN
  • Where p1, ... , pN are respectively the
    probabilities of events a1, ... , aN of U
  • H(U) is maximum if p1 = p2 = ... = pN = 1/N
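
A short Python sketch of self-information and entropy, useful for checking that the uniform distribution gives the largest entropy. The slides do not state the base of the logarithm; the natural logarithm is assumed here, and the function names are illustrative.

```python
import math
from typing import Sequence


def self_information(p: float) -> float:
    """I(s) = -log P(s); natural logarithm assumed."""
    return -math.log(p)


def entropy(probs: Sequence[float]) -> float:
    """H(U) = -sum(p_i * log p_i); terms with p_i = 0 contribute nothing."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For N = 4, `entropy([0.25] * 4)` equals log 4, while a skewed distribution such as `[0.7, 0.1, 0.1, 0.1]` gives a smaller value.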

10
The maximum entropy method (MEM)
  • The MEM determines the probabilities pi of the
    events in a partition U, subject to various given
    constraints.
  • By MEM, when some of the pi's are unknown, they
    must be chosen to maximize the entropy of U,
    subject to the given constraints.
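
A small worked example of the MEM, as a sketch under stated assumptions: two items A and B whose supports are known, and an unknown probability x that both occur together. The four-cell partition, the grid search, and all names are choices made for this illustration and are not taken from the slides.

```python
import math


def entropy(probs):
    """H = -sum(p * log p), skipping cells with zero (or rounding-negative) mass."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def max_entropy_joint(p_a: float, p_b: float, steps: int = 10_000) -> float:
    """Grid-search the value x = P(A and B) that maximizes the entropy of the
    partition {AB, A only, B only, neither}, subject to P(A) = p_a and P(B) = p_b."""
    lo = max(0.0, p_a + p_b - 1.0)   # smallest feasible overlap
    hi = min(p_a, p_b)               # largest feasible overlap
    best_x, best_h = lo, float("-inf")
    for k in range(steps + 1):
        x = lo + (hi - lo) * k / steps
        cells = (x, p_a - x, p_b - x, 1.0 - p_a - p_b + x)
        h = entropy(cells)
        if h > best_h:
            best_x, best_h = x, h
    return best_x
```

Setting the derivative of the entropy with respect to x to zero gives x = p_a · p_b, so with only the two supports as constraints the MEM picks the independent joint distribution; the grid search should land close to that value.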

11
Definitions
  • A constraint C is a pair C = (I, p), where
  • I is an itemset
  • p ∈ [0,1] is the probability of I occurring in a
    transaction
  • The set of constraints generated by an
    association rule I ⇒ J is defined as C(I ⇒ J) =
    {(I, supp(I)), (I ∪ J, supp(I ∪ J))}
  • A rule K ⇒ J is a sub-rule of I ⇒ J if K ⊂ I
12
I-nonredundancy
  • A rule I ⇒ J is considered I-nonredundant with
    respect to R, where R is a set of association
    rules, if
  • I = ∅, or
  • I(C_{I,J}(R), I ⇒ J) is larger than some threshold,
    where I() is either I_act() or I_pass(), and
    C_{I,J}(R) is the set of constraints induced by all
    sub-rules of I ⇒ J in R

13
Pruning redundant association rules
  • Input: a set R of association rules
  1. For each singleton Ai in the database
  2.     Ri = {∅ ⇒ Ai}
  3.     k = 1
  4.     For each rule I ⇒ Ai ∈ R, |I| = k, do
  5.         If I ⇒ Ai is I-nonredundant w.r.t. Ri then
  6.             Ri = Ri ∪ {I ⇒ Ai}
  7.     k = k + 1
  8.     Goto 4
  9. R = ∪ Ri
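
The control flow of the pruning procedure can be sketched in Python as follows. The redundancy test itself is passed in as a function, because the MEM-based interestingness measure is not implemented here; `toy_is_nonredundant` is a made-up stand-in used only to exercise the loop, and the rule representation (antecedent set, single consequent item) is an assumption.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Rule = Tuple[FrozenSet[str], str]  # (antecedent I, single-item consequent Ai)


def prune_redundant(
    items: Iterable[str],
    rules: Iterable[Tuple[Iterable[str], str]],
    is_nonredundant: Callable[[Rule, List[Rule]], bool],
) -> List[Rule]:
    """For each item Ai, grow Ri from {∅ ⇒ Ai} by visiting the rules I ⇒ Ai
    in order of increasing |I| and keeping those judged nonredundant w.r.t. Ri."""
    all_rules = [(frozenset(i), a) for i, a in rules]
    kept: List[Rule] = []
    for item in items:
        r_i: List[Rule] = [(frozenset(), item)]
        candidates = [r for r in all_rules if r[1] == item and r[0]]
        # Sorting by antecedent size plays the role of the counter k.
        for rule in sorted(candidates, key=lambda r: (len(r[0]), sorted(r[0]))):
            if is_nonredundant(rule, r_i):
                r_i.append(rule)
        kept.extend(r_i)
    return kept


def toy_is_nonredundant(rule: Rule, kept: List[Rule]) -> bool:
    """Stand-in predicate, NOT the MEM-based measure: a rule counts as redundant
    if a kept rule with a non-empty, strictly smaller antecedent already exists."""
    return not any(k[0] and k[0] < rule[0] for k in kept)
```

Processing antecedents from small to large matters because a rule is judged against its sub-rules that were already kept.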

14
Association rule interestingness: let's face it...
  • "Interesting" is a subjective sense...
  • Domain knowledge is needed at some stage to
    determine what is interesting
  • ... in fact, one may argue that there does not
    exist a truly objective interestingness
    measure...
  • It is because we try to model what is interesting
  • ... but objective interestingness measures are
    still worth studying
  • Can act as a filter before any human intervention
    is required

15
Interesting or uninteresting?
  • Consider the association rule r: I ⇒ J,
    supp(r) = 1%, conf(r) = 100%
  • A question
  • Do you think r is interesting or
    uninteresting?
  • Considering the support and/or confidence of one
    single rule may not be enough to determine
    whether a rule is interesting or not
  • So we try to compare a rule with some other
    rule(s)

16
Observation: comparing a family of rules
  • For a maximal frequent itemset I
  • The set of rules I′ ⇒ i, where i ∈ I and I′ ⊆
    I \ {i}, forms a family of rules
  • For example, for the maximal frequent itemset
    abcde, the rules
    abcd ⇒ e, conf = supp(abcde)/supp(abcd)
    abc ⇒ e, conf = supp(abce)/supp(abc)
    abd ⇒ e, conf = supp(abde)/supp(abd)
    ... are in a family
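
A family of rules can be enumerated mechanically. The sketch below assumes that the family for item i contains every rule I′ ⇒ i with I′ a subset of I \ {i}, including the empty antecedent; the function names and the list-of-sets database are illustrative.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Optional, Tuple


def family_of_rules(maximal_itemset: Iterable[str], item: str) -> List[Tuple[FrozenSet[str], str]]:
    """All rules (I', item) with I' a subset of the itemset minus `item`,
    listed from the largest antecedent down to the empty one."""
    rest = sorted(set(maximal_itemset) - {item})
    return [
        (frozenset(lhs), item)
        for size in range(len(rest), -1, -1)
        for lhs in combinations(rest, size)
    ]


def confidence(lhs: FrozenSet[str], item: str, database: List[FrozenSet[str]]) -> Optional[float]:
    """conf(I' => item) from raw counts; None if no transaction contains I'."""
    n_lhs = sum(1 for t in database if lhs <= t)
    n_both = sum(1 for t in database if lhs <= t and item in t)
    return n_both / n_lhs if n_lhs else None
```

For the itemset abcde and the item e this yields 16 rules, one for each subset of abcd; their confidences can then be compared against each other.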

17
[Figure: the lattice of all subsets of abcde, with abcde at the top, then the five 4-itemsets, the ten 3-itemsets, the ten 2-itemsets and the five single items, and ∅ at the bottom]

18
Observation: comparing a family of rules (contd)
  • The blue half of the lattice is obtained by
    appending the item e to each node in the orange
    half
  • The family of rules captures how the item e
    affects the support of the orange half of the
    lattice
  • Idea
  • We may compare confidences of rules in a family
    to find any unusually high or low confidences
  • We can use some statistical tests to perform the
    comparison; no need for complicated statistical
    models (e.g., MEM)

19
Association rule mining on data streams
  • In some new applications, data come as a
    continuous stream
  • The sheer volume of a stream over its lifetime is
    huge
  • Queries require timely answers
  • Examples
  • Stock ticks
  • Network traffic measurements
  • A method for finding approximate frequency counts
    on data streams is proposed in [Manku et al. 02]

20
Goals of the paper
  • The algorithm ensures that
  • All itemsets whose true frequency exceeds sN are
    reported (i.e., no false negatives)
  • No itemset whose true frequency is less than
    (s − ε)N is output
  • Estimated frequencies are less than the true
    frequencies by at most εN
  • Some notations
  • Let N denote the current length of the stream
  • Let s ∈ (0,1) denote the support threshold
  • Let ε ∈ (0,1) denote the error tolerance

21
The simple case: finding frequent items
  • Each transaction in the stream contains only 1
    item
  • Two algorithms were proposed, namely
  • Sticky Sampling Algorithm
  • Lossy Counting Algorithm
  • Features of the algorithms
  • Sampling techniques are used
  • Frequency counts found are approximate but error
    is guaranteed not to exceed a user-specified
    tolerance level
  • For Lossy Counting, all frequent items are
    reported

22
Lossy Counting Algorithm
  • The incoming data stream is conceptually divided into
    buckets of ⌈1/ε⌉ transactions
  • Counts are kept in a data structure D
  • Each entry in D is in the form (e, f, Δ), where
  • e is the item
  • f is the frequency of e in the stream since the
    entry was inserted in D
  • Δ is the maximum count of e in the stream before
    e was added to D

23
Lossy Counting Algorithm (contd)
  1. D ← ∅; N ← 0
  2. w ← ⌈1/ε⌉; b ← 1
  3. e ← next transaction; N ← N + 1
  4. if (e, f, Δ) exists in D do
  5.     f ← f + 1
  6. else do
  7.     insert (e, 1, b − 1) to D
  8. endif
  9. if N mod w = 0 do
  10.     prune(D, b); b ← b + 1
  11. endif
  12. Goto 3
  • function prune(D, b)
        for each entry (e, f, Δ) in D do
            if f + Δ ≤ b do
                remove the entry from D
            endif
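
The pseudocode above translates almost line for line into Python. This is a minimal sketch for the single-item case: it consumes a finite iterable instead of looping forever, stores each entry as a two-element list [f, Δ] in a dict, and uses illustrative names throughout. The reporting rule in `frequent_items` (output entries with f ≥ (s − ε)N) follows the description in Manku and Motwani's paper and does not appear on these slides.

```python
import math
from typing import Dict, Hashable, Iterable, List, Tuple


def lossy_count(stream: Iterable[Hashable], epsilon: float) -> Tuple[Dict[Hashable, List[int]], int]:
    """Run Lossy Counting over `stream`; returns (D, N) with D[e] = [f, delta]."""
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must lie in (0, 1)")
    w = math.ceil(1 / epsilon)      # bucket width
    d: Dict[Hashable, List[int]] = {}
    n = 0                           # stream length so far
    b = 1                           # id of the current bucket
    for e in stream:
        n += 1
        if e in d:
            d[e][0] += 1
        else:
            d[e] = [1, b - 1]
        if n % w == 0:              # bucket boundary: prune, then move on
            for key in [k for k, (f, delta) in d.items() if f + delta <= b]:
                del d[key]
            b += 1
    return d, n


def frequent_items(d: Dict[Hashable, List[int]], n: int, s: float, epsilon: float) -> List[Hashable]:
    """Items whose stored count f is at least (s - epsilon) * N."""
    return sorted(e for e, (f, _) in d.items() if f >= (s - epsilon) * n)
```

On the ten-item stream `"aabacabada"` with ε = 0.2 (bucket width 5), the rare items b, c and d are pruned at the two bucket boundaries and only a survives, with its exact count of 6.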

24
Lossy Counting
  • Lossy Counting guarantees that
  • When deletion occurs, b ≤ εN
  • If an entry (e, f, Δ) is deleted, fe ≤ b, where fe
    is the actual frequency count of e
  • Hence, if an entry (e, f, Δ) is deleted, fe ≤ εN
  • Finally, f ≤ fe ≤ f + εN
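
The first guarantee follows from the bucket width alone. A short derivation, assuming (as in the algorithm) that deletions only happen at bucket boundaries, where N = b·w, and using the property from the original paper that an entry in D satisfies f_e ≤ f + Δ:

```latex
\begin{aligned}
w &= \lceil 1/\varepsilon \rceil \;\ge\; 1/\varepsilon,
  \qquad N = b\,w \ \text{at the end of bucket } b \\
\Rightarrow\quad b &= N / w \;\le\; \varepsilon N \\
f_e &\le f + \Delta \;\le\; b \;\le\; \varepsilon N
  \qquad \text{for an entry } (e, f, \Delta) \text{ deleted at that boundary}
\end{aligned}
```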

25
The more complex case: finding frequent itemsets
  • The Lossy Counting algorithm is extended to find
    frequent itemsets
  • Transactions in the data stream contain any
    number of items
  • Essentially the same as the case for single
    items, except
  • Multiple buckets (β of them, say) are processed in
    a batch
  • Each entry in D is in the form (set, f, Δ)
  • Transactions read in are (wisely) expanded to their
    subsets

26
Association rule mining on data streams: food for thought
  • Challenges to mine from data streams
  • Fast update
  • Data are usually not permanently stored (but may
    be buffered)
  • Fast response for queries
  • Minimized resources (e.g. number of counts kept)
  • Possible interesting problems concerning
    association rule mining on data streams
  • More efficient/accurate algorithms for finding
    association rules on data streams
  • Change mining in frequency counts

27
The lattice structure
  • A bottleneck in the algorithm proposed in [Manku
    et al. 02] is that it needs to expand a
    transaction to its subsets for counting
  • For example, for a transaction abcde, we may
    need to count the itemsets a, b, c, d,
    e, ab, ac...
  • Hence updates are expensive (although queries can
    be fast)
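
To see why updates are expensive, the sketch below enumerates the non-empty subsets of a transaction. This is naive enumeration only; it does not reproduce the pruning that the original algorithm applies when expanding transactions, and the optional `max_size` cap is a hypothetical parameter added for illustration.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Optional


def expand_to_subsets(transaction: Iterable[str], max_size: Optional[int] = None) -> List[FrozenSet[str]]:
    """All non-empty subsets of `transaction`, optionally capped at `max_size` items."""
    items = sorted(set(transaction))
    limit = len(items) if max_size is None else min(max_size, len(items))
    return [
        frozenset(combo)
        for size in range(1, limit + 1)
        for combo in combinations(items, size)
    ]
```

A transaction with n distinct items has 2^n − 1 non-empty subsets, so abcde already expands to 31 itemsets, and capping the size at 2 still leaves 15.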

28
The lattice structure (contd)
[Figure: the lattice of all subsets of abcde, with abcde at the top, then the five 4-itemsets, the ten 3-itemsets, the ten 2-itemsets and the five single items, and ∅ at the bottom]

29
Conclusion
  • Both association rule interestingness and mining
    on data streams are challenging problems
  • Research on rule interestingness can make
    association rule mining a more efficient tool for
    knowledge discovery
  • Association rule mining on data streams is an
    upcoming application and a promising direction
    for research

30
References
  • [Agrawal et al. 94] R. Agrawal and R. Srikant.
    Fast Algorithms for Mining Association Rules.
    VLDB'94.
  • [Jaroszewicz 02] S. Jaroszewicz and D. A.
    Simovici. Pruning Redundant Association Rules
    Using Maximum Entropy Principle. PAKDD'02.
  • [Klemettinen et al. 94] Mika Klemettinen et al.
    Finding Interesting Rules from Large Sets of
    Discovered Association Rules. CIKM'94.
  • [Manku et al. 02] G. S. Manku and R. Motwani.
    Approximate Frequency Counts over Data Streams.
    VLDB'02.
  • [Relue et al. 01] R. Relue, X. Wu and H. Huang.
    Efficient Runtime Generation of Association
    Rules. CIKM'01.
  • [Rizvi et al. 02] S. J. Rizvi and J. R. Haritsa.
    Maintaining Data Privacy in Association Rule
    Mining. VLDB'02.
  • [Sahar 99] Sigal Sahar. Interestingness Via What
    Is Not Interesting. KDD'99.

31
Q & A