Title: Research issues on association rule mining
1. Research issues on association rule mining
- Loo Kin Kong
- 26th February, 2003
2. Plan
- Recent trends on data mining
- Association rule interestingness
- Association rule mining on data streams
- Research directions
- Conclusion
3. Association rules
- First proposed in [Agrawal et al. 94]
- Given a database D of transactions, which contains only binary attributes
- For an itemset x, the support of x is defined as supp(x) = fraction of transactions in D containing x
- An association rule is of the form I → J (illustrated in the sketch below), where
- I ∩ J = ∅
- supp(I ∪ J) ≥ σsupp
- supp(I ∪ J) / supp(I) ≥ σconf
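As a quick illustration of these definitions, here is a minimal Python sketch (ours, not from the slides; all names are our own) that computes support and confidence over a toy transaction database:

```python
def supp(itemset, db):
    """Support of an itemset: the fraction of transactions in db containing it."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in db) / len(db)

def conf(I, J, db):
    """Confidence of the rule I -> J: supp(I ∪ J) / supp(I)."""
    return supp(set(I) | set(J), db) / supp(I, db)

db = [frozenset(t) for t in ("abc", "abd", "ab", "cd")]
print(supp("ab", db))      # 0.75: ab appears in 3 of 4 transactions
print(conf("a", "b", db))  # 1.0: every transaction containing a also contains b
```

The rule a → b would then be reported if 0.75 ≥ σsupp and 1.0 ≥ σconf.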
4. Recent trends on association rule mining
- Association rule interestingness
- Association rule mining on data streams
- Privacy-preserving mining [Rizvi et al. 02]
- New data structures to improve the efficiency of finding frequent itemsets [Relue et al. 01]
5. Association rule interestingness: overview
- Problems with association rule mining:
- Too many rules mined
- Mined rules may contain redundant or trivial rules
- Subjective approaches aim at minimizing the human effort involved
- Objective approaches aim at filtering out uninteresting rules, based on some predefined interestingness measure
6. Subjective approaches
- Rule templates [Klemettinen et al. 94]
- A rule template specifies what attributes may occur in the LHS and RHS of a rule
- e.g., any rule of the form (any number of conditions) → (a given attribute) is uninteresting
- By elimination [Sahar 99]
- For a rule r: A → B, r': a → b is an ancestor rule if a ⊆ A and b ⊆ B; r is said to cover r'
- An ancestor rule can be classified as one of the following:
- True-Not-Interesting (TNI)
- Not-True-Interesting (NTI)
- Not-True-Not-Interesting (NTNI)
- True-Interesting (TI)
7. Objective approaches
- Statistical / problem-specific measures
- Entropy gain, lift, ...
- Pruning redundant rules by the maximum entropy principle [Jaroszewicz 02]
8. Probability
- A finite probability space is a pair (S, P), in which
- S is a finite non-empty set
- P is a mapping P: S → [0, 1] satisfying Σs∈S P(s) = 1
- Each s ∈ S is called an event
- P(s), also denoted by ps, is the probability of the event s
- The self-information of s is defined as I(s) = −log P(s)
9. Entropy
- A partition U is a collection of mutually exclusive elements whose union equals S
- Each element contains one or more events
- The measure of uncertainty that any event of a partition U would occur is called the entropy of the partition U: H(U) = −(p1 log p1 + p2 log p2 + ... + pN log pN)
- where p1, ..., pN are respectively the probabilities of the elements a1, ..., aN of U
- H(U) is maximal when p1 = p2 = ... = pN = 1/N (see the sketch below)
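A small worked example (ours) of the entropy formula above:

```python
from math import log2

def entropy(ps):
    """H(U) = -(p1 log p1 + ... + pN log pN); 0 log 0 is taken as 0."""
    return -sum(p * log2(p) for p in ps if p > 0)

print(entropy([0.25] * 4))            # 2.0 bits: uniform, the maximum for N = 4
print(entropy([0.7, 0.1, 0.1, 0.1]))  # ~1.36 bits: a skewed partition is less uncertain
```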
10. The maximum entropy method (MEM)
- The MEM determines the probabilities pi of the elements in a partition U, subject to various given constraints.
- By the MEM, when some of the pi are unknown, they must be chosen to maximize the entropy of U, subject to the given constraints (see the sketch below).
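A minimal sketch (ours) of the MEM for the simplest case: when some probabilities of a partition are fixed and the remaining ones are constrained only to sum to the leftover mass, entropy is maximized by spreading that mass uniformly:

```python
def max_entropy_fill(known, n_unknown):
    """Assign the unknown probabilities of a partition by the maximum
    entropy principle: with no further constraints, the leftover
    probability mass is split evenly among the unknown elements."""
    leftover = 1.0 - sum(known)
    return list(known) + [leftover / n_unknown] * n_unknown

# Two probabilities are given as constraints; MEM spreads the remaining 0.5 uniformly.
print(max_entropy_fill([0.4, 0.1], 2))  # [0.4, 0.1, 0.25, 0.25]
```

More general constraints, such as those induced by the association rules defined next, require numerical optimization; the uniform fill covers only the no-extra-constraints case.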
11. Definitions
- A constraint C is a pair C = (I, p), where
- I is an itemset
- p ∈ [0, 1] is the probability of I occurring in a transaction
- The set of constraints generated by an association rule I → J is defined as C(I → J) = {(I, supp(I)), (I ∪ J, supp(I ∪ J))} (see the sketch below)
- A rule K → J is a sub-rule of I → J if K ⊆ I
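A tiny sketch (ours) of the constraint set C(I → J), assuming supports are available in a hypothetical dict keyed by frozensets:

```python
def rule_constraints(I, J, supp):
    """C(I -> J) = {(I, supp(I)), (I ∪ J, supp(I ∪ J))}, per slide 11."""
    I, J = frozenset(I), frozenset(J)
    return {(I, supp[I]), (I | J, supp[I | J])}

supp = {frozenset("ab"): 0.4, frozenset("abc"): 0.3}
print(rule_constraints("ab", "c", supp))  # {({a,b}, 0.4), ({a,b,c}, 0.3)}
```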
12. I-nonredundancy
- A rule I → J is considered I-nonredundant with respect to R, where R is a set of association rules, if
- I = ∅, or
- I(CI,J(R), I → J) is larger than some threshold, where I(·) is either Iact(·) or Ipass(·), and CI,J(R) is the set of constraints induced by all sub-rules of I → J in R
13. Pruning redundant association rules
- Input: a set R of association rules
1. For each singleton Ai in the database:
2. Ri ← {∅ → Ai}
3. k ← 1
4. For each rule I → Ai ∈ R with |I| = k, do
5. If I → Ai is I-nonredundant w.r.t. Ri, then Ri ← Ri ∪ {I → Ai}
6. k ← k + 1
7. Goto 4
8. R ← ∪i Ri
(A Python sketch of this loop follows.)
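A sketch (ours) of the loop above; `is_nonredundant` is a hypothetical placeholder for the I-nonredundancy test of slide 12, which depends on the MEM machinery of [Jaroszewicz 02]:

```python
def prune_rules(rules_by_consequent, is_nonredundant):
    """Prune redundant rules, following slide 13. `rules_by_consequent`
    maps each singleton Ai to a list of antecedents I of rules I -> Ai;
    visiting antecedents in order of increasing size plays the role of
    the k = 1, 2, ... sweep. `is_nonredundant(I, Ai, kept)` stands in
    for the test I(C_I,Ai(kept), I -> Ai) > threshold."""
    result = set()
    for Ai, antecedents in rules_by_consequent.items():
        kept = {(frozenset(), Ai)}            # Ri starts as {∅ -> Ai}
        for I in sorted(map(frozenset, antecedents), key=len):
            if is_nonredundant(I, Ai, kept):  # keep only nonredundant rules
                kept.add((I, Ai))
        result |= kept                        # R = union of the Ri
    return result
```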
14. Association rule interestingness: let's face it...
- "Interesting" is a subjective sense...
- Domain knowledge is needed at some stage to determine what is interesting
- ...in fact, one may argue that no truly objective interestingness measure exists...
- This is because we try to model what is interesting
- ...but objective interestingness measures are still worth studying
- They can act as a filter before any human intervention is required
15. Interesting or uninteresting?
- Consider the association rule r: I → J with supp(r) = 1% and conf(r) = 100%
- A question: is r interesting or uninteresting?
- Considering the support and/or confidence of one single rule may not be enough to determine whether the rule is interesting
- So we try to compare a rule with some other rule(s)
16. Observation: comparing a family of rules
- For a maximal frequent itemset I:
- The set of rules I′ → i, where i ∈ I and I′ ⊆ I \ {i}, forms a family of rules
- For example, for the maximal frequent itemset abcde, the rules
  abcd → e (conf = supp(abcde) / supp(abcd))
  abc → e (conf = supp(abce) / supp(abc))
  abd → e (conf = supp(abde) / supp(abd))
  ... are in a family (see the sketch below)
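A sketch (ours) that enumerates a family and computes its confidences, assuming a hypothetical dict `supp` from frozenset itemsets to supports (containing every needed subset):

```python
from itertools import chain, combinations

def family_confidences(I, i, supp):
    """Confidences of the rules I' -> i with I' ⊆ I minus {i}, per slide 16."""
    rest = sorted(set(I) - {i})
    subsets = chain.from_iterable(combinations(rest, k)
                                  for k in range(1, len(rest) + 1))
    return {Ip: supp[Ip | {i}] / supp[Ip] for Ip in map(frozenset, subsets)}
```

For I = abcde and i = e, this yields the confidences of abcd → e, abc → e, abd → e and the rest of the family.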
17. [Figure: the subset lattice of abcde, from abcde at the top down to ∅ at the bottom]
18. Observation: comparing a family of rules (cont'd)
- The blue half of the lattice is obtained by appending the item e to each node in the orange half
- The family of rules captures how the item e affects the support of the orange half of the lattice
- Idea:
- We may compare the confidences of rules in a family to find any unusually high or low confidences
- We can use simple statistical tests to perform the comparison; there is no need for complicated statistical models (e.g., the MEM) (see the sketch below)
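One simple instantiation of this idea (ours, not from the slides): treat each rule's confidence as a binomial proportion and flag rules whose confidence deviates from the family's mean confidence by more than a z-score threshold:

```python
from math import sqrt

def flag_unusual(family, z_thresh=2.0):
    """family: list of (confidence, n) pairs, where n is the (absolute)
    support count of the rule's antecedent. Flags confidences far from
    the family mean under a normal approximation to the binomial."""
    p0 = sum(c for c, _ in family) / len(family)  # family mean confidence
    flagged = []
    for c, n in family:
        se = sqrt(p0 * (1 - p0) / n)              # standard error under p0
        if se > 0 and abs(c - p0) / se > z_thresh:
            flagged.append((c, n))
    return flagged
```

Other cheap tests (e.g., a chi-squared test on the underlying contingency tables) would serve equally well; the point is only that a simple test replaces a full statistical model.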
19. Association rule mining on data streams
- In some new applications, data come as a continuous stream
- The sheer volume of a stream over its lifetime is huge
- Queries require timely answers
- Examples:
- Stock ticks
- Network traffic measurements
- A method for finding approximate frequency counts on data streams is proposed in [Manku et al. 02]
20. Goals of the paper
- The algorithm ensures that:
- All itemsets whose true frequency exceeds sN are reported (i.e., no false negatives)
- No itemset whose true frequency is less than (s − ε)N is output
- Estimated frequencies are less than the true frequencies by at most εN
- Some notation:
- Let N denote the current length of the stream
- Let s ∈ (0, 1) denote the support threshold
- Let ε ∈ (0, 1) denote the error tolerance
21. The simple case: finding frequent items
- Each transaction in the stream contains only 1 item
- 2 algorithms were proposed, namely:
- the Sticky Sampling Algorithm
- the Lossy Counting Algorithm
- Features of the algorithms:
- Sampling techniques are used
- Frequency counts found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
- For Lossy Counting, all frequent items are reported
22. Lossy Counting Algorithm
- The incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions
- Counts are kept in a data structure D
- Each entry in D is of the form (e, f, Δ), where
- e is the item
- f is the frequency of e in the stream since the entry was inserted into D
- Δ is the maximum count of e in the stream before e was added to D
23. Lossy Counting Algorithm (cont'd)
1. D ← ∅; N ← 0
2. w ← ⌈1/ε⌉; b ← 1
3. e ← next transaction; N ← N + 1
4. if (e, f, Δ) exists in D do
5. f ← f + 1
6. else do
7. insert (e, 1, b − 1) into D
8. endif
9. if N mod w = 0 do
10. prune(D, b); b ← b + 1
11. endif
12. Goto 3

function prune(D, b)
for each entry (e, f, Δ) in D do
if f + Δ ≤ b do
remove the entry from D
endif
(A runnable sketch follows.)
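A runnable Python sketch (ours) of the pseudocode above; the class and method names are our own:

```python
from math import ceil

class LossyCounter:
    """Lossy Counting for single items, following the pseudocode above."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.w = ceil(1 / epsilon)  # bucket width
        self.b = 1                  # current bucket id
        self.n = 0                  # stream length so far
        self.d = {}                 # item e -> (f, delta)

    def add(self, e):
        self.n += 1
        if e in self.d:
            f, delta = self.d[e]
            self.d[e] = (f + 1, delta)
        else:
            self.d[e] = (1, self.b - 1)
        if self.n % self.w == 0:    # bucket boundary: prune, then advance b
            self.d = {e: (f, delta) for e, (f, delta) in self.d.items()
                      if f + delta > self.b}
            self.b += 1

    def frequent(self, s):
        """Report items with estimated frequency at least (s - ε)N."""
        return [e for e, (f, _) in self.d.items()
                if f >= (s - self.epsilon) * self.n]
```

For example, `lc = LossyCounter(0.01)`, a loop of `lc.add(item)` over the stream, and then `lc.frequent(0.05)` reports the items, with the error guarantees of the next slide.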
24. Lossy Counting
- Lossy Counting guarantees that:
- When a deletion occurs, b ≤ εN (deletions happen only at bucket boundaries, where N = bw ≥ b/ε)
- If an entry (e, f, Δ) is deleted, fe ≤ b, where fe is the actual frequency count of e
- Hence, if an entry (e, f, Δ) is deleted, fe ≤ εN
- Finally, f ≤ fe ≤ f + εN
25. The more complex case: finding frequent itemsets
- The Lossy Counting algorithm is extended to find frequent itemsets
- Transactions in the data stream contain any number of items
- Essentially the same as the case for single items, except:
- Multiple buckets (β of them, say) are processed in a batch
- Each entry in D is of the form (set, f, Δ)
- Transactions read in are (wisely) expanded to their subsets
26. Association rule mining on data streams: food for thought
- Challenges of mining from data streams:
- Fast update
- Data are usually not permanently stored (but may be buffered)
- Fast response to queries
- Minimized resources (e.g., the number of counts kept)
- Possible interesting problems concerning association rule mining on data streams:
- More efficient/accurate algorithms for finding association rules on data streams
- Change mining in frequency counts
27. The lattice structure
- A bottleneck in the algorithm proposed in [Manku et al. 02] is that it needs to expand a transaction to its subsets for counting
- For example, for a transaction abcde, we may need to count the itemsets a, b, c, d, e, ab, ac, ...
- Hence updates are expensive, although queries can be fast (see the sketch below)
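A quick way to see the blow-up (our sketch): a transaction of n items expands into 2^n − 1 non-empty subsets:

```python
from itertools import chain, combinations

def expand(transaction):
    """All non-empty subsets of a transaction, i.e. the candidate
    itemsets a single update may have to touch."""
    items = sorted(transaction)
    return chain.from_iterable(combinations(items, k)
                               for k in range(1, len(items) + 1))

print(sum(1 for _ in expand("abcde")))  # 31 subsets for a 5-item transaction
```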
28. The lattice structure (cont'd)
[Figure: the subset lattice of abcde again, from abcde down to ∅, showing the itemsets a single transaction expands into]
29. Conclusion
- Both association rule interestingness and mining on data streams are challenging problems
- Research on rule interestingness can make association rule mining a more efficient tool for knowledge discovery
- Association rule mining on data streams is an upcoming application and a promising direction for research
30. References
- [Agrawal et al. 94] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. VLDB '94.
- [Jaroszewicz 02] S. Jaroszewicz and D. A. Simovici. Pruning Redundant Association Rules Using Maximum Entropy Principle. PAKDD '02.
- [Klemettinen et al. 94] M. Klemettinen et al. Finding Interesting Rules from Large Sets of Discovered Association Rules. CIKM '94.
- [Manku et al. 02] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB '02.
- [Relue et al. 01] R. Relue, X. Wu and H. Huang. Efficient Runtime Generation of Association Rules. CIKM '01.
- [Rizvi et al. 02] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. VLDB '02.
- [Sahar 99] S. Sahar. Interestingness Via What Is Not Interesting. KDD '99.
31. Q & A