Title: Graduate Course Data Mining
1 Graduate Course Data Mining
2 Data Mining
- Knowledge discovery in databases
- Association Rule
- A ⇒ B
- Transactions containing A tend to also contain the items in B
- Confidence
- The percentage of transactions containing B among the transactions containing A
- Support
- The percentage of transactions that contain both A and B
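The two measures above can be computed directly from a list of transactions. The following is a minimal sketch, not a reference implementation; the function names `support` and `confidence` are illustrative assumptions, and the data is the four-transaction database from the Example slides.

```python
from typing import FrozenSet, List

# The four transactions from the Example slides (TID 100, 200, 300, 400).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset("ACD"),
    frozenset("BCE"),
    frozenset("ABCE"),
    frozenset("BE"),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(a, b, transactions):
    """Among the transactions containing A, the fraction that also contain B.

    Assumes A occurs in at least one transaction (otherwise division by zero).
    """
    both = frozenset(a) | frozenset(b)
    return support(both, transactions) / support(a, transactions)


print(support("BE", TRANSACTIONS))         # 0.75
print(confidence("B", "E", TRANSACTIONS))  # 1.0
```

Here B and E appear together in three of the four transactions, and every transaction containing B also contains E.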
3
- Fast Algorithms for Mining Association Rules
4 Problem Statement
- I = {i1, i2, ..., im} // set of items
- General association rule
- X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅
- Confidence c if c% of the transactions in D that contain X also contain Y
- Support s if s% of the transactions in D contain X ∪ Y
- Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than minsup and minconf, respectively
5 Problem Decomposition
- Find all sets of items (large itemsets) that have transaction support above minsup
- Use the large itemsets to generate the desired rules. For each large itemset l, find all non-empty subsets of l. For every such subset a, output a rule of the form a ⇒ (l - a) if the ratio of support(l) to support(a) is at least minconf.
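The second step can be sketched as follows. This is an illustrative sketch under a few assumptions: `generate_rules` is a hypothetical name, supports are given as absolute counts in a dictionary, and only proper subsets are used as antecedents so that the consequent l - a is never empty. The counts are those of the worked example in these slides.

```python
from itertools import combinations


def generate_rules(large_itemset, support, minconf):
    """Return (antecedent, consequent, confidence) for rules a => (l - a)."""
    l = frozenset(large_itemset)
    rules = []
    for size in range(1, len(l)):  # non-empty proper subsets only
        for subset in combinations(sorted(l), size):
            a = frozenset(subset)
            conf = support[l] / support[a]  # support(l) / support(a)
            if conf >= minconf:
                rules.append((a, l - a, conf))
    return rules


# Support counts from the worked example (4 transactions).
SUPPORT = {
    frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}

for a, c, conf in generate_rules("BCE", SUPPORT, minconf=0.9):
    print(sorted(a), "=>", sorted(c), f"confidence={conf:.2f}")
# ['B', 'C'] => ['E'] confidence=1.00
# ['C', 'E'] => ['B'] confidence=1.00
```

Because support(l) is fixed for a given l, only the supports of its subsets need to be looked up; no further pass over the data is required.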
6 Discovering Large Itemsets
- Requires multiple passes over the data
- In the 1st pass, find all large itemsets of size one.
- In each subsequent pass, start with the seed set of itemsets found to be large in the previous pass, use it to generate the candidate set, then compute the support of each candidate.
- Anti-monotonic property
- If sup(A) ≥ minSup, then sup(A') ≥ minSup for every A' ⊆ A
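The anti-monotonic property can be checked by brute force on a small database. The sketch below assumes the four-transaction example from these slides and a minimum count of 2; the helper names `count` and `all_subsets_large` are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
MIN_COUNT = 2  # min_sup = 0.4 over 4 transactions


def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def all_subsets_large(itemset):
    """True if every non-empty proper subset of `itemset` meets MIN_COUNT."""
    return all(
        count(frozenset(sub)) >= MIN_COUNT
        for size in range(1, len(itemset))
        for sub in combinations(itemset, size)
    )


items = sorted(set().union(*TRANSACTIONS))
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = frozenset(combo)
        if count(s) >= MIN_COUNT:
            assert all_subsets_large(s)  # anti-monotonicity holds
print("every subset of a large itemset is large")
```

The contrapositive is what makes pruning possible: an itemset with any small subset cannot be large, so it never needs to be counted.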
7 Apriori Algorithm
- L1 = {large 1-itemsets}
- for (k = 2; Lk-1 ≠ ∅; k++) do
  - Ck = apriori-gen(Lk-1)
  - forall transactions t ∈ D do
    - Ct = subset(Ck, t) // candidates contained in t
    - forall candidates c ∈ Ct do
      - c.count++
  - end
  - Lk = {c ∈ Ck | c.count ≥ minsup}
- end
- Answer = ∪k Lk
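The loop above can be sketched in Python as follows. This is a minimal illustration rather than the original implementation: `min_count` is assumed to be an absolute transaction count, candidate generation is a simple stand-in for apriori-gen, and the subset function is replaced by a direct containment test.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c: n for c, n in counts.items() if n >= min_count}
    answer = dict(large)

    k = 2
    while large:
        prev = set(large)
        # Stand-in for apriori-gen: union pairs of large (k-1)-itemsets that
        # form a k-itemset, keeping it only if all (k-1)-subsets are large.
        candidates = {
            p | q
            for p in prev for q in prev
            if len(p | q) == k
            and all(frozenset(s) in prev for s in combinations(p | q, k - 1))
        }
        # One pass over the database: count candidates contained in each t.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= min_count}
        answer.update(large)
        k += 1
    return answer


db = ["ACD", "BCE", "ABCE", "BE"]  # example database from these slides
result = apriori(db, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(itemset)), result[itemset])
```

On the example database this yields the large itemsets A, B, C, E, AC, BC, BE, CE and BCE, matching the tables in the Example slides.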
8 AprioriGen
- insert into Ck
- select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
- from Lk-1 p, Lk-1 q
- where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- forall itemsets c ∈ Ck do
  - forall (k-1)-subsets s of c do
    - if (s ∉ Lk-1) then
      - delete c from Ck
- Using Lk-1, generate candidate supersets containing k items
- For each c ∈ Ck, if any subset of c with k-1 items is not in Lk-1, delete c from Ck
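The join and prune steps can be sketched as below. The sketch assumes itemsets are represented as sorted tuples, which mirrors the ordered item1 ... itemk-1 columns of the SQL formulation; the function name `apriori_gen` follows the slide, the rest is illustrative.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Generate candidate k-itemsets from the large (k-1)-itemsets.

    Itemsets are sorted tuples of items.
    """
    prev = sorted(set(large_prev))
    prev_set = set(prev)
    if not prev:
        return []
    k = len(prev[0]) + 1

    # Join step: same first k-2 items, and p's last item < q's last item.
    joined = [
        p + (q[-1],)
        for p in prev for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]
    # Prune step: drop c if any (k-1)-subset of c is not in Lk-1.
    return [
        c for c in joined
        if all(s in prev_set for s in combinations(c, k - 1))
    ]


L2 = [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]
print(apriori_gen(L2))  # [('B', 'C', 'E')]
```

The ordering condition in the join step ensures each candidate is produced once instead of once per pair ordering.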
9 Example
- Item set I = {A, B, C, D, E}
- min_sup = 0.4 (i.e., ≥ 2 transactions)
- D
TID  Items
100  A,C,D
200  B,C,E
300  A,B,C,E
400  B,E
10
itemset  support    itemset  support
A        2/4        A        2/4
B        3/4        B        3/4
C        3/4        C        3/4
D        1/4        E        3/4
E        3/4
11
itemset    itemset  support    itemset  support
A,B        A,B      1/4        A,C      2/4
A,C        A,C      2/4        B,C      2/4
A,E        A,E      1/4        B,E      3/4
B,C        B,C      2/4        C,E      2/4
B,E        B,E      3/4
C,E        C,E      2/4
12
- Pass 3
- sup(B,C,E) = 2 and sup(B,C) = 2
- Thus, the rule B,C ⇒ E holds with confidence 100%
itemset    itemset  support    itemset  support
B,C,E      B,C,E    2/4        B,C,E    2/4
13 AprioriTid
- The principle of Apriori is simple
- Each time the itemset length grows by 1, the whole DB has to be scanned again.
- AprioriTid uses an index instead
- As the passes proceed, the size of the index C̄k shrinks.
14 AprioriTid Algorithm
- L1 = {large 1-itemsets}
- C̄1 = database D
- for (k = 2; Lk-1 ≠ ∅; k++) do begin
  - Ck = apriori-gen(Lk-1) // new candidates
  - C̄k = ∅
  - forall entries t ∈ C̄k-1 do begin ← (1)
    - // determine candidate itemsets in Ck contained in the transaction with identifier t.TID
    - Ct = {c ∈ Ck | (c - c[k]) ∈ t.set-of-itemsets ∧ (c - c[k-1]) ∈ t.set-of-itemsets} ← (2)
    - forall candidates c ∈ Ct do
      - c.count++
    - if (Ct ≠ ∅) then C̄k += <t.TID, Ct>
  - end
  - Lk = {c ∈ Ck | c.count ≥ min_sup}
- end
- Answer = ∪k Lk
- c[k] denotes the k-th item of c; C̄k is the index of <TID, set-of-itemsets> entries
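A compact Python sketch of this algorithm follows. It rests on several assumptions that are not in the slides: itemsets are sorted tuples, the index is a dictionary from TID to a set of itemsets, `min_count` is an absolute count, and `apriori_gen` is a condensed join-and-prune helper.

```python
def apriori_gen(prev):
    """Condensed join + prune over sorted tuples."""
    prev = set(prev)
    joined = {p + (q[-1],) for p in prev for q in prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def apriori_tid(transactions, min_count):
    """Return {itemset tuple: support count}; `transactions` maps TID -> items."""
    # Index for pass 1: each TID paired with the 1-itemsets it contains.
    c_bar = {tid: {(item,) for item in items}
             for tid, items in transactions.items()}
    counts = {}
    for itemsets in c_bar.values():
        for c in itemsets:
            counts[c] = counts.get(c, 0) + 1
    large = {c: n for c, n in counts.items() if n >= min_count}
    answer = dict(large)

    while large:
        candidates = apriori_gen(large)
        counts = {c: 0 for c in candidates}
        next_c_bar = {}
        for tid, itemsets in c_bar.items():  # the raw database is not re-read
            # c is in the transaction iff both generating (k-1)-subsets are:
            # c without its last item, and c without its second-to-last item.
            c_t = {c for c in candidates
                   if c[:-1] in itemsets and c[:-2] + c[-1:] in itemsets}
            for c in c_t:
                counts[c] += 1
            if c_t:  # TIDs that contain no candidate drop out of the index
                next_c_bar[tid] = c_t
        c_bar = next_c_bar
        large = {c: n for c, n in counts.items() if n >= min_count}
        answer.update(large)
    return answer


db = {100: "ACD", 200: "BCE", 300: "ABCE", 400: "BE"}
result = apriori_tid(db, min_count=2)
print(result[("B", "C", "E")])  # 2
```

On the example database the index holds four entries after pass 2 and only TIDs 200 and 300 after pass 3, which is the shrinking behaviour the slides describe.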
15 Example
TID  Set-of-Itemsets    itemset  support    itemset  support
100  A,C,D              A        2/4        A,B      1/4
200  B,C,E              B        3/4        A,C      2/4
300  A,B,C,E            C        3/4        A,E      1/4
400  B,E                E        3/4        B,C      2/4
                                            B,E      3/4
                                            C,E      2/4
16
TID  Set-of-Itemsets                   itemset  support    itemset  support
100  A C                               A C      2/4        B C E    2/4
200  B C, B E, C E                     B C      2/4
300  A B, A C, A E, B C, B E, C E      B E      3/4
400  B E                               C E      2/4
17 Example
TID  Set-of-Itemsets    itemset  support
200  B C E              B C E    2/4
300  B C E
18 AprioriHybrid
- Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets.
- In the later passes, the number of candidate itemsets shrinks. However, Apriori still examines every transaction in the DB. AprioriTid, on the other hand, uses the index.
- Thus, AprioriHybrid runs Apriori in the initial passes and then, once the index C̄k is small enough to fit in memory, switches to AprioriTid in order to reduce disk I/O.
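The switching decision can be sketched as a simple size check. The slides only say that the index must fit in memory; the size estimate used here (one entry per candidate occurrence plus one per transaction) and the names `estimated_index_size`, `should_switch_to_tid` and `memory_budget` are assumptions made for illustration.

```python
def estimated_index_size(candidate_counts, num_transactions):
    """Rough number of entries the index would hold for this pass.

    Assumed estimate: one entry per (candidate, supporting transaction)
    pair, plus one per transaction for its TID.
    """
    return sum(candidate_counts.values()) + num_transactions


def should_switch_to_tid(candidate_counts, num_transactions, memory_budget):
    """True when the estimated index fits in the (hypothetical) budget."""
    return estimated_index_size(candidate_counts, num_transactions) <= memory_budget


# Candidate counts from pass 2 of the example in these slides.
pass2_counts = {"AB": 1, "AC": 2, "AE": 1, "BC": 2, "BE": 3, "CE": 2}
print(estimated_index_size(pass2_counts, 4))      # 15
print(should_switch_to_tid(pass2_counts, 4, 20))  # True
```

The candidate counts are already available at the end of each Apriori pass, so the check adds no extra scan of the database.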