Title: CSE 711 Seminar on Data Mining: Apriori Algorithm
1CSE 711 Seminar on Data Mining: Apriori Algorithm
By Sung-Hyuk Cha
2Association Rules
Definition: Rules that state a statistical
correlation between the occurrence of certain
attributes in a database table.
Given a set of transactions, where each transaction
is a set of items, and given items X1, ..., Xn and Y,
an association rule is an expression
X1, ..., Xn ⇒ Y. This means that the attributes
X1, ..., Xn predict Y.
Intuitive meaning of such a rule: transactions
in the database which contain the items in X tend
also to contain the items in Y.
3Measures for an Association Rule
- Support
- Given the association rule X1, ..., Xn ⇒ Y,
the support is the percentage of records for
which X1, ..., Xn and Y both hold.
- The statistical significance of the association
rule.
- Confidence
- Given the association rule X1, ..., Xn ⇒ Y,
the confidence is the percentage of records for
which Y holds, within the group of records for
which X1, ..., Xn hold.
- The degree of correlation in the dataset between
X and Y.
- A measure of the rule's strength.
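The two measures above can be sketched in a few lines of Python. The transaction table below is hypothetical (it is not the quiz database), and the function names `support` and `confidence` are chosen only for this illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(X and Y together) / support(X) for the rule X => Y."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support(transactions, xy) / support(transactions, x)


# Hypothetical transaction table, made up for this sketch.
transactions = [
    frozenset("ACD"),
    frozenset("BCE"),
    frozenset("ABCE"),
    frozenset("BE"),
]

print(support(transactions, "BE"))         # B and E together: 3 of 4 records
print(confidence(transactions, "B", "E"))  # share of B-records that also hold E
```

Note that `confidence` divides by the support of the antecedent, so this sketch assumes the antecedent occurs in at least one transaction.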
4Quiz 2
Problem: Given a transaction table D, find the
support and confidence for an association rule
B, D ⇒ E.
Database D
Answer: support = 3/7, confidence = 3/4
5Apriori Algorithm
An efficient algorithm to find association rules.
- Procedure
- Find all the frequent itemsets
- Use the frequent itemsets to generate the
association rules
A frequent itemset is a set of items that has
support greater than a user-defined minimum.
6Notation
7Example
Suppose a user-defined minimum support of .49.

(k = 1) itemset
C1        Support   L1
A         .50       Y
B         .75       Y
C         .75       Y
D         .25       N
E         .75       Y

(k = 2) itemset
C2        Support   L2
A,B       .25       N
A,C
A,E
B,C
B,E
C,E

(k = 3) itemset
C3        Support   L3
B,C,E

(k = 4) itemset
C4        Support   L4
A,B,C,E
2
n items implies O(n - 2) computational
complexity?
8Procedure
Apriorialgo()
  F = ∅
  L1 = {frequent 1-itemsets}
  k = 2   /* k represents the pass number. */
  while (Lk-1 != ∅)
    F = F ∪ Lk-1
    Ck = New candidates of size k generated from Lk-1
    for all transactions t ∈ D
      increment the count of all candidates in Ck that
      are contained in t
    Lk = All candidates in Ck with minimum support
    k = k + 1
  return ( F )
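The procedure above could be written in Python roughly as follows. This is a minimal sketch, not a tuned implementation: the function name `apriori`, the strict "greater than" support test, and the small transaction list are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    n = len(transactions)
    items = {item for t in transactions for item in t}

    def frequent(candidates):
        # One pass over the data: count the candidates contained in each t.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt / n for c, cnt in counts.items() if cnt / n > min_support}

    level = frequent({frozenset([i]) for i in items})  # L1
    result = {}
    k = 2  # pass number
    while level:
        result.update(level)  # F = F U L(k-1)
        prev = set(level)
        # Join: unions of two (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)  # Lk
        k += 1
    return result


# Hypothetical transactions, invented for this sketch.
transactions = [frozenset("ACD"), frozenset("BCE"),
                frozenset("ABCE"), frozenset("BE")]
for itemset, supp in sorted(apriori(transactions, 0.49).items(),
                            key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

Each loop iteration makes one pass over the transactions, which is why k is called the pass number.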
9Candidate Generation
Given Lk-1, the set of all frequent
(k-1)-itemsets, generate a superset of the set of
all frequent k-itemsets.
Idea: if an itemset X has minimum support, so do
all subsets of X.
1. Join Lk-1 with Lk-1.
2. Prune: delete all itemsets c ∈ Ck such that some
(k-1)-subset of c is not in Lk-1.
ex) L2 = { {A,C}, {B,C}, {B,E}, {C,E} }
1. Join: {A,B,C}, {A,C,E}, {B,C,E}
2. Prune: {A,B,C} and {A,C,E} are deleted because
{A,B} and {A,E} are not in L2, leaving {B,C,E}.
Instead of 5C3 = 10 candidates, we have only 1
candidate.
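The join and prune steps can be sketched as below. The join used here unions any two (k-1)-itemsets whose union has exactly k items, which reproduces the three joined itemsets of the slide example; stricter join conditions exist and would produce fewer intermediate itemsets. The name `generate_candidates` is hypothetical.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Build C_k from L_(k-1) with a join step and a prune step."""
    prev = {frozenset(s) for s in prev_frequent}
    # Join: union any two (k-1)-itemsets whose union has exactly k items.
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune: drop c if some (k-1)-subset of c is missing from L_(k-1).
    pruned = {c for c in joined
              if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    return joined, pruned


L2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
joined, pruned = generate_candidates(L2, 3)
print(sorted("".join(sorted(c)) for c in joined))  # ['ABC', 'ACE', 'BCE']
print(sorted("".join(sorted(c)) for c in pruned))  # ['BCE']
```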
10Thoughts
Association rules are always defined on binary
attributes. ⇒ Need to flatten the tables.
ex) Phone Company DB.
Original schema: CID, Gender, Ethnicity, Call
Flattened schema: CID, M, F, W, B, H, A, D, I
- Support for Asian ethnicity will never exceed
.5.
- No need to consider itemsets {M,F}, {W,B}
nor {D,I}.
- M ⇒ F or D ⇒ I are not of interest at all.
Considering the original schema before
flattening may be a good idea.
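As an illustration of the flattening step, the sketch below turns one categorical record into binary attributes. The column letters follow the slide; the record itself and the helper name `flatten` are invented for this example.

```python
def flatten(record, schema):
    """Turn one categorical record into 0/1 attributes, one per value."""
    flat = {}
    for attribute, values in schema.items():
        for value in values:
            flat[value] = 1 if record[attribute] == value else 0
    return flat


# Attribute values follow the column letters on the slide.
schema = {
    "Gender": ["M", "F"],
    "Ethnicity": ["W", "B", "H", "A"],
    "Call": ["D", "I"],
}

# Hypothetical customer record, invented for illustration.
record = {"CID": 1, "Gender": "F", "Ethnicity": "A", "Call": "I"}
print(flatten(record, schema))
```

Because exactly one value per original attribute is set to 1, itemsets such as {M,F} can never occur in a flattened record, which is the point made in the bullets above.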
11Finding association rules with item constraints
When item constraints are considered, the Apriori
candidate generation procedure does not generate
all the potential frequent itemsets as
candidates.
- Procedure
- 1. Find all the frequent itemsets that satisfy
the boolean expression B.
- 2. Find the support of all subsets of frequent
itemsets that do not satisfy B.
- 3. Generate the association rules from the
frequent itemsets found in Step 1 by computing
confidences from the frequent itemsets found in
Steps 1 and 2.
12Additional Notation
B: Boolean expression with m disjuncts:
B = D1 ∨ D2 ∨ ... ∨ Dm
Di: n conjuncts in Di:
Di = ai,1 ∧ ai,2 ∧ ... ∧ ai,n
S: Set of items such that any itemset that satisfies
B contains an item from S.
Ls(k): Set of frequent k-itemsets that contain an item
in S.
Lb(k): Set of frequent k-itemsets that satisfy B.
Cs(k): Set of candidate k-itemsets that contain an item
in S.
Cb(k): Set of candidate k-itemsets that satisfy B.
13Direct Algorithm
- Procedure
- 1. Scan the data and determine L1 and F.
- 2. Find Lb(1).
- 3. Generate Cb(k+1) from Lb(k):
- 3-1. Ck+1 = Lb(k) x F
- 3-2. Delete all candidates in Ck+1 that do not
satisfy B.
- 3-3. Delete all candidates in Ck+1 below the
minimum support.
- 3-4. For each Di with exactly k+1 non-negated
elements, add the itemset to Ck+1 if all the
items are frequent.
14Example
Given B = (A ∧ B) ∨ (C ∧ ¬E)
steps 1 and 2:
C1 = {A, B, C, D, E}
L1 = {A, B, C, E}
Lb(1) = {C}
step 3-1: C2 = Lb(1) x F = {A,C}, {B,C}, {C,E}
step 3-2: Cb(2) = {A,C}, {B,C}
step 3-3: L2 = {A,C}, {B,C}
step 3-4: Lb(2) = {A,B}, {A,C}, {B,C}
step 3-1: C3 = Lb(2) x F = {A,B,C}, {A,B,E}, {A,C,E},
{B,C,E}
step 3-2: Cb(3) = {A,B,C}, {A,B,E}
step 3-3: L3 = ∅
step 3-4: Lb(3) = ∅
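The filtering in step 3-2 of this example can be checked with a short sketch. The encoding of B as a list of disjuncts, each a list of (item, required) pairs, is an assumption made for this illustration, as are the names `satisfies` and `expr_b`.

```python
def satisfies(itemset, dnf):
    """True if `itemset` satisfies a boolean expression given in DNF.

    `dnf` is a list of disjuncts; each disjunct is a list of
    (item, required) pairs, where required=False encodes a negated item.
    """
    items = set(itemset)
    return any(
        all((item in items) == required for item, required in disjunct)
        for disjunct in dnf
    )


# B = (A and B) or (C and not E), as in the example.
expr_b = [[("A", True), ("B", True)], [("C", True), ("E", False)]]

C2 = [{"A", "C"}, {"B", "C"}, {"C", "E"}]
C3 = [{"A", "B", "C"}, {"A", "B", "E"}, {"A", "C", "E"}, {"B", "C", "E"}]
print([sorted(c) for c in C2 if satisfies(c, expr_b)])  # [['A', 'C'], ['B', 'C']]
print([sorted(c) for c in C3 if satisfies(c, expr_b)])  # [['A', 'B', 'C'], ['A', 'B', 'E']]
```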
15MultipleJoins and Reorder algorithms to find
association rules with item constraints will be
added.
16Mining Sequential Patterns
Given a database D of customer transactions, the
problem of mining sequential patterns is to find
the maximal sequences among all sequences that
have a certain user-specified minimum support.
- A transaction-time field is added.
- A sequence of itemsets is denoted as
<s1, s2, ..., sn>.
17Sequence Version of DB Conversion
D
CustomerID   Transaction Time   Items
1            Jun 25 93          30
1            Jun 30 93          90
2            Jun 10 93          10, 20
2            Jun 15 93          30
2            Jun 20 93          40, 60, 70
3            Jun 25 93          30, 50, 70
4            Jun 25 93          30
4            Jun 30 93          40, 70
4            July 25 93         90
5            Jun 12 93          90

Sequential version of D
CustomerID   Customer Sequence
1            <(30), (90)>
2            <(10 20), (30), (40 60 70)>
3            <(30 50 70)>
4            <(30), (40 70), (90)>
5            <(90)>

Customer sequence: all the transactions of a
customer form a sequence ordered by increasing
transaction time.
Answer set with support > .25: <(30), (90)>,
<(30), (40 70)>
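The conversion from D to its sequential version amounts to grouping by customer and sorting by transaction time. The sketch below assumes that "93" on the slide means the year 1993; the function name `customer_sequences` is hypothetical.

```python
from collections import defaultdict
from datetime import date

# (customer id, transaction time, items) rows from database D.
# Assumption: the two-digit year 93 on the slide stands for 1993.
D = [
    (1, date(1993, 6, 25), {30}),
    (1, date(1993, 6, 30), {90}),
    (2, date(1993, 6, 10), {10, 20}),
    (2, date(1993, 6, 15), {30}),
    (2, date(1993, 6, 20), {40, 60, 70}),
    (3, date(1993, 6, 25), {30, 50, 70}),
    (4, date(1993, 6, 25), {30}),
    (4, date(1993, 6, 30), {40, 70}),
    (4, date(1993, 7, 25), {90}),
    (5, date(1993, 6, 12), {90}),
]


def customer_sequences(rows):
    """Group rows by customer and order each group by transaction time."""
    grouped = defaultdict(list)
    for cid, when, items in rows:
        grouped[cid].append((when, frozenset(items)))
    return {cid: [items for _, items in sorted(txns, key=lambda p: p[0])]
            for cid, txns in grouped.items()}


sequences = customer_sequences(D)
for cid in sorted(sequences):
    print(cid, [sorted(s) for s in sequences[cid]])
```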
18Definitions
Def 1. A sequence <a1, a2, ..., an> is contained
in another sequence <b1, b2, ..., bm> if there
exist integers i1 < i2 < ... < in such that
a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
ex) Is <(3), (4 5), (8)> contained in
<(7), (3 8), (9), (4 5 6), (8)>? Yes.
Is <(3), (5)> contained in <(3 5)>? No.
Def 2. A sequence s is maximal if s is
not contained in any other sequence.
- Ti is transaction time.
- itemset(Ti) is the set of items in
transaction Ti.
- litemset: an itemset with minimum support.
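Def 1 can be tested mechanically. The sketch below represents a sequence as a list of Python sets and uses greedy left-to-right matching; the function name `is_contained` is chosen for this illustration.

```python
def is_contained(a, b):
    """True if sequence `a` is contained in sequence `b` (Def 1).

    Both are lists of itemsets. Greedy left-to-right matching is enough:
    taking the earliest usable element of `b` never hurts later matches.
    """
    j = 0
    for a_i in a:
        # Advance in b until an element that is a superset of a_i.
        while j < len(b) and not set(a_i) <= set(b[j]):
            j += 1
        if j == len(b):
            return False
        j += 1  # the next a_i must match a strictly later element of b
    return True


print(is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_contained([{3}, {5}], [{3, 5}]))                                     # False
```

The second call is False because items 3 and 5 sit in the same element of the longer sequence, while Def 1 requires strictly increasing positions.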
19Procedure
- Procedure
- 1. Convert D into a database D' of customer
sequences.
- 2. Litemset mapping
- 3. Transform each customer sequence into a
litemset representation:
<s1, s2, ..., sn> ⇒ <l1, l2, ..., ln>
- 4. Find the desired sequences using the set of
litemsets.
- 4-1. AprioriAll
- 4-2. AprioriSome
- 4-3. DynamicSome
- 5. Find the maximal sequences among the set of
large sequences.
- for (k = n; k > 1; k--)
- foreach k-sequence sk
- delete from S all subsequences of sk.
20Example
step 2
Large Itemsets   Mapped to
(30)             1
(40)             2
(70)             3
(40 70)          4
(90)             5

step 3
CID   Customer Sequence             Transformed Sequence                    Mapping
1     <(30), (90)>                  <{(30)} {(90)}>                         <{1} {5}>
2     <(10 20), (30), (40 60 70)>   <{(30)} {(40), (70), (40 70)}>          <{1} {2, 3, 4}>
3     <(30 50 70)>                  <{(30), (70)}>                          <{1, 3}>
4     <(30), (40 70), (90)>         <{(30)} {(40), (70), (40 70)} {(90)}>   <{1} {2, 3, 4} {5}>
5     <(90)>                        <{(90)}>                                <{5}>
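Steps 2 and 3 of this example can be reproduced with a short sketch. The mapping is the one shown in step 2; the helper name `transform` is hypothetical, and dropping transactions that contain no litemset follows what the example does with (10 20).

```python
# Step 2 mapping from the example: large itemset -> integer id.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(sequence, ids):
    """Replace each transaction by the ids of the litemsets it contains.

    Transactions that contain no litemset are dropped.
    """
    out = []
    for transaction in sequence:
        found = {i for litemset, i in ids.items() if litemset <= set(transaction)}
        if found:
            out.append(found)
    return out


customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
for cid, seq in customers.items():
    print(cid, transform(seq, litemset_ids))
```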
21AprioriAll
Aprioriall()
  F = ∅
  L1 = {frequent 1-itemsets}
  k = 2   /* k represents the pass number. */
  while (Lk-1 != ∅)
    F = F ∪ Lk-1
    Ck = New candidates of size k generated from Lk-1
    for each customer-sequence c ∈ D
      increment the count of all candidates in Ck that
      are contained in c
    Lk = All candidates in Ck with minimum support
    k = k + 1
  return ( F )
22Example
Minimum support .40

Customer Seqs.
<{1 5} {2} {3} {4}>
<{1} {3} {4} {3 5}>
<{1} {2} {3} {4}>
<{1} {3} {5}>
<{4} {5}>

L1          Supp
<1>         4
<2>         2
<3>         4
<4>         4
<5>         4

L2          Supp
<1 2>       2
<1 3>       4
<1 4>       3
<1 5>       2
<2 3>       2
<2 4>       2
<3 4>       3
<3 5>       2
<4 5>       2

L3          Supp
<1 2 3>     2
<1 2 4>     2
<1 3 4>     3
<1 3 5>     2
<2 3 4>     2

C4
<1 2 3 4>
<1 2 4 3>
<1 3 4 5>
<1 3 5 4>

L4          Supp.
<1 2 3 4>   2

The maximal large sequences are <1 2 3 4>,
<1 3 5>, <4 5>.
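The support counts and the final maximal-sequence step of this example can be checked with the sketch below. The customer sequences are written as lists of sets of litemset ids, as read from the example; the helper names `contains`, `is_subsequence` and `maximal` are hypothetical.

```python
def contains(customer_seq, candidate):
    """True if the litemset ids in `candidate` occur, in order, in
    strictly later and later elements of `customer_seq`."""
    j = 0
    for litemset_id in candidate:
        while j < len(customer_seq) and litemset_id not in customer_seq[j]:
            j += 1
        if j == len(customer_seq):
            return False
        j += 1
    return True


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting entries."""
    it = iter(long)
    return all(x in it for x in short)


def maximal(sequences):
    """Keep only sequences not contained in another sequence of the set."""
    return [s for s in sequences
            if not any(s != t and is_subsequence(s, t) for t in sequences)]


# Customer sequences from the example, after the litemset mapping.
customers = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large = [(1,), (2,), (3,), (4,), (5,),
         (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
         (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
         (1, 2, 3, 4)]
support = {s: sum(contains(c, s) for c in customers) for s in large}
print(support[(1, 2, 3, 4)])
print(maximal(large))
```

With five customers, a minimum support of .40 corresponds to a count of at least 2.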
23AprioriSome and DynamicSome algorithms to find
association rules with sequential patterns will
be added.
24GSC features
Gradient (local), Structural (intermediate), and
Concavity (global)
25GSC feature table
26A sample GSC features
Gradient 000000000011000000001100001110000000111
0000000110000001100010000000011000000000000011100
1100011111000011110000000010 010100000100011100111
11001111100000100000100000000000000000000 01000001
001000 (192) Structure
00000000000000000000110000111000100001000010000001
0000 000000000100101000000000011000010100110000110
000000000000100100 0110011000000000000001100101000
00000000001100000000000000000000 000000010000
(192) Concavity 1111011010011111011001100
0000110111101101001100100000 110000011100000000000
000000000000000000000000000000111111100000 0000000
00000 (128)
27Class A
800 samples
28Class A, B and C
A
B
C
29Reordered by Frequency
A
B
C
30Associate Rules in GSC
- G ⇒ S, G ⇒ C
- F1, F2, F3 ⇒ A
- F1 ⇒ F2
31References
Agrawal, R., Imielinski, T. and Swami, A.
Mining Association Rules between Sets of Items
in Large Databases. Proc. of the ACM SIGMOD
Conference on Management of Data, 207-216, 1993.
Agrawal, R. and Srikant, R. Fast
Algorithms for Mining Association Rules in Large
Databases. Proc. of the 20th Int'l Conference on
Very Large Databases, 487-499, Sept. 1994.
Agrawal, R. and Srikant, R. Mining
Sequential Patterns. Research Report RJ 9910,
IBM Almaden Research Center, San Jose,
California, October 1994.
Agrawal, R. and Shafer, J. Parallel Mining of
Association Rules. IEEE Transactions on Knowledge
and Data Engineering, 8(6), 1996.
32MultipleJoins Algorithm
Generate the Selected Item Set()
  S = ∅
  for each Di, i = 1 to m
    for each ai,j, j = 1 to n
      cost of conjunct =
        support(S ∪ ai,j) - support(ai,j)
    Add ai,j with the minimum cost to S
- Procedure
- 1. Scan the data and determine F.
- 2. Ls(1) = S ∩ F
- ...
B = (A ∧ B) ∨ (C ∧ ¬E)
The algorithm gives S = {A, C}.