Title: Huffman Codes and Association Rules (II)
1. Huffman Codes and Association Rules (II)
Lecture 15
- Prof. Sin-Min Lee
- Department of Computer Science
2. Huffman Code Example
- Given the symbols and frequencies:
  A B C D E
  3 1 2 4 6
- Sorting in increasing order of frequency, this becomes:
  B C A D E
  1 2 3 4 6
3. Huffman Code Example: Step 1
- Because B and C have the lowest frequencies, they are merged into a single node BC. The new node's value is 1 + 2 = 3.
4. Huffman Code Example: Step 2
- Reorder the list in increasing order again. This gives us:
  BC A D E
  3  3 4 6
5. Huffman Code Example: Step 3
- Merging the two lowest values again, BC (3) and A (3), gives a new node ABC with value 6.
6. Huffman Code Example: Step 4
- From the initial BC A D E list we now have D (4), E (6), and ABC (6). Because of the tie between E and ABC (and because ABC can equally be written BCA), several equivalent orderings are possible:
  D E ABC
  4 6 6
  D E BCA
  4 6 6
  D ABC E
  4 6 6
  D BCA E
  4 6 6
7. Huffman Code Example: Step 5
- Continuing from the previous step, D (4) is merged with the adjacent node of value 6, giving a node of value 10. Depending on the ordering chosen in Step 4, the resulting lists include:
  E DBCA
  6 10
  DABC E
  10 6
8. Huffman Code Example: Step 6
- Continuing from the previous step, only two nodes remain in each variant, e.g. E (6) and DBCA (10), or E (6) and DABC (10). Merging them produces the root with value 16 = 1 + 2 + 3 + 4 + 6.
9. Huffman Code Example: Step 7
- After the previous step, we're supposed to map a 1 to each right branch and a 0 to each left branch. Reading the bits from the root down to a leaf gives that symbol's code.
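A minimal sketch of this merge procedure in Python, using heapq for the repeated "take the two smallest" step. The tie-breaking order is an assumption (ties here break by insertion order), so the exact 0/1 codes may differ from the tree drawn in lecture, but the total encoding cost is the same optimal value.

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman codes by repeatedly merging the two lowest-weight
    nodes. Ties break by insertion order, so the codes may differ from
    the lecture's tree while the total cost stays optimal."""
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    parent = {}  # child node label -> (branch bit, parent node label)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # smallest weight -> 0 branch
        w2, _, right = heapq.heappop(heap)  # next smallest  -> 1 branch
        merged = left + right
        parent[left] = ('0', merged)
        parent[right] = ('1', merged)
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    codes = {}
    for sym in freqs:
        bits, node = '', sym
        while node in parent:               # walk up to the root
            bit, node = parent[node]
            bits = bit + bits
        codes[sym] = bits
    return codes

print(huffman_codes({'A': 3, 'B': 1, 'C': 2, 'D': 4, 'E': 6}))
```

On the lecture's frequencies this yields codes of lengths 2, 3, 3, 2, 2 for A, B, C, D, E, for a total cost of 35 bits, matching the optimum of the tree built in Steps 1 through 7.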
10. Example
- Items: {milk, coke, pepsi, beer, juice}.
- Support threshold: 3 baskets.
- B1 = {m, c, b}    B2 = {m, p, j}    B3 = {m, b}       B4 = {c, j}
- B5 = {m, p, b}    B6 = {m, c, b, j} B7 = {c, b, j}    B8 = {b, c}
- Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
11. Association Rules
- Association rule R: Itemset1 → Itemset2
- Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty
- Meaning: if a transaction includes Itemset1, then it also has Itemset2
- Examples:
  - {A, B} → {E, C}
  - {A} → {B, C}
12. Example
- B1 = {m, c, b}    B2 = {m, p, j}
- B3 = {m, b}       B4 = {c, j}
- B5 = {m, p, b}    B6 = {m, c, b, j}
- B7 = {c, b, j}    B8 = {b, c}
- An association rule: {m, b} → c.
- Confidence = 2/4 = 50%.
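A quick check of the {m, b} → c rule on the eight baskets above (basket contents transcribed from the slide):

```python
baskets = [
    {'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
    {'m', 'p', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'},
]

antecedent, consequent = {'m', 'b'}, {'c'}
with_ante = [B for B in baskets if antecedent <= B]    # 4 baskets contain {m, b}
with_both = [B for B in with_ante if consequent <= B]  # 2 of those also contain c
print(len(with_both) / len(with_ante))                 # 0.5 -> 50%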
17. From Frequent Itemsets to Association Rules
- Q: Given the frequent set {A, B, E}, what are the possible association rules?
  - A → B, E
  - A, B → E
  - A, E → B
  - B → A, E
  - B, E → A
  - E → A, B
  - ∅ → A, B, E (the empty rule), i.e. true → A, B, E
18. Classification vs. Association Rules
- Classification Rules:
  - Focus on one target field
  - Specify a class in all cases
  - Measure: accuracy
- Association Rules:
  - Many target fields
  - Applicable in some cases
  - Measures: support, confidence, lift
19. Rule Support and Confidence
- Suppose R: I → J is an association rule
- sup(R) = sup(I ∪ J) is the support count
  - the support of the itemset I ∪ J (the items of I together with the items of J)
- conf(R) = sup(I ∪ J) / sup(I) is the confidence of R
  - the fraction of transactions containing I that also contain J
- Association rules with minimum support and confidence are sometimes called strong rules
20. Association Rules Example
- Q: Given the frequent set {A, B, E}, what association rules have minsup = 2 and minconf = 50%?
- A, B → E : conf = 2/4 = 50%
- A, E → B : conf = 2/2 = 100%
- B, E → A : conf = 2/2 = 100%
- E → A, B : conf = 2/2 = 100%
- Don't qualify:
  - A → B, E : conf = 2/6 = 33% < 50%
  - B → A, E : conf = 2/7 = 29% < 50%
  - ∅ → A, B, E : conf = 2/9 = 22% < 50%
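This filtering can be scripted. A sketch assuming the support counts implied by the slide's fractions (sup{A,B,E} = 2, sup{A,B} = 4, sup{A} = 6, and so on) are stored in a dict:

```python
from itertools import combinations

# Support counts as quoted on the slide (assumed, not recomputed here).
sup = {frozenset('ABE'): 2, frozenset('AB'): 4, frozenset('AE'): 2,
       frozenset('BE'): 2, frozenset('A'): 6, frozenset('B'): 7,
       frozenset('E'): 2}

itemset, minconf = frozenset('ABE'), 0.5
for r in range(1, len(itemset)):            # every nonempty proper LHS
    for lhs in combinations(sorted(itemset), r):
        lhs = frozenset(lhs)
        conf = sup[itemset] / sup[lhs]
        status = 'keep' if conf >= minconf else 'drop'
        print(f"{set(lhs)} -> {set(itemset - lhs)}: conf={conf:.0%} {status}")
```

Running it keeps the four rules above and drops the three below the 50% threshold.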
21. Find Strong Association Rules
- A rule has the parameters minsup and minconf:
  - sup(R) ≥ minsup and conf(R) ≥ minconf
- Problem:
  - Find all association rules with the given minsup and minconf
- First, find all frequent itemsets
22. Finding Frequent Itemsets
- Start by finding one-item sets (easy)
- Q: How?
- A: Simply count the frequencies of all items
23. Finding Itemsets: Next Level
- Apriori algorithm (Agrawal & Srikant)
- Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, and so on
- If (A, B) is a frequent itemset, then (A) and (B) have to be frequent itemsets as well!
- In general: if X is a frequent k-itemset, then all (k-1)-item subsets of X are also frequent
- Compute the k-itemset candidates by merging (k-1)-itemsets, as the sketch below shows
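A sketch of that levelwise candidate generation (the join and prune steps of apriori-gen). Representing itemsets as sorted tuples is one common choice, not necessarily the lecture's:

```python
from itertools import combinations

def apriori_gen(Lk_1):
    """Join frequent (k-1)-itemsets sharing their first k-2 items,
    then prune candidates that have an infrequent (k-1)-subset."""
    prev = set(Lk_1)
    candidates = set()
    for a in Lk_1:
        for b in Lk_1:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:   # join step
                c = a + (b[-1],)
                # prune step: every (k-1)-subset must be frequent
                if all(s in prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return sorted(candidates)

L2 = [(1, 3), (2, 3), (2, 5), (3, 5)]
print(apriori_gen(L2))   # [(2, 3, 5)]
```

The example input is the L2 from the worked example later in the deck; only (2, 3, 5) survives both the join and the prune.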
25. Finding Association Rules
- A typical question: find all association rules with support ≥ s and confidence ≥ c.
- Note: the support of an association rule is the support of the set of items it mentions.
- Hard part: finding the high-support (frequent) itemsets.
- Checking the confidence of association rules involving those sets is relatively easy.
26. Naïve Algorithm
- A simple way to find frequent pairs is:
- Read the file once, counting in main memory the occurrences of each pair (see the sketch below).
- Expand each basket of n items into its n(n-1)/2 pairs.
- Fails if the number of items squared exceeds main memory.
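A sketch of that single-pass pair count. The Counter keys are exactly the n(n-1)/2 pairs per basket, which is the memory bottleneck the slide warns about:

```python
from collections import Counter
from itertools import combinations

def count_pairs(baskets):
    """One pass over the data, tallying every pair in every basket.
    Memory grows with the number of distinct pairs, so this fails
    when (number of items)^2 outgrows main memory."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

baskets = [{'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}]
print(count_pairs(baskets).most_common(3))
```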
28. [Diagram: the Apriori pipeline. The first pass constructs candidate set C1 and filters it to the frequent set L1; L1 is used to construct C2, which the second pass filters to L2, from which C3 is constructed, and so on.]
29. Agrawal & Srikant '94
"Fast Algorithms for Mining Association Rules," by Rakesh Agrawal and Ramakrishnan Srikant, IBM Almaden Research Center
39. Example: Candidate and Frequent Itemsets (C1, L1, C2, L2, C3)

Database:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

C1 (set-of-itemsets per TID):
TID | Set of itemsets
100 | {1}, {3}, {4}
200 | {2}, {3}, {5}
300 | {1}, {2}, {3}, {5}
400 | {2}, {5}

L1:
Itemset | Support
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2 (candidate 2-itemsets): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C2 (set-of-itemsets per TID):
TID | Set of itemsets
100 | {1 3}
200 | {2 3}, {2 5}, {3 5}
300 | {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
400 | {2 5}

L2:
Itemset | Support
{1 3}   | 2
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

C3 (candidate 3-itemsets): {2 3 5}

C3 (set-of-itemsets per TID):
TID | Set of itemsets
200 | {2 3 5}
300 | {2 3 5}

L3:
Itemset | Support
{2 3 5} | 2
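The L1/L2/L3 counts in these tables can be reproduced with a plain subset count over the four transactions. A sketch using brute-force counting rather than the per-TID candidate bookkeeping the figure shows:

```python
from itertools import combinations

db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
minsup = 2

def frequent_k(k):
    """All k-itemsets meeting minsup, by counting subset containment."""
    items = sorted(set().union(*db.values()))
    out = {}
    for cand in combinations(items, k):
        count = sum(1 for t in db.values() if set(cand) <= t)
        if count >= minsup:
            out[cand] = count
    return out

for k in (1, 2, 3):
    print(f"L{k}:", frequent_k(k))
```

The output matches the tables: L1 = {1}:2, {2}:3, {3}:3, {5}:3; L2 = {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2; L3 = {2 3 5}:2.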
43. Dynamic Programming Approach
- We want a proof of the principle of optimality and of overlapping subproblems.
- Principle of Optimality:
  - The optimal solution for Lk includes the optimal solution for Lk-1
  - Proof by contradiction
- Overlapping Subproblems:
  - Lemma: every subset of a frequent itemset is a frequent itemset
  - Proof by contradiction
56. The Apriori Algorithm: Example
- Consider a database, D, consisting of 9 transactions.
- Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
- Let the minimum confidence required be 70%.
- We first have to find the frequent itemsets using the Apriori algorithm.
- Then, association rules will be generated using min. support and min. confidence.

TID  | List of Items
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3
57. Step 1: Generating the 1-itemset Frequent Pattern
- Scan D for the count of each candidate, then compare each candidate's support count with the minimum support count.

C1 = L1 (every item meets minimum support):
Itemset | Sup. Count
{I1}    | 6
{I2}    | 7
{I3}    | 6
{I4}    | 2
{I5}    | 2

- In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
- The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
58. Step 2: Generating the 2-itemset Frequent Pattern
- Generate the C2 candidates from L1, scan D for the count of each candidate, then compare the counts with the minimum support count.

C2 (candidates from L1 Join L1, with counts from scanning D):
Itemset  | Sup. Count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0

L2 (candidates meeting minimum support):
Itemset  | Sup. Count
{I1, I2} | 4
{I1, I3} | 4
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
59. Step 2: Generating the 2-itemset Frequent Pattern (Cont.)
- To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2.
- Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the C2 table above).
- The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
- Note: we haven't used the Apriori Property yet.
60. Step 3: Generating the 3-itemset Frequent Pattern
- Scan D for the count of each candidate, then compare candidate support counts with the minimum support count.

C3 (after pruning) and L3:
Itemset      | Sup. Count
{I1, I2, I3} | 2
{I1, I2, I5} | 2

- The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori Property.
- In order to find C3, we compute L2 Join L2.
- C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
- Now the Join step is complete, and the Prune step will be used to reduce the size of C3. The Prune step helps to avoid heavy computation due to a large Ck.
61. Step 3: Generating the 3-itemset Frequent Pattern (Cont.)
- Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
- For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
- Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.
- BUT {I3, I5} is not a member of L2, and hence it is not frequent, violating the Apriori Property. Thus we have to remove {I2, I3, I5} from C3.
- Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the Join result during pruning (see the sketch below).
- Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
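Running the same join-and-prune over this example's L2 (with items encoded as strings in sorted tuples, an assumed representation) reproduces exactly the pruning just described:

```python
from itertools import combinations

L2 = [('I1', 'I2'), ('I1', 'I3'), ('I1', 'I5'),
      ('I2', 'I3'), ('I2', 'I4'), ('I2', 'I5')]
prev = set(L2)

# Join step: merge pairs sharing their first item.
C3_join = sorted(a + (b[-1],) for a in L2 for b in L2
                 if a[:-1] == b[:-1] and a[-1] < b[-1])
print(C3_join)
# [('I1','I2','I3'), ('I1','I2','I5'), ('I1','I3','I5'),
#  ('I2','I3','I4'), ('I2','I3','I5'), ('I2','I4','I5')]

# Prune step: keep candidates whose 2-item subsets are all in L2.
C3 = [c for c in C3_join
      if all(s in prev for s in combinations(c, 2))]
print(C3)   # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```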
62. Step 4: Generating the 4-itemset Frequent Pattern
- The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {I1, I2, I3, I5}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
- Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori Algorithm.
- What's next?
- These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).
63. Step 5: Generating Association Rules from Frequent Itemsets
- Procedure:
  - For each frequent itemset l, generate all nonempty subsets of l.
  - For every nonempty subset s of l, output the rule s → (l - s) if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
- Back to the example:
  - We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I1, I2, I3}, {I1, I2, I5}}.
  - Let's take l = {I1, I2, I5}.
  - Its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}, enumerated programmatically in the sketch below.
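The subset enumeration and confidence test of this procedure, sketched with the support counts (sc) taken from the tables above:

```python
from itertools import combinations

# Support counts from the worked example's C1/C2/C3 tables.
sc = {frozenset(k): v for k, v in [
    (('I1',), 6), (('I2',), 7), (('I5',), 2),
    (('I1', 'I2'), 4), (('I1', 'I5'), 2), (('I2', 'I5'), 2),
    (('I1', 'I2', 'I5'), 2)]}

l, min_conf = frozenset({'I1', 'I2', 'I5'}), 0.7
for r in range(1, len(l)):                  # all nonempty proper subsets s
    for s in combinations(sorted(l), r):
        s = frozenset(s)
        conf = sc[l] / sc[s]
        verdict = 'Selected' if conf >= min_conf else 'Rejected'
        print(f"{set(s)} -> {set(l - s)}: {conf:.0%} {verdict}")
```

The output matches the verdicts on the next two slides: three rules are selected, three rejected.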
64. Step 5: Generating Association Rules from Frequent Itemsets (Cont.)
- Let the minimum confidence threshold be, say, 70%.
- The resulting association rules are shown below, each listed with its confidence.
- R1: I1 ∧ I2 → I5
  - Confidence = sc{I1, I2, I5} / sc{I1, I2} = 2/4 = 50%
  - R1 is rejected.
- R2: I1 ∧ I5 → I2
  - Confidence = sc{I1, I2, I5} / sc{I1, I5} = 2/2 = 100%
  - R2 is selected.
- R3: I2 ∧ I5 → I1
  - Confidence = sc{I1, I2, I5} / sc{I2, I5} = 2/2 = 100%
  - R3 is selected.
65. Step 5: Generating Association Rules from Frequent Itemsets (Cont.)
- R4: I1 → I2 ∧ I5
  - Confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%
  - R4 is rejected.
- R5: I2 → I1 ∧ I5
  - Confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%
  - R5 is rejected.
- R6: I5 → I1 ∧ I2
  - Confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%
  - R6 is selected.
- In this way, we have found three strong association rules.
66. Example: Simple vs. Fast Rule Generation
- Simple algorithm: from the large itemset ABCDE, every candidate rule is tested against minconf:
  ACDE → B    ABCE → D
  CDE → AB    BCE → AD    ABE → CD    ADE → BC
  ACD → BE    ACE → BD    ABC → DE
- Fast algorithm: from the same large itemset ABCDE, only
  ACDE → B    ABCE → D    ACE → BD
  need to be tested, since candidates with larger consequents are generated only from the consequents of rules that already passed the confidence test (see the sketch below).
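A sketch of the fast algorithm's core idea: the consequents of confident rules are merged with the same Apriori-style join used for itemsets, so only the merged consequents spawn further confidence tests. The confidence checks themselves are omitted here since the slide gives no support counts.

```python
def merge_consequents(confident_consequents):
    """Apriori-style join over the consequents of rules that already
    passed the confidence test, yielding the next level's candidates."""
    out = set()
    for a in confident_consequents:
        for b in confident_consequents:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                out.add(a + (b[-1],))
    return sorted(out)

# From the large itemset ABCDE, suppose only these 1-consequent rules
# were confident (as on the slide): ACDE -> B and ABCE -> D.
print(merge_consequents([('B',), ('D',)]))   # [('B', 'D')] -> test ACE -> BD
```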