Title: Mid 2 revision and Asssociation Rules II
1Mid 2 revision and Asssociation Rules (II)
Lecture 16
- Prof. Sin-Min Lee
- Department of Computer Science
2Example Data Instance
STUDENT
COURSE
Takes
PROFESSOR
Teaches
3Natural Join and Intersection
- Natural join special case of join where ? is
implicit attributes with same name must be
equal - STUDENT ? Takes STUDENT ?STUDENT.sid
Takes.sid Takes - Intersection as with set operations, derivable
from difference
A ? B
B-A
A-B
A B
4Division
- A somewhat messy operation that can be expressed
in terms of the operations we have already
defined - Used to express queries such as The fid's of
faculty who have taught all subjects - Paraphrased The fids of professors for which
there does not exist a subject that they havent
taught
5Division Using Our Existing Operators
- All possible teaching assignments Allpairs
- NotTaught, all (fid,subj) pairs for which
professor fid has not taught subj - Answer is all faculty not in NotTaught
?fid,subj (PROFESSORx ?subj(COURSE))
Allpairs - ?fid,subj(Teaches ? COURSE)
?fid(PROFESSOR) - ?fid(NotTaught)
?fid(PROFESSOR) - ?fid(
?fid,subj (PROFESSOR x ?subj(COURSE)) -
?fid,subj(Teaches ? COURSE))
6Division R1 R2
- Requirement schema(R1) ¾ schema(R2)
- Result schema schema(R1) schema(R2)
- Professors who have taught all courses
- What about Courses that have been taught by all
faculty?
?fid (?fid,subj(Teaches ? COURSE) ?subj(COURSE))
7Division
- Goal Produce the tuples in one relation, r, that
match all tuples in another relation, s - r (A1, An, B1, Bm)
- s (B1 Bm)
- r/s, with attributes A1, An, is the set of all
tuples ltagt such that for every tuple ltbgt in s,
lta,bgt is in r - Can be expressed in terms of projection, set
difference, and cross-product
8Division (contd)
9Division - Example
- List the Ids of students who have passed all
courses that were taught in spring 2000 - Numerator
- StudId and CrsCode for every course passed by
every student - ?StudId, CrsCode (?Grade? F (Transcript) )
- Denominator
- CrsCode of all courses taught in spring 2000
- ?CrsCode (?SemesterS2000 (Teaching) )
- Result is numerator/denominator
10Relational Calculus
- Important features
- Declarative formal query languages for relational
model - Based on the branch mathematical logic known as
predicate calculus - Two types of RC
- 1) tuple relational calculus
- 2) domain relational calculus
- A single statement can be used to perform a query
11Tuple Relational Calculus
- based on specifying a number of tuple variables
- a tuple variable refers to any tuple
12Generic Form
- t COND (t)
- where
- t is a tuple variable and
- COND(t) is Boolean expression involving t
13Simple example 1
- To find all employees whose salary is greater
than 50,000 - t EMPLOYEE(t) and t.Salarygt5000
- where
- EMPLOYEE(t) specifies the range of tuple variable
t - The above operation selects all the attributes
14Simple example 2
- To find only the names of employees whose salary
is greater than 50,000 - t.FNAME, t.NAME EMPLOYEE(t) and t.Salarygt5000
- The above is equivalent to
- SELECT T.FNAME, T.LNAME
- FROM EMPLOYEE T
- WHERE T.SALARY gt 5000
15Elements of a tuple calculus
- In general, we need to specify the following in a
tuple calculus expression - Range Relation (I.e, R(t)) FROM
- Selected combination WHERE
- Requested attributes SELECT
16 More ExampleQ0
- Retrieve the birthrate and address of the
employee(s) whose name is John B. Smith - t.BDATE, t.ADDRESS EMPLOYEE(t) AND
t.FNAMEJohn AND t.MINITB AND
t.LNAMESmith
17Formal Specification of tuple Relational Calculus
- A general format
- t1.A1, t2.A2,,tn.An COND ( t1 ,t2 ,, tn,
tn1, tn2,,tnm) - where
- t1,,tnm are tuple var
- Ai attribute?R(ti)
- COND (formula)
- Where COND corresponds to statement about the
world, which can be True or False
18Elements of formula
- A formula is made of Predicate Calculus atoms
- an atom of the from R(ti)
- ti.A op tj.B op?, lt,gt,..
- F1 And F2 where F1 and F2 are formulas
- F1 OR F2
- Not (F1)
- F(?t) (F) or F (?t) (F)
- ? Y friends (Y, John)
- ?X likes(X, ICE_CREAM)
-
19Example Queries Using the Existential Quantifier
- Retrieve the name and address of all employees
who work for the Research department - t.FNAME, t.LNAME, t.ADDRESS EMPLOYEE(t) AND (?
d) (DEPARTMENT (d) AND d.DNAMEResearch AND
d.DNUMBERt.DNO)
20More Example
- For every project located in Stafford, retrieve
the project number, the controlling department
number, and the last name, birthrate, and address
of the manger of that department.
21Cont.
- p.PNUMBER,p.DNUM,m.LNAME,m.BDATE,
m.ADDRESSPROJECT(p) and EMPLOYEE(M) and
P.PLOCATIONStafford and (? d) (DEPARTMENT(D)
AND P.DNUMd.DNUMBER and d.MGRSSNm.SSN))
22Logical Equivalences
- There are two logical equivalences that will be
heavily used - p?q ? ?p ? q (Whenever p is true, q must also
be true.) - ?x. p(x) ? ??x. ?p(x) (p is true for all x)
- The second can be a lot easier to check!
23Normalization
- Review on Keys
- superkey a set of attributes which will uniquely
identify each tuple in a relation - candidate key a minimal superkey
- primary key a chosen candidate key
- secondary key all the rest of candiate keys
- prime attribute an attribute that is a part of a
candidate key (key column) - nonprime attribute a nonkey column
24Normalization
- Functional Dependency Type by Keys
- whole (candidate) key ? nonprime attribute
full FD (no violation) - partial key ? nonprime attribute partial FD
(violation of 2NF) - nonprime attribute ? nonprime attribute
transitive FD (violation of 3NF) - not a whole key ? prime attribute violation of
BCNF
25Functional Dependencies
- Let R be a relation schema
- ? ? R and ? ? R
- The functional dependency
- ? ? ?holds on R iff for any legal relations
r(R), whenever two tuples t1 and t2 of r have
same values for ?, they have same values for ?.
- t1? t2 ? ? t1? t2 ?
- On this instance, A ? B does NOT hold, but B ? A
does hold.
A B
261. Closure
- Given a set of functional dependencies, F, its
closure, F , is all FDs that are implied by FDs
in F. - e.g. If A ? B, and B ? C,
- then clearly A ? C
27Armstrongs Axioms
- We can find F by applying Armstrongs Axioms
- if ? ? ?, then ? ? ?
(reflexivity) - if ? ? ?, then ? ? ? ? ?
(augmentation) - if ? ? ?, and ? ? ?, then ? ? ? (transitivity)
- These rules are
- sound (generate only functional dependencies that
actually hold) and - complete (generate all functional dependencies
that hold).
28Additional rules
- If ? ? ? and ? ? ?, then ? ? ? ? (union)
- If ? ? ? ?, then ? ? ? and ? ? ? (decomposition)
- If ? ? ? and ? ? ? ?, then ? ? ? ?
(pseudotransitivity) - The above rules can be inferred from Armstrongs
axioms.
29Example
- R (A, B, C, G, H, I)F A ? B A ? C CG
? H CG ? I B ? H - Some members of F
- A ? H
- by transitivity from A ? B and B ? H
- AG ? I
- by augmenting A ? C with G, to get AG ? CG
and then transitivity with CG ? I - CG ? HI
- by augmenting CG ? I to infer CG ? CGI,
- and augmenting of CG ? H to infer CGI ? HI,
- and then transitivity
302. Closure of an attribute set
- Given a set of attributes A and a set of FDs F,
closure of A under F is the set of all attributes
implied by A - In other words, the largest B such that
- A ? B
- Redefining super keys
- The closure of a super key is the entire
relation schema - Redefining candidate keys
- 1. It is a super key
- 2. No subset of it is a super key
31Computing the closure for A
- Simple algorithm
- 1. Start with B A.
- 2. Go over all functional dependencies, ? ? ? ,
in F - 3. If ? ? B, then
- Add ? to B
- 4. Repeat till B changes
32Example
- R (A, B, C, G, H, I)F A ? B A ? C CG
? H CG ? I B ? H - (AG) ?
- 1. result AG
- 2. result ABCG (A ? C and A ? B)
- 3. result ABCGH (CG ? H and CG ? AGBC)
- 4. result ABCGHI (CG ? I and CG ? AGBCH
- Is (AG) a candidate key ?
- 1. It is a super key.
- 2. (A) BC, (G) G.
- YES.
33Support
- Simplest question find sets of items that appear
frequently in the baskets. - Support for itemset I the number of baskets
containing all items in I. - Given a support threshold s, sets of items that
appear in gt s baskets are called frequent
itemsets.
34Example
- Itemsmilk, coke, pepsi, beer, juice.
- Support 3 baskets.
- B1 m, c, b B2 m, p, j B3 m, b
- B4 c, j B5 m, p, b B6 m,
c, b, j - B7 c, b, j B8 b, c
- Frequent itemsets m, c, b, j, m,
b, c, b, j, c.
35Association Rules
- Association rule R Itemset1 gt Itemset2
- Itemset1, 2 are disjoint and Itemset2 is
non-empty - meaning if transaction includes Itemset1 then
it also has Itemset2 - Examples
- A,B gt E,C
- A gt B,C
36Example
- B1 m, c, b B2 m, p, j
- B3 m, b B4 c, j
- B5 m, p, b B6 m, c, b, j
- B7 c, b, j B8 b, c
- An association rule m, b ? c.
- Confidence 2/4 50.
_ _
37(No Transcript)
38(No Transcript)
39(No Transcript)
40(No Transcript)
41From Frequent Itemsets to Association Rules
- Q Given frequent set A,B,E, what are possible
association rules? - A gt B, E
- A, B gt E
- A, E gt B
- B gt A, E
- B, E gt A
- E gt A, B
- __ gt A,B,E (empty rule), or true gt A,B,E
42Classification vs Association Rules
- Classification Rules
- Focus on one target field
- Specify class in all cases
- Measures Accuracy
- Association Rules
- Many target fields
- Applicable in some cases
- Measures Support, Confidence, Lift
43- Apriori Algorithm (1997)
- Let I a,b,c, be a set of all items in the
domain - Let T S S ? I be a bag of all transaction
records of item sets - Let support(S) ? A A ? T ? S ? A
- Let L1 a a ? I ? support(a) ?
minSupport - ?k (k gt 1 ? Lk-1 ? ?) Let
- Lk Si ? Sj (Si ? Lk-1) ? (Sj ? Lk-1) ?
- ( Si Sj 1 ) ? (
Sj Si 1) ? - ( ?S ((S ? Si ? Sj) ?
(S k-1)) ? S ? Lk-1 ) ? - ( support(Si ? Sj) ?
minSupport ) - The set of all frequent item sets is given by
- L ?Lk
-
- and the set of all association rules is
given by - R A ? C A ? ?(Lk) ? (C Lk A) ? (A ?
?) ? (C ? ?) - support(Lk) /
support(A) ? minConfidence
44Rule Support and Confidence
- Suppose R I gt J is an association rule
- sup (R) sup (I ? J) is the support count
- support of itemset I ? J (I or J)
- conf (R) sup(J) / sup(R) is the confidence of R
- fraction of transactions with I ? J that have J
- Association rules with minimum support and count
are sometimes called strong rules
45Association Rules Example
- Q Given frequent set A,B,E, what association
rules have minsup 2 and minconf 50 ? - A, B gt E conf2/4 50
- A, E gt B conf2/2 100
- B, E gt A conf2/2 100
- E gt A, B conf2/2 100
- Dont qualify
- A gtB, E conf2/6 33lt 50
- B gt A, E conf2/7 28 lt 50
- __ gt A,B,E conf 2/9 22 lt 50
-
46Find Strong Association Rules
- A rule has the parameters minsup and minconf
- sup(R) gt minsup and conf (R) gt minconf
- Problem
- Find all association rules with given minsup and
minconf - First, find all frequent itemsets
47Finding Frequent Itemsets
- Start by finding one-item sets (easy)
- Q How?
- A Simply count the frequencies of all items
48Finding itemsets next level
- Apriori algorithm (Agrawal Srikant)
- Idea use one-item sets to generate two-item
sets, two-item sets to generate three-item sets,
- If (A B) is a frequent item set, then (A) and (B)
have to be frequent item sets as well! - In general if X is frequent k-item set, then all
(k-1)-item subsets of X are also frequent - Compute k-item set by merging (k-1)-item sets
49(No Transcript)
50Finding Association Rules
- A typical question find all association rules
with support s and confidence c. - Note support of an association rule is the
support of the set of items it mentions. - Hard part finding the high-support (frequent )
itemsets. - Checking the confidence of association rules
involving those sets is relatively easy.
51Naïve Algorithm
- A simple way to find frequent pairs is
- Read file once, counting in main memory the
occurrences of each pair. - Expand each basket of n items into its n (n
-1)/2 pairs. - Fails if items-squared exceeds main memory.
52(No Transcript)
53Filter
Filter
Construct
Construct
C1
L1
C2
L2
C3
First pass
Second pass
54Agrawal, Srikant 94
Fast Algorithms for Mining Association Rules, by
Rakesh Agrawal and Ramakrishan Sikant, IBM
Almaden Research Center
55(No Transcript)
56(No Transcript)
57(No Transcript)
58(No Transcript)
59(No Transcript)
60(No Transcript)
61(No Transcript)
62(No Transcript)
63(No Transcript)
64C1
L1
Database
C2
L2
C2
C3
L3
C3
65(No Transcript)
66(No Transcript)
67(No Transcript)
68- Dynamic Programming Approach
- Want proof of principle of optimality and
overlapping subproblems - Principle of Optimality
- The optimal solution to Lk includes the
optimal solution of Lk-1 - Proof by contradiction
- Overlapping Subproblems
- Lemma of every subset of a frequent item set is a
frequent item set - Proof by contradiction
69(No Transcript)
70(No Transcript)
71(No Transcript)
72(No Transcript)
73(No Transcript)
74(No Transcript)
75(No Transcript)
76(No Transcript)
77(No Transcript)
78(No Transcript)
79(No Transcript)
80(No Transcript)
81The Apriori Algorithm Example
- Consider a database, D , consisting of 9
transactions. - Suppose min. support count required is 2 (i.e.
min_sup 2/9 22 ) - Let minimum confidence required is 70.
- We have to first find out the frequent itemset
using Apriori algorithm. - Then, Association rules will be generated using
min. support min. confidence.
82Step 1 Generating 1-itemset Frequent Pattern
Compare candidate support count with minimum
support count
Scan D for count of each candidate
C1
L1
- In the first iteration of the algorithm, each
item is a member of the set of candidate. - The set of frequent 1-itemsets, L1 , consists of
the candidate 1-itemsets satisfying minimum
support.
83Step 2 Generating 2-itemset Frequent Pattern
Generate C2 candidates from L1
Compare candidate support count with minimum
support count
Scan D for count of each candidate
L2
C2
C2
84Step 2 Generating 2-itemset Frequent Pattern
Cont.
- To discover the set of frequent 2-itemsets, L2 ,
the algorithm uses L1 Join L1 to generate a
candidate set of 2-itemsets, C2. - Next, the transactions in D are scanned and the
support count for each candidate itemset in C2 is
accumulated (as shown in the middle table). - The set of frequent 2-itemsets, L2 , is then
determined, consisting of those candidate
2-itemsets in C2 having minimum support. - Note We havent used Apriori Property yet.
85Step 3 Generating 3-itemset Frequent Pattern
Compare candidate support count with min support
count
Scan D for count of each candidate
Scan D for count of each candidate
C3
L3
C3
- The generation of the set of candidate
3-itemsets, C3 , involves use of the Apriori
Property. - In order to find C3, we compute L2 Join L2.
- C3 L2 Join L2 I1, I2, I3, I1, I2, I5,
I1, I3, I5, I2, I3, I4, I2, I3, I5, I2,
I4, I5. - Now, Join step is complete and Prune step will
be used to reduce the size of C3. Prune step
helps to avoid heavy computation due to large Ck.
86Step 3 Generating 3-itemset Frequent Pattern
Cont.
- Based on the Apriori property that all subsets of
a frequent itemset must also be frequent, we can
determine that four latter candidates cannot
possibly be frequent. How ? - For example , lets take I1, I2, I3. The 2-item
subsets of it are I1, I2, I1, I3 I2, I3.
Since all 2-item subsets of I1, I2, I3 are
members of L2, We will keep I1, I2, I3 in C3. - Lets take another example of I2, I3, I5 which
shows how the pruning is performed. The 2-item
subsets are I2, I3, I2, I5 I3,I5. - BUT, I3, I5 is not a member of L2 and hence it
is not frequent violating Apriori Property. Thus
We will have to remove I2, I3, I5 from C3. - Therefore, C3 I1, I2, I3, I1, I2, I5
after checking for all members of result of Join
operation for Pruning. - Now, the transactions in D are scanned in order
to determine L3, consisting of those candidates
3-itemsets in C3 having minimum support.
87Step 4 Generating 4-itemset Frequent Pattern
- The algorithm uses L3 Join L3 to generate a
candidate set of 4-itemsets, C4. Although the
join results in I1, I2, I3, I5, this itemset
is pruned since its subset I2, I3, I5 is not
frequent. - Thus, C4 f , and algorithm terminates, having
found all of the frequent items. This completes
our Apriori Algorithm. - Whats Next ?
- These frequent itemsets will be used to generate
strong association rules ( where strong
association rules satisfy both minimum support
minimum confidence).
88Step 5 Generating Association Rules from
Frequent Itemsets
- Procedure
- For each frequent itemset l, generate all
nonempty subsets of l. - For every nonempty subset s of l, output the rule
s ? (l-s) if - support_count(l) / support_count(s) gt min_conf
where min_conf is minimum confidence threshold. - Back To Example
- We had L I1, I2, I3, I4, I5,
I1,I2, I1,I3, I1,I5, I2,I3, I2,I4,
I2,I5, I1,I2,I3, I1,I2,I5. - Lets take l I1,I2,I5.
- Its all nonempty subsets are I1,I2, I1,I5,
I2,I5, I1, I2, I5.
89Step 5 Generating Association Rules from
Frequent Itemsets Cont.
- Let minimum confidence threshold is , say 70.
- The resulting association rules are shown below,
each listed with its confidence. - R1 I1 I2 ? I5
- Confidence scI1,I2,I5/scI1,I2 2/4 50
- R1 is Rejected.
- R2 I1 I5 ? I2
- Confidence scI1,I2,I5/scI1,I5 2/2 100
- R2 is Selected.
- R3 I2 I5 ? I1
- Confidence scI1,I2,I5/scI2,I5 2/2 100
- R3 is Selected.
90Step 5 Generating Association Rules from
Frequent Itemsets Cont.
- R4 I1 ? I2 I5
- Confidence scI1,I2,I5/scI1 2/6 33
- R4 is Rejected.
- R5 I2 ? I1 I5
- Confidence scI1,I2,I5/I2 2/7 29
- R5 is Rejected.
- R6 I5 ? I1 I2
- Confidence scI1,I2,I5/ I5 2/2 100
- R6 is Selected.
- In this way, We have found three strong
association rules.
91Example
Simple algorithm
ABCDE
Large itemset
ACDE?B
ABCE?D
Rules with minsup
CDE?AB
BCE?AD
ABE?CD
ADE?BC
ACD?BE
ACE?BD
ACE?BD
ABC?ED
ABCDE
Fast algorithm
ACDE?B
ABCE?D
ACE?BD
92(No Transcript)
93(No Transcript)
94(No Transcript)
95(No Transcript)
96(No Transcript)
97(No Transcript)
98(No Transcript)