Mid 2 revision and Asssociation Rules II - PowerPoint PPT Presentation

1 / 98

About This Presentation

Title:

Mid 2 revision and Asssociation Rules II

Description:

Natural Join and Intersection. Natural join: special case of join where is implicit attributes with same name ... Ai : attribute R(ti) COND (formula) ... – PowerPoint PPT presentation

Number of Views:32

Avg rating:3.0/5.0

Slides: 99

Provided by: Lee144

Category:

Tags: ai | asssociation | mid | revision | rules

more less

Transcript and Presenter's Notes

Title: Mid 2 revision and Asssociation Rules II

1
Mid 2 revision and Asssociation Rules (II)
Lecture 16

Prof. Sin-Min Lee
Department of Computer Science

2
Example Data Instance
STUDENT
COURSE
Takes
PROFESSOR
Teaches
3
Natural Join and Intersection

Natural join special case of join where ? is
implicit attributes with same name must be
equal
STUDENT ? Takes STUDENT ?STUDENT.sid
Takes.sid Takes
Intersection as with set operations, derivable
from difference

A ? B
B-A
A-B
A B
4
Division

A somewhat messy operation that can be expressed
in terms of the operations we have already
defined
Used to express queries such as The fid's of
faculty who have taught all subjects
Paraphrased The fids of professors for which
there does not exist a subject that they havent
taught

5
Division Using Our Existing Operators

All possible teaching assignments Allpairs
NotTaught, all (fid,subj) pairs for which
professor fid has not taught subj
Answer is all faculty not in NotTaught

?fid,subj (PROFESSORx ?subj(COURSE))
Allpairs - ?fid,subj(Teaches ? COURSE)
?fid(PROFESSOR) - ?fid(NotTaught)
?fid(PROFESSOR) - ?fid(
?fid,subj (PROFESSOR x ?subj(COURSE)) -
?fid,subj(Teaches ? COURSE))
6
Division R1 R2

Requirement schema(R1) ¾ schema(R2)
Result schema schema(R1) schema(R2)
Professors who have taught all courses
What about Courses that have been taught by all
faculty?

?fid (?fid,subj(Teaches ? COURSE) ?subj(COURSE))
7
Division

Goal Produce the tuples in one relation, r, that
match all tuples in another relation, s
r (A1, An, B1, Bm)
s (B1 Bm)
r/s, with attributes A1, An, is the set of all
tuples ltagt such that for every tuple ltbgt in s,
lta,bgt is in r
Can be expressed in terms of projection, set
difference, and cross-product

8
Division (contd)
9
Division - Example

List the Ids of students who have passed all
courses that were taught in spring 2000
Numerator
StudId and CrsCode for every course passed by
every student
?StudId, CrsCode (?Grade? F (Transcript) )
Denominator
CrsCode of all courses taught in spring 2000
?CrsCode (?SemesterS2000 (Teaching) )
Result is numerator/denominator

10
Relational Calculus

Important features
Declarative formal query languages for relational
model
Based on the branch mathematical logic known as
predicate calculus
Two types of RC
1) tuple relational calculus
2) domain relational calculus
A single statement can be used to perform a query

11
Tuple Relational Calculus

based on specifying a number of tuple variables
a tuple variable refers to any tuple

12
Generic Form

t COND (t)
where
t is a tuple variable and
COND(t) is Boolean expression involving t

13
Simple example 1

To find all employees whose salary is greater
than 50,000
t EMPLOYEE(t) and t.Salarygt5000
where
EMPLOYEE(t) specifies the range of tuple variable
t
The above operation selects all the attributes

14
Simple example 2

To find only the names of employees whose salary
is greater than 50,000
t.FNAME, t.NAME EMPLOYEE(t) and t.Salarygt5000
The above is equivalent to
SELECT T.FNAME, T.LNAME
FROM EMPLOYEE T
WHERE T.SALARY gt 5000

15
Elements of a tuple calculus

In general, we need to specify the following in a
tuple calculus expression
Range Relation (I.e, R(t)) FROM
Selected combination WHERE
Requested attributes SELECT

16
More ExampleQ0

Retrieve the birthrate and address of the
employee(s) whose name is John B. Smith
t.BDATE, t.ADDRESS EMPLOYEE(t) AND
t.FNAMEJohn AND t.MINITB AND
t.LNAMESmith

17
Formal Specification of tuple Relational Calculus

A general format
t1.A1, t2.A2,,tn.An COND ( t1 ,t2 ,, tn,
tn1, tn2,,tnm)
where
t1,,tnm are tuple var
Ai attribute?R(ti)
COND (formula)
Where COND corresponds to statement about the
world, which can be True or False

18
Elements of formula

A formula is made of Predicate Calculus atoms
an atom of the from R(ti)
ti.A op tj.B op?, lt,gt,..
F1 And F2 where F1 and F2 are formulas
F1 OR F2
Not (F1)
F(?t) (F) or F (?t) (F)
? Y friends (Y, John)
?X likes(X, ICE_CREAM)

19
Example Queries Using the Existential Quantifier

Retrieve the name and address of all employees
who work for the Research department
t.FNAME, t.LNAME, t.ADDRESS EMPLOYEE(t) AND (?
d) (DEPARTMENT (d) AND d.DNAMEResearch AND
d.DNUMBERt.DNO)

20
More Example

For every project located in Stafford, retrieve
the project number, the controlling department
number, and the last name, birthrate, and address
of the manger of that department.

21
Cont.

p.PNUMBER,p.DNUM,m.LNAME,m.BDATE,
m.ADDRESSPROJECT(p) and EMPLOYEE(M) and
P.PLOCATIONStafford and (? d) (DEPARTMENT(D)
AND P.DNUMd.DNUMBER and d.MGRSSNm.SSN))

22
Logical Equivalences

There are two logical equivalences that will be
heavily used
p?q ? ?p ? q (Whenever p is true, q must also
be true.)
?x. p(x) ? ??x. ?p(x) (p is true for all x)
The second can be a lot easier to check!

23
Normalization

Review on Keys
superkey a set of attributes which will uniquely
identify each tuple in a relation
candidate key a minimal superkey
primary key a chosen candidate key
secondary key all the rest of candiate keys
prime attribute an attribute that is a part of a
candidate key (key column)
nonprime attribute a nonkey column

24
Normalization

Functional Dependency Type by Keys
whole (candidate) key ? nonprime attribute
full FD (no violation)
partial key ? nonprime attribute partial FD
(violation of 2NF)
nonprime attribute ? nonprime attribute
transitive FD (violation of 3NF)
not a whole key ? prime attribute violation of
BCNF

25
Functional Dependencies

Let R be a relation schema
? ? R and ? ? R
The functional dependency
? ? ?holds on R iff for any legal relations
r(R), whenever two tuples t1 and t2 of r have
same values for ?, they have same values for ?.
t1? t2 ? ? t1? t2 ?
On this instance, A ? B does NOT hold, but B ? A
does hold.

A B

26
1. Closure

Given a set of functional dependencies, F, its
closure, F , is all FDs that are implied by FDs
in F.
e.g. If A ? B, and B ? C,
then clearly A ? C

27
Armstrongs Axioms

We can find F by applying Armstrongs Axioms
if ? ? ?, then ? ? ?
(reflexivity)
if ? ? ?, then ? ? ? ? ?
(augmentation)
if ? ? ?, and ? ? ?, then ? ? ? (transitivity)
These rules are
sound (generate only functional dependencies that
actually hold) and
complete (generate all functional dependencies
that hold).

28
Additional rules

If ? ? ? and ? ? ?, then ? ? ? ? (union)
If ? ? ? ?, then ? ? ? and ? ? ? (decomposition)
If ? ? ? and ? ? ? ?, then ? ? ? ?
(pseudotransitivity)
The above rules can be inferred from Armstrongs
axioms.

29
Example

R (A, B, C, G, H, I)F A ? B A ? C CG
? H CG ? I B ? H
Some members of F
A ? H
by transitivity from A ? B and B ? H
AG ? I
by augmenting A ? C with G, to get AG ? CG
and then transitivity with CG ? I
CG ? HI
by augmenting CG ? I to infer CG ? CGI,
and augmenting of CG ? H to infer CGI ? HI,
and then transitivity

30
2. Closure of an attribute set

Given a set of attributes A and a set of FDs F,
closure of A under F is the set of all attributes
implied by A
In other words, the largest B such that
A ? B
Redefining super keys
The closure of a super key is the entire
relation schema
Redefining candidate keys
1. It is a super key
2. No subset of it is a super key

31
Computing the closure for A

Simple algorithm
1. Start with B A.
2. Go over all functional dependencies, ? ? ? ,
in F
3. If ? ? B, then
Add ? to B
4. Repeat till B changes

32
Example

R (A, B, C, G, H, I)F A ? B A ? C CG
? H CG ? I B ? H
(AG) ?
1. result AG
2. result ABCG (A ? C and A ? B)
3. result ABCGH (CG ? H and CG ? AGBC)
4. result ABCGHI (CG ? I and CG ? AGBCH
Is (AG) a candidate key ?
1. It is a super key.
2. (A) BC, (G) G.
YES.

33
Support

Simplest question find sets of items that appear
frequently in the baskets.
Support for itemset I the number of baskets
containing all items in I.
Given a support threshold s, sets of items that
appear in gt s baskets are called frequent
itemsets.

34
Example

Itemsmilk, coke, pepsi, beer, juice.
Support 3 baskets.
B1 m, c, b B2 m, p, j B3 m, b
B4 c, j B5 m, p, b B6 m,
c, b, j
B7 c, b, j B8 b, c
Frequent itemsets m, c, b, j, m,
b, c, b, j, c.

35
Association Rules

Association rule R Itemset1 gt Itemset2
Itemset1, 2 are disjoint and Itemset2 is
non-empty
meaning if transaction includes Itemset1 then
it also has Itemset2
Examples
A,B gt E,C
A gt B,C

36
Example

B1 m, c, b B2 m, p, j
B3 m, b B4 c, j
B5 m, p, b B6 m, c, b, j
B7 c, b, j B8 b, c
An association rule m, b ? c.
Confidence 2/4 50.

_ _

37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
From Frequent Itemsets to Association Rules

Q Given frequent set A,B,E, what are possible
association rules?
A gt B, E
A, B gt E
A, E gt B
B gt A, E
B, E gt A
E gt A, B
__ gt A,B,E (empty rule), or true gt A,B,E

42
Classification vs Association Rules

Classification Rules
Focus on one target field
Specify class in all cases
Measures Accuracy

Association Rules
Many target fields
Applicable in some cases
Measures Support, Confidence, Lift

Apriori Algorithm (1997)
Let I a,b,c, be a set of all items in the
domain
Let T S S ? I be a bag of all transaction
records of item sets
Let support(S) ? A A ? T ? S ? A
Let L1 a a ? I ? support(a) ?
minSupport
?k (k gt 1 ? Lk-1 ? ?) Let
Lk Si ? Sj (Si ? Lk-1) ? (Sj ? Lk-1) ?
( Si Sj 1 ) ? (
Sj Si 1) ?
( ?S ((S ? Si ? Sj) ?
(S k-1)) ? S ? Lk-1 ) ?
( support(Si ? Sj) ?
minSupport )
The set of all frequent item sets is given by
L ?Lk
and the set of all association rules is
given by
R A ? C A ? ?(Lk) ? (C Lk A) ? (A ?
?) ? (C ? ?)
support(Lk) /
support(A) ? minConfidence

44
Rule Support and Confidence

Suppose R I gt J is an association rule
sup (R) sup (I ? J) is the support count
support of itemset I ? J (I or J)
conf (R) sup(J) / sup(R) is the confidence of R
fraction of transactions with I ? J that have J
Association rules with minimum support and count
are sometimes called strong rules

45
Association Rules Example

Q Given frequent set A,B,E, what association
rules have minsup 2 and minconf 50 ?
A, B gt E conf2/4 50
A, E gt B conf2/2 100
B, E gt A conf2/2 100
E gt A, B conf2/2 100
Dont qualify
A gtB, E conf2/6 33lt 50
B gt A, E conf2/7 28 lt 50
__ gt A,B,E conf 2/9 22 lt 50

46
Find Strong Association Rules

A rule has the parameters minsup and minconf
sup(R) gt minsup and conf (R) gt minconf
Problem
Find all association rules with given minsup and
minconf
First, find all frequent itemsets

47
Finding Frequent Itemsets

Start by finding one-item sets (easy)
Q How?
A Simply count the frequencies of all items

48
Finding itemsets next level

Apriori algorithm (Agrawal Srikant)
Idea use one-item sets to generate two-item
sets, two-item sets to generate three-item sets,
If (A B) is a frequent item set, then (A) and (B)
have to be frequent item sets as well!
In general if X is frequent k-item set, then all
(k-1)-item subsets of X are also frequent
Compute k-item set by merging (k-1)-item sets

49
(No Transcript)
50
Finding Association Rules

A typical question find all association rules
with support s and confidence c.
Note support of an association rule is the
support of the set of items it mentions.
Hard part finding the high-support (frequent )
itemsets.
Checking the confidence of association rules
involving those sets is relatively easy.

51
Naïve Algorithm

A simple way to find frequent pairs is
Read file once, counting in main memory the
occurrences of each pair.
Expand each basket of n items into its n (n
-1)/2 pairs.
Fails if items-squared exceeds main memory.

52
(No Transcript)
53
Filter
Filter
Construct
Construct
C1
L1
C2
L2
C3
First pass
Second pass
54
Agrawal, Srikant 94
Fast Algorithms for Mining Association Rules, by
Rakesh Agrawal and Ramakrishan Sikant, IBM
Almaden Research Center
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
(No Transcript)
59
(No Transcript)
60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
C1
L1
Database
C2
L2
C2
C3
L3
C3
65
(No Transcript)
66
(No Transcript)
67
(No Transcript)
68

Dynamic Programming Approach
Want proof of principle of optimality and
overlapping subproblems
Principle of Optimality
The optimal solution to Lk includes the
optimal solution of Lk-1
Proof by contradiction
Overlapping Subproblems
Lemma of every subset of a frequent item set is a
frequent item set
Proof by contradiction

69
(No Transcript)
70
(No Transcript)
71
(No Transcript)
72
(No Transcript)
73
(No Transcript)
74
(No Transcript)
75
(No Transcript)
76
(No Transcript)
77
(No Transcript)
78
(No Transcript)
79
(No Transcript)
80
(No Transcript)
81
The Apriori Algorithm Example

Consider a database, D , consisting of 9
transactions.
Suppose min. support count required is 2 (i.e.
min_sup 2/9 22 )
Let minimum confidence required is 70.
We have to first find out the frequent itemset
using Apriori algorithm.
Then, Association rules will be generated using
min. support min. confidence.

82
Step 1 Generating 1-itemset Frequent Pattern
Compare candidate support count with minimum
support count
Scan D for count of each candidate
C1
L1

In the first iteration of the algorithm, each
item is a member of the set of candidate.
The set of frequent 1-itemsets, L1 , consists of
the candidate 1-itemsets satisfying minimum
support.

83
Step 2 Generating 2-itemset Frequent Pattern
Generate C2 candidates from L1
Compare candidate support count with minimum
support count
Scan D for count of each candidate
L2
C2
C2
84
Step 2 Generating 2-itemset Frequent Pattern
Cont.

To discover the set of frequent 2-itemsets, L2 ,
the algorithm uses L1 Join L1 to generate a
candidate set of 2-itemsets, C2.
Next, the transactions in D are scanned and the
support count for each candidate itemset in C2 is
accumulated (as shown in the middle table).
The set of frequent 2-itemsets, L2 , is then
determined, consisting of those candidate
2-itemsets in C2 having minimum support.
Note We havent used Apriori Property yet.

85
Step 3 Generating 3-itemset Frequent Pattern
Compare candidate support count with min support
count
Scan D for count of each candidate
Scan D for count of each candidate
C3
L3
C3

The generation of the set of candidate
3-itemsets, C3 , involves use of the Apriori
Property.
In order to find C3, we compute L2 Join L2.
C3 L2 Join L2 I1, I2, I3, I1, I2, I5,
I1, I3, I5, I2, I3, I4, I2, I3, I5, I2,
I4, I5.
Now, Join step is complete and Prune step will
be used to reduce the size of C3. Prune step
helps to avoid heavy computation due to large Ck.

86
Step 3 Generating 3-itemset Frequent Pattern
Cont.

Based on the Apriori property that all subsets of
a frequent itemset must also be frequent, we can
determine that four latter candidates cannot
possibly be frequent. How ?
For example , lets take I1, I2, I3. The 2-item
subsets of it are I1, I2, I1, I3 I2, I3.
Since all 2-item subsets of I1, I2, I3 are
members of L2, We will keep I1, I2, I3 in C3.
Lets take another example of I2, I3, I5 which
shows how the pruning is performed. The 2-item
subsets are I2, I3, I2, I5 I3,I5.
BUT, I3, I5 is not a member of L2 and hence it
is not frequent violating Apriori Property. Thus
We will have to remove I2, I3, I5 from C3.
Therefore, C3 I1, I2, I3, I1, I2, I5
after checking for all members of result of Join
operation for Pruning.
Now, the transactions in D are scanned in order
to determine L3, consisting of those candidates
3-itemsets in C3 having minimum support.

87
Step 4 Generating 4-itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a
candidate set of 4-itemsets, C4. Although the
join results in I1, I2, I3, I5, this itemset
is pruned since its subset I2, I3, I5 is not
frequent.
Thus, C4 f , and algorithm terminates, having
found all of the frequent items. This completes
our Apriori Algorithm.
Whats Next ?
These frequent itemsets will be used to generate
strong association rules ( where strong
association rules satisfy both minimum support
minimum confidence).

88
Step 5 Generating Association Rules from
Frequent Itemsets

Procedure
For each frequent itemset l, generate all
nonempty subsets of l.
For every nonempty subset s of l, output the rule
s ? (l-s) if
support_count(l) / support_count(s) gt min_conf
where min_conf is minimum confidence threshold.
Back To Example
We had L I1, I2, I3, I4, I5,
I1,I2, I1,I3, I1,I5, I2,I3, I2,I4,
I2,I5, I1,I2,I3, I1,I2,I5.
Lets take l I1,I2,I5.
Its all nonempty subsets are I1,I2, I1,I5,
I2,I5, I1, I2, I5.

89
Step 5 Generating Association Rules from
Frequent Itemsets Cont.

Let minimum confidence threshold is , say 70.
The resulting association rules are shown below,
each listed with its confidence.
R1 I1 I2 ? I5
Confidence scI1,I2,I5/scI1,I2 2/4 50
R1 is Rejected.
R2 I1 I5 ? I2
Confidence scI1,I2,I5/scI1,I5 2/2 100
R2 is Selected.
R3 I2 I5 ? I1
Confidence scI1,I2,I5/scI2,I5 2/2 100
R3 is Selected.

90
Step 5 Generating Association Rules from
Frequent Itemsets Cont.

R4 I1 ? I2 I5
Confidence scI1,I2,I5/scI1 2/6 33
R4 is Rejected.
R5 I2 ? I1 I5
Confidence scI1,I2,I5/I2 2/7 29
R5 is Rejected.
R6 I5 ? I1 I2
Confidence scI1,I2,I5/ I5 2/2 100
R6 is Selected.
In this way, We have found three strong
association rules.

91
Example
Simple algorithm
ABCDE
Large itemset
ACDE?B
ABCE?D
Rules with minsup
CDE?AB
BCE?AD
ABE?CD
ADE?BC
ACD?BE
ACE?BD
ACE?BD
ABC?ED
ABCDE
Fast algorithm
ACDE?B
ABCE?D
ACE?BD
92
(No Transcript)
93
(No Transcript)
94
(No Transcript)
95
(No Transcript)
96
(No Transcript)
97
(No Transcript)
98
(No Transcript)

Write a Comment

User Comments (0)