Association Rules Mining with SQL

About This Presentation

Title:

Association Rules Mining with SQL

Description:

exec sql select allocSpace() into :blob from onerecord; ... Ck is converted into a BLOB and passed as an argument to the table function. ... – PowerPoint PPT presentation

Number of Views:52

Avg rating:3.0/5.0

Slides: 72

Provided by: kirstenru

Category:

more less

Transcript and Presenter's Notes

Title: Association Rules Mining with SQL

1
Association Rules Mining with SQL

Kirsten Nelson
Deepen Manek
November 24, 2003

2
Organization of Presentation

Overview Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL
Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical,
Hybrid
Integrating taxonomies
Mining sequential patterns

3
Early data mining applications

Most early mining systems were developed largely
on file systems, with specialized data structures
and buffer management strategies devised for each
All data was read into memory before beginning
computation
This limits the amount of data that can be mined

4
Advantage of SQL and RDBMS

Make use of database indexing and query
processing capabilities
More than a decade spent on making these systems
robust, portable, scalable, and concurrent
Exploit underlying SQL parallelization
For long-running algorithms, use checkpointing
and space management

5
Organization of Presentation

Overview Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL
Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical,
Hybrid
Integrating taxonomies
Mining sequential patterns

6
Use of Database in Data Mining

Loose coupling of application and data
How would you write an Apriori program?
Use SQL statements in an application
Use a cursor interface to read through records
sequentially for each pass
Still two major performance problems
Copying of record from database to memory
Process context switching for each record
retrieved

7
Organization of Presentation

Overview Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL
Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical,
Hybrid
Integrating taxonomies
Mining sequential patterns

8
Tightly-coupled applications

Push computations into the database system to
avoid performance degradation
Take advantage of user-defined functions (UDFs)
Does not require changes to database software
Two types of UDFs we will use
Ones that are executed only a few times,
regardless of the number of rows
Ones that are executed once for each selected row

9
Tight-coupling using UDFs

Procedure TightlyCoupledApriori()
begin
exec sql connect to database
exec sql select allocSpace() into blob from
onerecord
exec sql select from sales where GenL1(blob,
TID, ITEMID) 1
notDone true

10
Tight-coupling using UDFs

while notDone do
exec sql select aprioriGen(blob)
into blob from onerecord
exec sql select
from sales
where itemCount(blob, TID,
ITEMID)1
exec sql select GenLk(blob) into notDone from
onerecord

11
Tight-coupling using UDFs

exec sql select getResult(blob) into resultBlob
from onerecord
exec sql select deallocSpace(blob) from
onerecord
compute Answer using resultBlob
end

12
Organization of Presentation

Overview Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL
Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical,
Hybrid
Integrating taxonomies
Mining sequential patterns

13
Methodology

Comparison done with Association Rules against
IBM DB2
Only consider generation of frequent itemsets
using Apriori algorithm
Five alternatives considered
Loose-coupling through SQL cursor interface as
described earlier
UDF tight-coupling as described earlier
Stored-procedure to encapsulate mining algorithm
Cache-mine caching data and mining on the fly
SQL implementations to force processing in the
database
Consider two classes of implementations
SQL-92 four different implementations
SQL-OR (with object relational extensions) six
implementations

14
Architectural Options

Stored procedure
Apriori algorithm encapsulated as a stored
procedure
Implication runs in the same address space as
the DBMS
Mined results stored back into the DBMS.
Cache-mine
Variation of stored-procedure
Read entire data once from DBMS, temporarily
cache data in a lookaside buffer on a local disk
Cached data is discarded when execution completes
Disadvantage requires additional disk space for
caching
Use Intelligent Miners space option

15
Organization of Presentation

Overview Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL
Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical,
Hybrid
Integrating taxonomies
Mining sequential patterns

16
Terminology

Use the following terminology
T table of items
tid,item pairs
Data is normally sorted by transaction id
Ck candidate k-itemsets
Obtained from joining and pruning frequent
itemsets from previous iteration
Fk frequent items sets of length k
Obtained from Ck and T

17
Candidate Generation in SQL join step

Generate Ck from Fk-1 by joining Fk-1 with itself
insert into Ck select I1.item1,,I1.itemk-1,I2.ite
mk-1
from Fk-1 I1,Fk-1 I2
where I1.item1 I2.item1 and
I1.itemk-2 I2.itemk-2 and
I1.itemk-1 lt I2.itemk-1

18
Candidate Generation Example

F3 is 1,2,3,1,2,4,1,3,4,1,3,5,2,3,4
C4 is 1,2,3,4,1,3,4,5

Table F3 (I1)
Table F3 (I2)
item1 item2 item3
1 2 3
1 2 4
1 3 4
1 3 5
2 3 4
item1 item2 item3
1 2 3
1 2 4
1 3 4
1 3 5
2 3 4
19
Pruning

Modify candidate generation algorithm to ensure
all k subsets of Ck of length (k-1) are in Fk-1
Do a k-way join, skipping itemn-2 when joining
with the nth table (2ltnk)
Create primary index (item1, , itemk-1) on Fk-1
to efficiently process k-way join
For k4, this becomes
insert into C4 select I1.item1, I1.item2,
I1.item3,I2.item3 from F3 I1,F3 I2,
F3 I3, F3 I4 where I1.item1 I2.item1 and
I1.item3 lt I2.item3 and
I1.item2 I3.item1 and I1.item3 I3.item2 and
I2.item3 I3.item3 and
I1.item1 I4.item1 and I1.item3 I4.item2 and
I2.item3 I4.item3

20
Pruning Example

Evaluate join with I3 using previous example
C4 is 1,2,3,4

Table F3 (I1)
Table F3 (I2)
Table F3 (I3)
item1 item2 item3
1 2 3
1 2 4
1 3 4
1 3 5
2 3 4
item1 item2 item3
1 2 3
1 2 4
1 3 4
1 3 5
2 3 4
item1 item2 item3
1 2 3
1 2 4
1 3 4
1 3 5
2 3 4
21
Support counting using SQL

Two different approaches
Use the SQL-92 standard
Use standard SQL syntax such as joins and
subqueries to find support of itemsets
Use object-relational extensions of SQL (SQL-OR)
User Defined Functions (UDFs) table functions
Binary Large Objects (BLOBs)

22
Support Counting using SQL-92

4 different methods, two of which detailed in the
papers
K-way Joins
SubQuery
Other methods not discussed because of
unacceptable performance
3-way join
2 Group-Bys

23
SQL-92 K-way join

Obtain Fk by joining Ck with table T of
(tid,item)
Perform group by on the itemset
insert into Fk select item1,,itemk,count()
from Ck, T t1, , T tk,
where t1.item Ck.item1, , and
tk.item Ck.itemk and
t1.tid t2.tid and
tk-1.tid tk.tid
group by item1,,itemk
having count() gt minsup

24
K-way join example

C3B,C,E and minimum support required is 2
Insert into F3 B,C,E,2

25
K-way join Pass-2 optimization

When calculating C2, no pruning is required after
we join F1 with itself
Dont calculate and materialize C2- replace C2 in
2-way join algorithm with join of F1 with itself
insert into F2 select I1.item1, I2.item1,count()
from F1 I1, F1 I2, T t1, T t2
where I1.item1 lt I2.item1 and
t1.item I1.item1 and t2.item I2.item1 and
t1.tid t2.tid
group by I1.item1,I2.item1
having count() gt minsup

26
SQL-92 SubQuery based

Split support counting into cascade of k
subqueries
nth subquery Qn finds all tids that match the
distinct itemsets formed by the first n items of
Ck
insert into Fk select item1, , itemk, count()
from (Subquery Qk) t
Group by item1, item2 , itemk having count() gt
minsup
Subquery Qn (for any n between 1 and k)
select item1, , itemn, tid
from T tn, (Subquery Qn-1) as rn-1
(select distinct item1, , itemn from CK) as dn
where rn-1.item1 dn.item1 and and
rn-1.itemn-1 dn.itemn
and rn-1.tid tn.tid and tn.item dn.itemn

27
Example of SubQuery based

Using previous example from class
C3 B,C,E, minimum support 2
Q0 No subquery Q0
Q1 in this case becomes
select item1, tid
From T t1,
(select distinct item1from C3) as d1
where t1.item d1.item1

28
Example of SubQuery based cntd

Q2 becomes
select item1, item2, tid from T t2, (Subquery Q1)
as r1,
(select distinct item1, item2 from C3) as d2
where r1.item1 d2.item1 and r1.tid t2.tid and
t2.item d2.item2

29
Example of SubQuery based cntd

Q3 becomes
select item1,item2,item3, tid from T t3,
(Subquery Q2) as r2,
(select distinct item1,item2,item3 from C3) as d3
where r2.item1 d3.item1 and r2.item2 d3.item2
and
r2.tid t3.tid and t3.item d3.item3

30
Example of SubQuery based cntd

Output of Q3 is
Insert statement becomes
insert into F3 select item1, item2, item3,
count()
from (Subquery Q3) t
group by item1, item2 ,item3 having count() gt
minsup
Insert the row B,C,E,2
For Q2, pass-2 optimization can be used

31
Performance Comparisons of SQL-92 approaches

Used Version 5 of DB2 UDB and RS/6000 Model 140
200 Mhz CPU, 256 MB main memory, 9 GB of disk
space, Transfer rate of 8 MB/sec
Used 4 different item sets based on real-world
data
Built the following indexes, which are not
included in any cost calculations
Composite index (item1, , itemk) on Ck
k different indices on each of the k items in Ck
(item,tid) and (tid,item) indexes on the data
table T

32
Performance Comparisons of SQL-92 approaches

Best performance obtained by SubQuery approach
SubQuery was only comparable to loose-coupling in
some cases, failing to complete in other cases
DataSet C, for support of 2, SubQuery
outperforms loose-coupling but decreasing support
to 1, SubQuery takes 10 times as long to
complete
Lower support will increase the size of Ck and Fk
at each step, causing the join to process more
rows

33
Support Counting using SQL with
object-relational extensions

6 different methods, four of which detailed in
the papers
GatherJoin
GatherCount
GatherPrune
Vertical
Other methods not discussed because of
unacceptable performance
Horizontal
SBF

34
SQL Object-Relational Extension GatherJoin

Generates all possible k-item combinations of
items contained in a transaction and joins them
with Ck
An index is created on all items of Ck
Uses the following table functions
Gather Outputs records tid,item-list, with
item-list being a BLOB or VARCHAR containing all
items associated with the tid
Comb-K returns all k-item combinations from the
transaction
Output has k attributes T_itm1, , T_itmk

35
GatherJoin

insert into Fk select item1,, itemk, count()
from Ck,
(select t2.T_itm1,,t2.itmk from T,
table(Gather(T.tid,T.item)) as t1,
table(Comb-K(t1.tid,t1.item-list)) as t2)
where t2.T_itm1 Ck.item1 and and
t2.T_itmk Ck.itemk
group by Ck.item1,,Ck.itemk
having count() gt minsup

36
Example of GatherJoin

t1 (output from Gather) looks like
t2 (generated by Comb-K from t1) will be joined
with C3 to obtain F3
1 row from Tid 10
1 row from Tid 20
4 rows from Tid 30
Insert B,C,E,2

37
GatherJoin Pass 2 optimization

When calculating C2, no pruning is required after
we join F1 with itself
Dont calculate and materialize C2 - replace C2
with a join to F1 before the table function
Gather is only passed frequent 1-itemset rows
insert into F2 select I1.item1, I2.item1,
count() from F1 I1,
(select t2.T_itm1,t2.T_itm2 from T,
table(Gather(T.tid,T.item)) as t1,
table(Comb-K(t1.tid,t1.item-list)) as t2 where
T.item I1.item1)
group by t2.T_itm1,t2.T_itm2
having count() gt minsup

38
Variations of GatherJoin - GatherCount

Perform the GROUP BY inside the table function
Comb-K for pass 2 optimization
Output of the table function Comb-K
Not the candidate frequent itemsets (Ck)
But the actual frequent itemsets (Fk) along with
the corresponding support
Use a 2-dimensional array to store possible
frequent itemsets in Comb-K
May lead to excessive memory use

39
Variations of GatherJoin - GatherPrune

Push the join with Ck into the table function
Comb-K
Ck is converted into a BLOB and passed as an
argument to the table function.
Will have to pass the BLOB for each invocation of
Comb-K - of rows in table T

40
SQL Object-Relational Extension Vertical

For each item, create a BLOB containing the tids
the item belongs to
Use function Gather to generate item,tid-list
pairs, storing results in table TidTable
Tid-list are all in the same sorted order
Use function Intersect to compare two different
tid-lists and extract common values
Pass-2 optimization can be used for Vertical
Similar to K-way join method
Upcoming example does not show optimization

41
Vertical

insert into Fk select item1, , itemk,
count(tid-list) as cnt
from (Subquery Qk) t where cnt gt minsup
Subquery Qn (for any n between 2 and k)
Select item1, , itemn,
Intersect(rn-1.tid-list, tn.tid-list) as tid-list
from TidTable tn, (Subquery Qn-1) as rn-1
(select distinct item1, , itemn from CK) as dn
where rn-1.item1 dn.item1 and and
rn-1.itemn-1 dn.itemn-1 and
tn.item dn.itemn
Subquery Q1 (select from TidTable)

42
Example of Vertical

Using previous example from class
C3 B,C,E, minimum support 2
Q1 is TidTable

43
Example of Vertical cntd

Q2 becomes
Select item1, item2, Intersect(r1.tid-list,
t2.tid-list) as tid-list
from TidTable t2, (Subquery Q1) as r1
(select distinct item1, item2 from C3) as d2
where r1.item1 d2.item1 and t2.item d2.item2

44
Example of Vertical cntd

Q3 becomes
select item1, item2, item3, intersect(r2.tid-list,
t3.tid-list) as tid-list
from TidTable t3, (Subquery Q2) as r2
(select distinct item1, item2, item3 from C3) as
d3
where r2.item1 d3.item1 and r2.item2 d3.item2
and
t3.item d3.item3

45
Performance Comparisons using SQL-OR
46
Performance Comparisons using SQL-OR
47
Performance comparison of SQL object-relational
approaches

Vertical has best overall performance, sometimes
an order of magnitude better than other 3
approaches
Majority of time is transforming the data in
item,tid-list pairs
Vertical spends too much time on the second pass
Pass-2 optimization has huge impact on
performance of GatherJoin
For Dataset-B with support of 0.1 , running time
for Pass 2 went from 5.2 hours to 10 minutes
Comb-K in GatherJoin generates large number of
potential frequent itemsets we must work with

48
Hybrid approach

Previous charts and algorithm analysis show
Vertical spends too much time on pass 2 compared
to other algorithms, especially when the support
is decreased
GatherJoin degrades when the of frequent items
per transaction increases
To improve performance, use a hybrid algorithm
Use Vertical for most cases
When size of candidate itemset is too large,
GatherJoin is a good option if number of frequent
items per transaction (Nf) is not too large
When Nf is large, GatherCount may be the only
good option

49
Architecture Comparisons

Compare five alternatives
Loose-Coupling, Stored-procedure
Basically the same except for address space
program is being run in
Because of limited difference in performance,
focus solely on stored procedure in following
charts
Cache-Mine
UDF tight-coupling
Best SQL approach (Hybrid)

50
Performance Comparisons of Architectures
51
Performance Comparisons of Architectures cntd
52
Performance Comparisons of Architectures cntd

Cache-Mine is the best or close to the best
performance in all cases
Factor of 0.8 to 2 times faster than SQL approach
Stored procedure is the worst
Difference between Cache-Mine directly related to
the number of passes through the data
Passes increase when the support goes down
May need to make multiple passes if all
candidates cannot fit in memory
UDF time per pass decreases 30-50 compared to
stored procedure because of tighter coupling with
DB

53
Performance Comparisons of Architectures cntd

SQL approach comes in second in performance to
Cache-Mine
Somewhat better than Cache-Mine for high support
values
1.8 3 times better than Stored-procedure/loose-c
oupling approach, getting better when support
value decreases
Cost of converting to Vertical format is less
than cost of converting to binary format in
Cache-Mine
For second pass through data, SQL approach takes
much more time than Cache-Mine, particularly when
we decrease the support

54
Organization of Presentation

Overview Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL
Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical,
Hybrid
Integrating taxonomies
Mining sequential patterns

55
Taxonomies - example
Beverages
Snacks
Soft Drinks
Alcoholic Drinks
Pretzels
Chocolate Bar
Parent Child
Beverages Soft Drinks
Beverages Alcoholic Drinks
Soft Drinks Pepsi
Soft Drinks Coke
Alcoholic Drinks Beer
Snacks Pretzels
Snacks Chocolate Bar
Pepsi
Coke
Beer
Example rule Soft Drinks ? Pretzels with 30
confidence, 2 support
56
Taxonomy augmentation

Algorithms similar to previous slides
Requires two additions to algorithm
Pruning itemsets containing an item and its
ancestor
Pre-computing the ancestors for each item
Will also consider support counting

57
Pruning items and ancestors

In the second pass we will join F1 with F1 to
give C2
This will give, for example
beverages,pepsi
snacks,coke
pretzels,chocolate bar
But beverages,pepsi is redundant!

58
Pruning items and ancestors

The following modification to the SQL statement
eliminates such redundant combinations from being
selected
insert into C2 (select I1.item1, I2.item1 from F1
I1, F1 I2
where I1.item1 lt I2.item1) except
(select ancestor, descendant from Ancestor union
select descendant, ancestor from Ancestor)

59
Pre-computing ancestors

An ancestor table is created
Format (ancestor, descendant)
Use the transitive closure operation
insert into Ancestor with R-Tax (ancestor,
descendant) as
(select parent, child from Tax union all
select p.ancestor, c.child from R-Tax p, Tax c
where p.descendant c.parent)
select ancestor, descendant from R-Tax

60
Support Counting

Extensions to handle taxonomies
Straightforward, but
Non-trivial
Need an extended transaction table
For example, if we have coke, pretzels
We add also soft drinks, pretzels, beverages,
pretzels, coke, snacks, soft drinks, snacks,
beverages, snacks

61
Extended transaction table

Can be obtained by the following SQL
Query to generate T
select item, tid from T union
select distinct A.ancestor as item, T.tid
from T, Ancestor A
where A.descendant T.item
The select distinct clause gets rid of items
with common ancestor e.g. dont want
beverages, beverages being added twice from
pepsi, coke

62
Pipelining of Query

No need to actually build T
Make following modification to SQL

insert into Fk with T(tid, item) as (Query for
T) select item1,,itemk,count() from Ck, T t1,
, T tk, where t1.item Ck.item1, ,
and tk.item Ck.itemk and t1.tid t2.tid and
tk-1.tid tk.tid group by item1,,itemk having
count() gt minsup
63
Organization of Presentation

Overview Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL
Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical,
Hybrid
Integrating taxonomies
Mining sequential patterns

64
Sequential patterns

Similar to papers covered on Nov 17
Input is sequences of transactions
E.g. ((computer,modem),(printer))
Similar to association rules, but dealing with
sequences as opposed to sets
Can also specify maximum and minimum time gaps,
as well as sliding time windows
Max-gap, min-gap, window-size

65
Input and output formats

Input has three columns
Sequence identifier (sid)
Transaction time (time)
Idem identifier (item)
Output format is a collection of frequent
sequences, in a fixed-width table
(item1, eno1,,itemk, enok, len)
For smaller lengths, extra column values are set
to NULL

66
GSP algorithm

Similar to algorithms shown earlier
Each Ck has transactions and times, but no length
has fixed length of k
Candidates are generated in two steps
Join join Fk-1 with itself
Sequence s1 joins with s2 if the subsequence
obtained by dropping the first item of s1 is the
same as the one obtained by dropping the last
item of s2
When generating C2, we need to generate sequences
where both of the items appear as a single
element as well as two separate elements
Prune
All candidate sequences that have a non-frequent
contiguous (k-1) subsequence are deleted

67
GSP Join SQL

insert into Ck
select I1.item1, I1.eno1, ... , I1.itemk-1,
I1.enok-1,
I2.itemkk-1, I1.enok-1 I2.enok-1 I2.enok-2
from Fk-1 I1, Fk-1 I2
where I1.item2 I2.item1 and ... and
I1.itemk-1 I2.itemk-2 and
I1.eno3-I1.eno2 I2.eno2 I2.eno1 and ... and
I1.enok-1 I1.enok-2 I2.enok-2 I2.enok-3

68
GSP Prune SQL

Write as a k-way join, similar to before
There are at most k contiguous subsequences of
length (k-1) for which Fk-1 needs to be checked
for membership
Note that all (k-1) subsequences may not be
contiguous because of the max-gap constraint
between consecutive elements.

69
GSP Support Counting

In each pass, we use the candidate table Ck and
the input data-sequences table D to count the
support
K-way join
We use select distinct before the group by to
ensure that only distinct data-sequences are
counted
We have additional predicates between sequence
numbers to handle the special time elements

70
GSP Support Counting SQL

(Ck.enoj Ck.enoi and abs(dj.time di.time)
window-size) or (Ck.enoj Ck.enoi 1 and
dj.time di.time max-gap and dj.time di.time gt
min-gap) or (Ck.enoj gt Ck.enoi 1)

71
References

Developing Tightly-Coupled Data Mining
Applications on a Relational Database System
Rakesh Agrawal, Kyuseok Shim, 1996
Integrating Association Rule Mining with
Relational Database Systems Alternatives and
Implications
Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal,
1998
Refers to 1) above
Mining Generalized Association Rules and
Sequential Patterns Using SQL Queries
Shiby Thomas, Sunita Sarawagi, 1998
Refers to 1) and 2) above