Title: Association Rule Mining III
1Association Rule Mining III
- COMP 790-90 Seminar
- BCB 713 Module
- Spring 2009
2Mining Various Kinds of Rules or Regularities
- Multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity
- Classification, clustering, iceberg cubes, etc.
3Multiple-level Association Rules
- Items often form a hierarchy
- Flexible support settings: items at the lower level are expected to have lower support.
- Transaction database can be encoded based on dimensions and levels
- Explore shared multi-level mining
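A minimal sketch of level-wise support counting with a lower threshold at the lower level. The item encoding (paths such as "milk/2%"), the transactions, and the thresholds are all hypothetical and chosen only for illustration; the slides do not prescribe a particular encoding.

```python
from collections import Counter

# Hypothetical transactions; each item is encoded as "category/subtype".
transactions = [
    {"milk/2%", "bread/wheat"},
    {"milk/skim", "bread/white"},
    {"milk/2%", "bread/white"},
    {"milk/2%"},
]

def level_supports(transactions, level):
    """Count transactions containing each item generalized to `level` path components."""
    counts = Counter()
    for t in transactions:
        # A set, so that a generalized item is counted once per transaction.
        generalized = {"/".join(item.split("/")[:level]) for item in t}
        counts.update(generalized)
    return counts

def frequent_by_level(transactions, min_sup_per_level):
    """Apply a separate (assumed) minimum support count at each hierarchy level."""
    result = {}
    for level, min_sup in enumerate(min_sup_per_level, start=1):
        counts = level_supports(transactions, level)
        result[level] = {name: n for name, n in counts.items() if n >= min_sup}
    return result

# Level 1 requires 3 transactions, level 2 only 2.
frequent = frequent_by_level(transactions, [3, 2])
print(frequent)
```

With a uniform threshold of 3 at both levels, "bread/white" (support 2) would be lost even though its parent "bread" is frequent, which is the motivation for reduced support at lower levels.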
4Multi-dimensional Association Rules
- Single-dimensional rules
- buys(X, milk) ⇒ buys(X, bread)
- MD rules: ≥ 2 dimensions or predicates
- Inter-dimension assoc. rules (no repeated predicates)
- age(X, 19-25) ∧ occupation(X, student) ⇒ buys(X, coke)
- Hybrid-dimension assoc. rules (repeated predicates)
- age(X, 19-25) ∧ buys(X, popcorn) ⇒ buys(X, coke)
- Categorical attributes: finite number of possible values, no order among values
- Quantitative attributes: numeric, implicit order among values
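As an illustration, the support and confidence of an inter-dimension rule can be computed over a relational table in which each dimension is a column. The rows below are made-up data, and `support_confidence` is a hypothetical helper, not something defined on the slides.

```python
# Hypothetical customer table: one row per customer, one column per dimension.
rows = [
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "19-25", "occupation": "student", "buys": "juice"},
    {"age": "26-35", "occupation": "engineer", "buys": "coke"},
]

def matches(row, predicates):
    """True if the row satisfies every (dimension, value) predicate."""
    return all(row[dim] == value for dim, value in predicates.items())

def support_confidence(rows, antecedent, consequent):
    """Support and confidence of antecedent => consequent.

    Works for inter-dimension rules only, since each dimension
    appears at most once across the two predicate dictionaries.
    """
    n_antecedent = sum(matches(r, antecedent) for r in rows)
    n_both = sum(matches(r, {**antecedent, **consequent}) for r in rows)
    support = n_both / len(rows)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

sup, conf = support_confidence(
    rows,
    antecedent={"age": "19-25", "occupation": "student"},
    consequent={"buys": "coke"},
)
print(sup, conf)
```

A hybrid-dimension rule repeats a predicate such as buys, so it would need a set-valued column rather than this one-value-per-dimension layout.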
5Quantitative/Weighted Association Rules
Numeric attributes are dynamically discretized to maximize the confidence or compactness of the rules.
2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
Cluster adjacent association rules to form general rules using a 2-D grid.
Example: age(X, 33-34) ∧ income(X, 30K - 50K) ⇒ buys(X, high resolution TV)
(Figure: 2-D grid with Age and Income axes)
6Mining Distance-based Association Rules
- Binning methods do not capture semantics of
interval data
- Distance-based partitioning
- Density/number of points in an interval
- Closeness of points in an interval
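To make the contrast concrete, here is a small sketch comparing equal-width binning with a simple distance-based grouping. The price values, the number of bins, and the `max_gap` threshold are assumptions for illustration; the gap-based grouping is one simple stand-in for distance-based partitioning, not necessarily the method the slides have in mind.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        index = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum value
        bins[index].append(v)
    return bins

def distance_based_groups(values, max_gap):
    """Start a new group whenever the gap to the previous value exceeds max_gap."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups

prices = [7, 20, 22, 50, 51, 53]  # hypothetical interval data
by_width = equal_width_bins(prices, 3)
by_distance = distance_based_groups(prices, 3)
print(by_width)
print(by_distance)
```

On this data the equal-width bins put 7 together with 20 and 22 and leave one bin empty, while the distance-based groups keep close points together.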
7Constraint-based Frequent-pattern Mining
- Why constraint-based mining?
- Anti-monotonicity, monotonicity, and succinctness
- Mining frequent patterns with convertible
constraints
8Constraint-based Data Mining
- Find all the patterns in a database autonomously?
- The patterns could be too many but not focused!
- Data mining should be interactive
- User directs what to be mined
- Constraint-based mining
- User flexibility: provides constraints on what to be mined
- System optimization: push constraints for efficient mining
9Constraints in Data Mining
- Knowledge type constraint
- classification, association, etc.
- Data constraint: using SQL-like queries
- find product pairs sold together in stores in New York
- Dimension/level constraint
- in relevance to region, price, brand, customer category
- Rule (or pattern) constraint
- small sales (price < 10) triggers big sales (sum > 200)
- Interestingness constraint
- strong rules: support and confidence
10Constrained Frequent Pattern Mining Query
Optimization
- Mining frequent patterns with constraint C
- Sound: only find patterns satisfying the constraint C
- Complete: find all patterns satisfying the constraint C
- A naïve solution
- Constraint test as a post-processing step
- More efficient approaches
- Analyze the properties of constraints comprehensively
- Push constraints as deeply as possible inside the frequent pattern mining
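The difference between post-processing and pushing a constraint can be sketched with a small level-wise itemset miner. Everything here is an assumption for illustration: the transactions, the prices, and the constraint sum(price) <= 8, which is anti-monotone because prices are non-negative. The miner uses a simplified join without the usual subset-based pruning.

```python
from itertools import combinations

def mine(transactions, min_sup, constraint=None):
    """Level-wise frequent itemset mining.

    If `constraint` is given it is assumed to be anti-monotone and is
    applied to candidates before their support is counted.
    Returns (frequent itemsets with supports, number of candidates counted).
    """
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    result, counted, k = {}, 0, 1
    while level:
        if constraint is not None:
            level = [c for c in level if constraint(c)]  # pushed constraint
        counted += len(level)
        frequent = {}
        for c in level:
            sup = sum(1 for t in transactions if c <= t)
            if sup >= min_sup:
                frequent[c] = sup
        result.update(frequent)
        k += 1
        # Join: unions of two frequent itemsets that have exactly k items.
        level = list({a | b for a, b in combinations(frequent, 2) if len(a | b) == k})
    return result, counted

# Hypothetical data.
price = {"a": 2, "b": 3, "c": 6, "d": 9}
transactions = [set("abcd"), set("abc"), set("abd"), set("bcd")]

def cheap(itemset):
    return sum(price[i] for i in itemset) <= 8

# Naive: mine everything, then filter.
all_patterns, n_post = mine(transactions, 2)
post = {p for p in all_patterns if cheap(p)}

# Pushed: filter candidates inside the mining loop.
pushed, n_pushed = mine(transactions, 2, constraint=cheap)
print(n_post, n_pushed, set(pushed) == post)
```

Both runs return the same five itemsets, so the pushed version stays sound and complete, but it counts support for 5 candidates instead of 15.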
11A General Picture of Constraints
12Classification of Constraints
- Monotone
- Anti-monotone
- Strongly convertible
- Succinct
- Convertible anti-monotone
- Convertible monotone
- Inconvertible
13Sequential Pattern Mining
- Why sequential pattern mining?
- GSP algorithm
- FreeSpan and PrefixSpan
- Border Collapsing
- Constraints and extensions
14Sequence Databases and Sequential Pattern Analysis
- (Temporal) order is important in many situations
- Time-series databases and sequence databases
- Frequent patterns → (frequent) sequential patterns
- Applications of sequential pattern mining
- Customer shopping sequences
- First buy computer, then CD-ROM, and then digital camera, within 3 months.
- Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, telephone calling patterns, Weblog click streams, DNA sequences and gene structures
15What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set
of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
A sequence database
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
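The subsequence test and support count can be sketched as follows. The four-sequence database is an assumption: the slide's table is not available in this text, and these sequences were chosen because they include the two sequences quoted above and give support 2 for <(ab)c>. The helper names `parse`, `is_subsequence`, and `support` are hypothetical.

```python
def parse(text):
    """Parse a string such as 'a(abc)d' into a list of frozensets, one per element."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq

def is_subsequence(sub, seq):
    """Greedy left-to-right match: each element of `sub` must be a subset
    of some element of `seq`, with the matched elements in order."""
    pos = 0
    for element in sub:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(pattern, db):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)

# Assumed example database (not taken from the slide text).
db = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]

print(is_subsequence(parse("a(bc)dc"), parse("a(abc)(ac)d(cf)")))
print(support(parse("(ab)c"), db))
```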
16Challenges on Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
- Find the complete set of patterns satisfying the minimum support (frequency) threshold
- Be highly efficient, scalable, involving only a small number of database scans
- Be able to incorporate various kinds of user-specific constraints
17A Basic Property of Sequential Patterns: Apriori
- A basic property: Apriori (Agrawal & Srikant '94)
- If a sequence S is not frequent
- Then none of the super-sequences of S is frequent
- E.g., <hb> is infrequent ⇒ so are <hab> and <(ah)b>
Given support threshold min_sup = 2
18Basic Algorithm Breadth First Search (GSP)
- L = 1
- While (Result_L != NULL)
- Candidate Generate
- Prune
- Test
- L = L + 1
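The generate, prune, and test loop can be sketched as a simplified breadth-first miner. This is not the full GSP candidate join: as an assumption made to keep the code short, candidates are produced by extending each frequent pattern with one item, either as a new element or inside the last element. The example database is also assumed, not taken from the slides.

```python
def parse(text):
    """Parse 'a(abc)d' into a tuple of frozensets, one per element."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return tuple(seq)

def contains(seq, pattern):
    """True if `pattern` is a subsequence of `seq` (greedy matching)."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def drop_one_item(pattern):
    """Yield every subsequence obtained by deleting exactly one item."""
    for i, element in enumerate(pattern):
        for item in element:
            rest = element - {item}
            yield pattern[:i] + ((rest,) if rest else ()) + pattern[i + 1:]

def extend(pattern, items):
    """Yield one-item extensions of a pattern."""
    for x in items:
        yield pattern + (frozenset([x]),)               # x as a new element
        if x > max(pattern[-1]):
            yield pattern[:-1] + (pattern[-1] | {x},)   # x joins the last element

def level_wise(db, min_sup):
    items = sorted({x for seq in db for el in seq for x in el})
    result, prev, length = {}, None, 1
    while True:
        # Candidate generate
        if length == 1:
            candidates = {(frozenset([x]),) for x in items}
        else:
            candidates = {c for p in prev for c in extend(p, items)}
            # Prune: every one-item-shorter subsequence must be frequent (Apriori)
            candidates = {c for c in candidates
                          if all(s in prev for s in drop_one_item(c))}
        # Test: one pass over the database per level
        current = {}
        for c in candidates:
            sup = sum(contains(seq, c) for seq in db)
            if sup >= min_sup:
                current[c] = sup
        if not current:
            return result
        result.update(current)
        prev, length = current, length + 1

# Assumed example database.
db = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]
patterns = level_wise(db, min_sup=2)
print(patterns[parse("(ab)c")])
```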
19Finding Length-1 Sequential Patterns
- Initial candidates: all singleton sequences
- <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan database once, count support for candidates
min_sup = 2
20The Mining Process
min_sup = 2
21Generating Length-2 Candidates
51 length-2 candidates
Without Apriori property, 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
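The candidate counts can be checked with a few lines of arithmetic. The figure of six frequent length-1 patterns is inferred here from the count of 51 (6*6 + 6*5/2) and the eight singleton candidates listed earlier; the function name is hypothetical.

```python
def length2_candidates(n_items):
    """Number of length-2 candidate sequences over n_items items."""
    two_elements = n_items * n_items            # <xy>, including <xx>; order matters
    one_element = n_items * (n_items - 1) // 2  # <(xy)>, x != y; order does not matter
    return two_elements + one_element

with_apriori = length2_candidates(6)     # only frequent items are combined
without_apriori = length2_candidates(8)  # all eight items are combined
pruned_percent = round(100 * (without_apriori - with_apriori) / without_apriori, 2)
print(with_apriori, without_apriori, pruned_percent)
```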
22Generating Length-4 Candidates
23Pattern Growth (prefixSpan)
- Prefix and Suffix (Projection)
- <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
- Given sequence <a(abc)(ac)d(cf)>
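Projection of a single sequence can be sketched as below. The representation (each element stored as a sorted string) and the helper names `parse`, `project`, and `show` are assumptions for illustration. A leading (_x) in the result marks items left over in the element where the last item of the prefix was matched.

```python
def parse(text):
    """Parse 'a(abc)d' into a list of sorted strings, one per element."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append("".join(sorted(text[i + 1:j])))
            i = j + 1
        else:
            seq.append(text[i])
            i += 1
    return seq

def project(seq, prefix):
    """Return (partial, rest) for the leftmost match of prefix, or None.

    partial: items of the last matched element that come after the
             last item of the prefix (shown as '(_...)').
    rest:    the elements following the last matched element.
    """
    pos = 0
    for element in prefix:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
        pos += 1
    last_matched = seq[pos - 1]
    partial = "".join(x for x in last_matched if x > max(prefix[-1]))
    return partial, seq[pos:]

def show(partial, rest):
    parts = ["(_%s)" % partial] if partial else []
    parts += [el if len(el) == 1 else "(%s)" % el for el in rest]
    return "<" + "".join(parts) + ">"

seq = parse("a(abc)(ac)d(cf)")
for p in ["a", "aa", "a(ab)"]:
    print(p, show(*project(seq, parse(p))))
```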
24Example
An Example (min_sup = 2)
25PrefixSpan (the example to be continued)
Step 1: Find length-1 sequential patterns
<a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3 (pattern:support)
Step 2: Divide the search space into six subsets according to the six prefixes
Step 3: Find subsets of sequential patterns by constructing the corresponding projected databases and mining each recursively.
26Example to be continued
27Example
- Find sequential patterns having prefix <a>
- Scan sequence database S once. Sequences in S containing <a> are projected w.r.t. <a> to form the <a>-projected database.
- Scan the <a>-projected database once, get six length-2 sequential patterns having prefix <a>
- <a>:2, <b>:4, <(_b)>:2, <c>:4, <d>:2, <f>:2
- <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2
- Recursively, all sequential patterns having prefix <a> can be further partitioned into 6 subsets. Construct respective projected databases and mine each.
- e.g. the <aa>-projected database has two sequences
- <(_bc)(ac)d(cf)> and <(_e)>.
28PrefixSpan Algorithm
Main idea: use frequent prefixes to divide the search space and to project sequence databases; only search the relevant sequences.
PrefixSpan(α, i, S|α)
- Scan S|α once, find the set of frequent items b such that
- b can be assembled to the last element of α to form a sequential pattern, or
- <b> can be appended to α to form a sequential pattern.
- For each frequent item b, append it to α to form a sequential pattern α', and output α'
- For each α', construct the α'-projected database S|α', and call PrefixSpan(α', i+1, S|α').
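The recursion above can be sketched in Python as follows. This is a minimal, unoptimized version under stated assumptions: each element is a sorted string, a projected sequence is a pair (partial, rest), and the four-sequence database is assumed because it reproduces the counts quoted in the example. Names such as `grow` and `fmt` are hypothetical.

```python
def parse(text):
    """Parse 'a(abc)d' into a list of sorted strings, one per element."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append("".join(sorted(text[i + 1:j])))
            i = j + 1
        else:
            seq.append(text[i])
            i += 1
    return seq

def fmt(pattern):
    return "<" + "".join(e if len(e) == 1 else "(%s)" % e for e in pattern) + ">"

def grow(partial, rest, kind, x, last):
    """Project one (partial, rest) entry on item x; None if x does not occur.

    kind 'i': x is assembled into the last element of the prefix.
    kind 's': <x> is appended to the prefix as a new element.
    """
    if kind == "i" and x in partial:
        return partial[partial.index(x) + 1:], rest
    need = (last | {x}) if kind == "i" else {x}
    for j, el in enumerate(rest):
        if need <= set(el):
            return el[el.index(x) + 1:], rest[j + 1:]
    return None

def prefixspan(db, min_sup):
    results = {}

    def mine(prefix, projected):
        last = set(prefix[-1]) if prefix else set()
        counts = {}
        for partial, rest in projected:       # one scan of the projected database
            seen = {("i", x) for x in partial}
            for el in rest:
                for x in el:
                    seen.add(("s", x))
                    if last and last <= set(el) and x > max(last):
                        seen.add(("i", x))
            for key in seen:                  # count each extension once per sequence
                counts[key] = counts.get(key, 0) + 1
        for (kind, x), sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            if kind == "s":
                new_prefix = prefix + [x]
            else:
                new_prefix = prefix[:-1] + [prefix[-1] + x]
            results[fmt(new_prefix)] = sup
            new_projected = []
            for partial, rest in projected:
                entry = grow(partial, rest, kind, x, last)
                if entry is not None:
                    new_projected.append(entry)
            mine(new_prefix, new_projected)

    mine([], [("", seq) for seq in db])
    return results

# Assumed example database.
db = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]
patterns = prefixspan(db, min_sup=2)
print({p: s for p, s in patterns.items() if len(p) == 3})
```

On this database the length-1 patterns and the length-2 patterns with prefix <a> agree with the supports listed in the example. Each recursive call scans only its own projected database, which shrinks as the prefix grows.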