Title: Frequent Pattern Queries with Constraints (Language + Algorithms)
1. Frequent Pattern Queries with Constraints (Language + Algorithms)
Francesco Bonchi, Fosca Giannotti, Dino Pedreschi
Pisa KDD Laboratory, http://www-kdd.isti.cnr.it/
Workshop on Inductive Databases and Constraint Based Mining, 12/03/04
2. Frequent Pattern Queries: Language and Optimizations (Ph.D. Thesis)
- Supervisors: Dr. Fosca Giannotti, Prof. Dino Pedreschi
- International Reviewers: Prof. Jean-Francois Boulicaut, Prof. Jiawei Han
- Part 1: Data Mining Query Language
  - Frequent Pattern Queries (FPQs): definition
  - A language for FPQs
- Part 2: Optimizations
  - Algorithms for FPQs (pushing monotone constraints)
- Part 3: Conclusion
  - Optimized operational semantics for FPQs (putting together Part 1 and Part 2)
3. Plan of the Talk
- Language for Frequent Pattern Queries
- Algorithms for Constrained Frequent Pattern Mining
  - Adaptive Constraint Pushing [Bonchi et al. PKDD03]
  - ExAnte preprocessing [Bonchi et al. PKDD03]
- Further exploiting the ExAnte property
  - ExAMiner (breadth-first) [Bonchi et al. ICDM03]
  - FP-bonsai (depth-first) [Bonchi and Goethals PAKDD04]
- Ongoing and future work: the P3D project
4. An Interesting Feature of All Our Algorithms
- They provide the exact support of all solution itemsets.
- This feature distinguishes our algorithms from those presented by Luc (this morning) and by Johannes (this afternoon).
- In some sense, they solve different problems.
- Analogy with classical frequent itemset mining without constraints:
  - our algorithms ↔ frequent itemset mining (Apriori)
  - their algorithms ↔ maximal frequent itemset mining (MaxMiner)
5. Language for Frequent Pattern Queries
6. Our Objective
- To study Frequent Pattern Query optimizations in a Logic-based Knowledge Discovery Support Environment:
  - a flexible knowledge discovery system, capable of obtaining, maintaining, representing and using both induced and deduced knowledge in a unified framework.
- The need for such a system is suggested by real-world mining applications [Bonchi et al. KDD99, Giannotti et al. DMKD99].
- A Deductive Database can easily represent both extensional and intensional data.
- Previous works by our group have shown that this capability makes it viable for a suitable representation of domain knowledge and for supporting the various steps of the KDD process.
- LKDSE = Deductive Database + a set of inductive primitives.
- LKDSE = Inductive Database where the DB component is a Deductive DB.
7. Logic-based Knowledge Discovery Support Environment
[Architecture diagram: source data (relational extensions) and background knowledge feed the MINING step, specified by inductive rules; deductive rules drive PREPROCESSING (background knowledge integration) and POSTPROCESSING (reasoning on extracted knowledge); extracted knowledge flows back into the knowledge base (closure principle!).]
8. Logic-based Knowledge Discovery Support Environment
- The main issue for a deductive approach:
  - how to choose a suitable representation for the inductive part? (In other words: how to define inductive mining queries?)
- Manco's Ph.D. thesis:
  - inductive queries = user-defined aggregates
- LDL-Mine:
  - LDL (deductive database language and system) + user-defined aggregates + external calls to implement mining primitives
- Main drawback: the atomicity of the aggregate gives no optimization opportunity [Boulicaut and De Raedt PKDD02 Tutorial].
- Example: constraint-pushing techniques in frequent itemset mining would require the analyst to create her own aggregates for any different situation (conjunction of constraints).
9. Our Vision: Declarative Mining
- The analyst must have a high-level vision of the knowledge discovery system, without worrying about the details of the computational engine, in the very same way a database designer does not have to worry about query optimization.
- She just needs to declaratively specify in the inductive query what the desired patterns should look like and which conditions they must satisfy (a set of constraints).
- It is up to the query optimizer to compose all constraints and to produce the most efficient mining strategy (≈ execution plan) for the given inductive query.
10. Inductive Rule
- INDUCTIVE RULE: a conjunction of sentences about the desired patterns.
- H ← B1, ..., Bn
- H is a relation representing the induced pattern.
- The sentences B1, ..., Bn are taken from a restricted class of sentences.
- The set of all allowed sentences is just some "syntactic sugar" on top of an algorithm (the inductive engine).
- Each sentence can be defined over some relations.
- Having a well-defined and restricted set of admitted sentences allows us to write highly optimized algorithms to compute inductive rules with any conjunction of sentences.
- In particular, we focus on Frequent Pattern Queries.
11. Why Frequent Pattern Queries?
- The frequency of a pattern is the most important interestingness measure.
- Frequency has the right granularity to be a primitive in a DMQL:
  - it is a simple low-level concept
  - its computation is complex and time-consuming
  - many different kinds of patterns are based on frequency:
    - frequent itemsets
    - frequent sequences (or sequential patterns)
    - frequent episodes
    - frequent substructures in graph data
  - it can be used to define a whole class of data mining tasks:
    - Association Rules, Correlation Rules, Causality Rules, Ratio Rules, ...
    - Iceberg Queries and Iceberg Rules, Partial Periodicity, Emerging Patterns, ...
    - Classification, Clustering
12. Research Path Followed
- Goal: to define a language for FPQs expressive enough to capture most interesting inductive queries, yet simple enough to be highly optimized.
- FPQ Definition → identification of all the basic components of an FPQ.
- Syntax → syntactic sugar to express all the basic components of an FPQ.
- Safety → not all inductive rules derivable from the provided grammar are meaningful.
- Formal Semantics → by showing that there exists a unique mapping from each safe FPQ (inductive rule) of our language to a Datalog program (set of deductive rules) with user-defined aggregates (Manco's framework). Thanks to this mapping we can define the formal declarative semantics of an inductive rule as the iterated stable model of the corresponding Datalog program.
- Expressiveness → by means of a suite of examples of interesting complex queries.
13. Inductive Query Example
Compute simple association rules, having exactly 2 items in the head and at least 3 items in the body, creating transactions by grouping tuples by day and customer, having support greater than 1000 and confidence greater than 0.4, and spending at least 50 in toys (total sum of the prices of the items of type toy involved in the rule).

Inductive rule:
interesting_set(Set, Sup, Card, T) ←
    Sup = freq(Set, X), X = ⟨I, ⟨D, C⟩⟩, sales(D, C, I, Q),
    Sup ≥ 1000, Card = card(Set),
    J ⊆ Set, T = sum(P, product(J, P, toy)).

Deductive (LDL) rule:
interesting_rules(L, R, Sup, Conf) ←
    interesting_set(Set, Sup, Card, T), Card ≥ 5, T ≥ 50,
    interesting_set(R, S1, 2, T1), subset(R, Set),
    difference(Set, R, L), Conf = Sup / S1, Conf ≥ 0.4.
14. Algorithms for Constrained Frequent Pattern Mining
15. Why Constraints?
- Frequent pattern mining usually produces too many solution patterns. This situation is harmful for two reasons:
  - Performance: mining is usually inefficient or, often, simply unfeasible.
  - Identification: finding the fragments of interesting knowledge, blurred within a huge quantity of small, mostly useless patterns, is a hard task.
- Constraints are the solution to both these problems:
  - they can be pushed into the frequent pattern computation, exploiting them to prune the search space, thus reducing time and resource requirements;
  - they give the user guidance over the mining process and a way of focusing on the interesting knowledge.
- With constraints we obtain fewer, more interesting patterns. Indeed, constraints are the way we define what is interesting.
16. Constrained Frequent Itemset Mining Problem
- Notation:
  - We denote the frequency constraint by Cfreq, without explicitly indicating the dataset and the min_sup.
  - Given a constraint C, let Th(C) = {X | C(X)} denote the set of all itemsets X that satisfy C.
- The frequent itemset mining problem requires computing Th(Cfreq).
- The constrained frequent itemset mining problem requires computing Th(Cfreq) ∩ Th(C).
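To make the problem statement concrete, here is a naive generate-and-test rendering in Python (a specification sketch under our own names: tdb is a list of transactions represented as Python sets, c is any Boolean constraint on itemsets; none of this is from the original slides):

    from itertools import combinations

    def support(itemset, tdb):
        # Number of transactions containing every item of the itemset.
        return sum(1 for t in tdb if itemset <= t)

    def th_freq_and_c(tdb, min_sup, c):
        # Naive Th(Cfreq) intersect Th(C): enumerate every itemset over
        # the item alphabet and test both constraints. Exponential: a
        # specification of the problem, not a usable algorithm.
        items = sorted(set().union(*tdb))
        solutions = {}
        for k in range(1, len(items) + 1):
            for xs in combinations(items, k):
                x = frozenset(xs)
                sup = support(x, tdb)
                if sup >= min_sup and c(x):
                    solutions[x] = sup  # exact support (cf. slide 4)
        return solutions

For instance, th_freq_and_c(tdb, 4, lambda x: sum(price[i] for i in x) >= 45) would match the setting of slide 40, with price a hypothetical item-price table.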
17. Problem Definition: Anti-monotone Constraints
- Frequency is an anti-monotone constraint:
  - "Apriori trick": if an itemset X does not satisfy Cfreq, then no superset of X can satisfy Cfreq.
- Other examples of anti-monotone constraints:
  - sum(X.prices) ≤ 20 euro
  - |X| ≤ 5
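In code, the Apriori trick is a subset check against the previous level (a minimal sketch, not any particular implementation):

    from itertools import combinations

    def prune_by_antimonotonicity(candidates, previous_level):
        # A k-itemset can satisfy an anti-monotone constraint (e.g.
        # frequency) only if all of its (k-1)-subsets did;
        # previous_level holds the (k-1)-itemsets that survived.
        kept = []
        for x in candidates:
            if all(frozenset(s) in previous_level
                   for s in combinations(x, len(x) - 1)):
                kept.append(x)
        return kept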
18. Problem Definition: Monotone Constraints
- A constraint CM is monotone when, if an itemset X satisfies CM, then every superset of X satisfies CM as well (e.g. sum(X.prices) ≥ 45, as in the ExAnte example later).
19. Our Problem
To compute the itemsets which satisfy a conjunction of anti-monotone and monotone constraints.
- Why monotone constraints?
  - They are the most useful for discovering local high-value patterns (for instance, very expensive or very large itemsets, which can be found only with a very small min_sup).
  - We have known how to exploit the other kinds of constraints (anti-monotone, succinct) since '98 [Ng et al. SIGMOD98], while for monotone constraints the situation is more complex.
20. Search Space Characterization
21. Problem Characterization
22. Tradeoff between AM and M Pruning
23. The Two Extremes: Priority to AM
- Strategy GT (Generate and Test):
  - Apriori level-wise computation, followed by the test of CM.
  - Maximum possible anti-monotone pruning; no monotone pruning.
24. The Two Extremes: Priority to M
- Strategy MCP (Monotone Constraint Pushing):
  - First finds B(CM), the monotone border. Then only itemsets over the border are generated as candidates to be tested for frequency.
  - Candidate generation function generate_over [Boulicaut and Jeudy 2000]: add one item to each solution of the previous level.
  - Anti-monotone pruning is only partially possible.
  - Maximum possible monotone pruning; little anti-monotone pruning.
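A minimal sketch of the generate_over idea described above (our own rendering, not Boulicaut and Jeudy's code):

    def generate_over(prev_solutions, items):
        # Candidates over the monotone border: each solution of the
        # previous level, extended by one item not already in it.
        candidates = set()
        for x in prev_solutions:
            for i in items:
                if i not in x:
                    candidates.add(x | frozenset([i]))
        return candidates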
25. Strategies Analysis (w.r.t. Frequency Tests Performed)
[Figure: the search space split into regions (solutions; not prunable; AM prunable; M prunable; AM and M prunable) around the borders B-(Cfreq) and B(CM), comparing the frequency tests performed by GT, MCP and an ideal strategy.]
Neither of the two strategies outperforms the other on every input problem.
26. Adaptive Constraint Pushing
27. Adaptive Constraint Pushing
- ACP explores a portion of Th(¬CM) looking for infrequent itemsets (the negative border of frequency).
- Each infrequent itemset found (region 5) will help in AM-pruning the search space in Th(CM) (in particular, region 3).
- Each infrequent itemset lost (region 5) will induce the exploration of a portion of region 6.
- Each itemset found frequent (region 4) is just a useless frequency test performed.
- Two questions:
  - Which candidates form a good portion? Those itemsets which have a higher probability of being found infrequent.
  - How large should this portion be? This is what the adaptivity of ACP is about.
28. Adaptivity Parameter
- We introduce a parameter α ∈ [0,1] which represents the fraction of candidates under the monotone border to be chosen among all possible candidates under the monotone border.
- It is initialized after the first scan of TDB using all the available information, for example:
  - the number of transactions in TDB
  - the total number of 1-itemsets (singleton items)
  - the number of frequent 1-itemsets
  - the number of 1-itemsets satisfying the monotone constraint
  - etc.
- It is updated level by level with the newly collected knowledge.
- By updating α, ACP adapts its behaviour to the given input TDB and constraints.
- Extreme cases:
  - if α = 0 constantly, then ACP ≡ MCP
  - if α = 1 constantly, then ACP ≡ GT
29. The Algorithm
- Notation at iteration k (k-itemsets):
  - Pk: set of itemsets whose proper subsets do not satisfy CM and have not been found infrequent
  - Bk: subset of Pk containing the itemsets which satisfy CM (positive monotone border)
  - Ek: Pk \ Bk
  - CkO: candidates over the monotone border
  - CkU: candidates under the monotone border
  - Rk: solutions
  - Nk: itemsets under the monotone border found infrequent
  - N: union of all Nk
30. The Algorithm
[Flowchart, in the notation of slide 29: Rk-1 feeds generate_over to produce CkO; Pk is split by the test "does it satisfy CM?" into Bk and Ek; the α-selection draws CkU from Ek; frequent candidates flow into the solutions Rk, infrequent ones from CkU into N; generate_apriori builds Pk+1.]
31. The Algorithm
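The pseudocode that filled this slide is not in the text export. Below is a hedged reconstruction of the level-wise skeleton from the notation of slides 29-30, reusing the support and generate_over sketches above; the α update rule, the support estimate and all helper names are our own assumptions, not the original code:

    def acp(tdb, min_sup, c_m, items, alpha=0.5):
        # Adaptive Constraint Pushing, level-wise skeleton (a sketch).
        tdb = [frozenset(t) for t in tdb]
        sup1 = {i: support(frozenset([i]), tdb) for i in items}
        est = lambda x: min(sup1[i] for i in x)       # crude estimate
        r, n = {}, set()                              # solutions / N
        r_prev = set()
        p_k = {frozenset([i]) for i in items}         # P1: singletons
        while p_k or r_prev:
            b_k = {x for x in p_k if c_m(x)}          # Bk: satisfy CM
            e_k = p_k - b_k                           # Ek = Pk \ Bk
            c_over = {x for x in generate_over(r_prev, items) | b_k
                      if not any(s <= x for s in n)}  # CkO, AM-pruned by N
            ranked = sorted(e_k, key=est)             # alpha-selection
            c_under = set(ranked[:int(alpha * len(ranked))])   # CkU
            sups = {x: support(x, tdb) for x in c_over | c_under}
            r_k = {x: s for x, s in sups.items()
                   if s >= min_sup and c_m(x)}        # Rk: solutions
            n_k = {x for x in c_under if sups[x] < min_sup}    # Nk
            n |= n_k
            r.update(r_k)
            if c_under:                               # adapt alpha (slide 32)
                focus = len(n_k) / len(c_under)       # alpha-focus
                alpha = min(1.0, alpha + 0.1) if focus > 0.9 \
                        else max(0.0, alpha - 0.1)
            p_k = {x | frozenset([i])                 # Pk+1: extend survivors
                   for x in e_k - n_k for i in items if i not in x}
            r_prev = set(r_k)
        return r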
32. α-Selection and Adaptivity
- How are candidates selected by α?
  - Among all itemsets in Ek, the α-portion with the lowest estimated support is selected to enter CkU.
  - Support is estimated using the real support values of the singleton items belonging to the itemset, balancing between complete independence (product of the values) and maximal correlation (minimum value).
- How does α adapt itself?
  - According to the performance of the α-selection at the previous iteration:
  - α-focus = |Nk| / |CkU|
  - An α-focus very close to 1 → very good selection → probably α is selecting too few candidates → it risks losing some infrequent itemsets → the α value is raised accordingly.
  - A low α-focus → poor selection → α is selecting too many candidates → the α value is reduced accordingly.
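The estimate described in the first bullet can be written as a simple interpolation between the two extremes; this is a sketch under our own naming (estimate_support, lam), not the authors' formula:

    def estimate_support(itemset, rel_sup, n_transactions, lam=0.5):
        # Interpolate between complete independence (product of the
        # singletons' relative supports) and maximal correlation (their
        # minimum); lam is a hypothetical balancing knob in [0,1].
        sups = [rel_sup[i] for i in itemset]
        independent = 1.0
        for s in sups:
            independent *= s
        return n_transactions * (lam * independent + (1 - lam) * min(sups))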
33. Experimental Results
34. ExAnte (a Preprocessing Algorithm)
35. AM vs. M
- State of the art before ExAnte: when dealing with a conjunction of AM and M constraints, we face a tradeoff between AM and M pruning.
- The tradeoff: pushing M constraints into the computation can help prune the search space, but at the same time can reduce the AM pruning opportunities.
- Our observation: this is true only if we focus exclusively on the search space of itemsets. Reasoning on both the search space and the input TDB together, we can find the real synergy of AM and M pruning.
- The real synergy: do not exploit M constraints directly to prune the search space, but use them to reduce the data, which in turn induces a much stronger pruning of the search space.
- The real synergy of AM and M pruning lies in data reduction.
36. ExAnte: μ-Reduction
- Definition (μ-reduction): given a transaction database TDB and a monotone constraint CM, we define the μ-reduction of TDB as the dataset resulting from pruning the transactions that do not satisfy CM.
- Example: CM ≡ sum(X.price) ≥ 55
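A minimal sketch of μ-reduction (our own naming; transactions are Python sets of items and c_m is a Boolean predicate on itemsets):

    def mu_reduce(tdb, c_m):
        # mu-reduction: drop every transaction that does not satisfy the
        # monotone constraint CM; such a transaction can support no
        # solution itemset (see the ExAnte property on the next slide).
        return [t for t in tdb if c_m(t)]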
37. ExAnte Property
- If a transaction does not satisfy CM then, by monotonicity, none of its subsets does either; hence the transaction supports no solution itemset, and μ-reducing TDB leaves the support of every solution unchanged.
38. ExAnte: α-Reduction
- α-reducing a transaction means deleting from the transaction the infrequent singleton items (or, more generally, the singleton items which do not satisfy a given anti-monotone constraint).
- α-reducing a database of transactions means α-reducing all the transactions in the database.
- α-reducing a database is correct, i.e. it does not change the support of solution itemsets (trivial by anti-monotonicity).
39. A Fix-Point Computation
[Diagram: a cycle over TDB. α-reduction yields shorter transactions → fewer transactions satisfy CM, so μ-reduction leaves fewer transactions in TDB → fewer frequent 1-itemsets → further α-reduction → ... until a fix-point is reached.]
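Putting the two reductions together, ExAnte preprocessing is a short fix-point loop over the mu_reduce and alpha_reduce sketches above:

    def exante(tdb, min_sup, c_m):
        # ExAnte preprocessing: alternate mu-reduction and alpha-reduction
        # until nothing changes; by the ExAnte property the reduced TDB
        # preserves every solution itemset with its exact support.
        tdb = [frozenset(t) for t in tdb]
        while True:
            reduced = alpha_reduce(mu_reduce(tdb, c_m), min_sup)
            reduced = [t for t in reduced if t]   # drop emptied transactions
            if reduced == tdb:                    # fix-point reached
                return tdb
            tdb = reduced

For the example of the next slide one would call exante(tdb, 4, lambda t: sum(price[i] for i in t) >= 45), with price a hypothetical item-price table.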
40. ExAnte Preprocessing Example
- min_sup = 4
- CM ≡ sum(X.price) ≥ 45
[Worked example, animated in the original slides: a small transaction database is repeatedly reduced, with transaction price sums and item supports recomputed at each step until the fix-point; the table did not survive the export.]
41. Experimental Results: Data Reduction
42. Experimental Results: Items Reduction
43. Experimental Results: Search Space Reduction
44. Experimental Results: Runtime Comparison
45. Further Exploiting the ExAnte Property
- ExAMiner:
  - ExAnte Miner (in contrast with ExAnte the preprocessor)
  - a miner which exploits anti-monotone and monotone constraints together
  - basic idea: generalize the ExAnte data reduction to all levels of a level-wise, Apriori-like computation
  - performs better on sparse datasets
  - breadth-first
- FP-bonsai: the art of growing and pruning small FP-trees
  - basic idea: embed the ExAnte data reduction in the FP-growth computation
  - performs very well on both dense and sparse datasets
  - depth-first
- The ExAnte property also works in other pattern domains (sequences, trees, graphs).
46. P3D Project (http://www-kdd.isti.cnr.it/p3d/)
An ISTI-C.N.R. internal curiosity-driven project:
- Pisa KDD Laboratory
- High Performance Computing Laboratory (DCI people: Salvatore Orlando, Raffaele Perego and others)
47. Activities
- Patternist: devising a knowledge discovery support environment focused on frequent pattern discovery, offering the repertoire of algorithms studied and implemented by the researchers participating in the project in the last few years.
- PDQL: devising a highly expressive query language for frequent pattern discovery.
- PPDM: devising privacy-preserving methods for frequent pattern discovery from sources that typically contain personal sensitive data.
- Applications: devising some benchmarking test beds in the domain of biological data, developed within the above environment.
- Other kinds of patterns: closed frequent itemsets, sequential patterns, graph-based frequent patterns, ...
48. What about constraint-based frequent itemset mining @ FIMI'04?