Title: Frequent Pattern Queries with Constraints (Language + Algorithms)
1. Frequent Pattern Queries with Constraints (Language + Algorithms)
Francesco Bonchi, Fosca Giannotti, Dino Pedreschi
Pisa KDD Laboratory, http://www-kdd.isti.cnr.it/
Workshop on Inductive Databases and Constraint Based Mining, 12/03/04
2. Frequent Pattern Queries: Language and Optimizations (Ph.D. Thesis)
- Supervisors: Dr. Fosca Giannotti, Prof. Dino Pedreschi
- International Reviewers: Prof. Jean-Francois Boulicaut, Prof. Jiawei Han
- Part 1: Data Mining Query Language
  - Frequent Pattern Queries (FPQs): definition
  - A language for FPQs
- Part 2: Optimizations
  - Algorithms for FPQs (pushing monotone constraints)
- Part 3: Conclusion
  - Optimized operational semantics for FPQs (putting together Part 1 and Part 2)
3. Plan of the Talk
- Language for Frequent Pattern Queries
- Algorithms for Constrained Frequent Pattern Mining
  - Adaptive Constraint Pushing [Bonchi et al. PKDD03]
  - ExAnte preprocessing [Bonchi et al. PKDD03]
- Further exploiting the ExAnte property
  - ExAMiner (breadth-first) [Bonchi et al. ICDM03]
  - FP-bonsai (depth-first) [Bonchi and Goethals PAKDD04]
- Ongoing and future work: the P3D project
4. An Interesting Feature of All Our Algorithms
- They provide the exact support of all solution itemsets.
- This feature distinguishes our algorithms from those presented by Luc (this morning) and by Johannes (this afternoon).
- In some sense, they solve different problems.
- Analogy with classical frequent itemset mining without constraints:
  - our algorithms ↔ frequent itemset mining (Apriori)
  - their algorithms ↔ maximal frequent itemset mining (MaxMiner)
5. Language for Frequent Pattern Queries
6. Our Objective
- To study Frequent Pattern Query optimizations in a Logic-based Knowledge Discovery Support Environment:
  - a flexible knowledge discovery system, capable of obtaining, maintaining, representing and using both induced and deduced knowledge in a unified framework.
- The need for such a system is suggested by real-world mining applications [Bonchi et al. KDD99, Giannotti et al. DMKD99].
- A Deductive Database can easily represent both extensional and intensional data.
- Previous works by our group have shown that this capability makes it viable for a suitable representation of domain knowledge and for supporting the various steps of the KDD process.
- LKDSE = Deductive Database + a set of inductive primitives.
- LKDSE = Inductive Database where the DB component is a Deductive DB.
7. Logic-based Knowledge Discovery Support Environment
[Architecture diagram: source data (relational extensions) and background knowledge feed the MINING step, specified by inductive rules; deductive rules drive PREPROCESSING (background knowledge integration) and POSTPROCESSING (reasoning on extracted knowledge); extracted knowledge flows back into the knowledge base (closure principle!).]
8. Logic-based Knowledge Discovery Support Environment
- The main issue for a deductive approach:
  - how to choose a suitable representation for the inductive part? (In other words: how to define inductive mining queries?)
- Manco's Ph.D. thesis:
  - inductive queries = user-defined aggregates
- LDL-Mine:
  - LDL (deductive database language and system) + user-defined aggregates + external calls to implement mining primitives
- Main drawback: the atomicity of the aggregate gives no optimization opportunity [Boulicaut and De Raedt PKDD02 Tutorial].
- Example: constraint-pushing techniques in frequent itemset mining would require the analyst to create her own aggregates for any different situation (conjunction of constraints).
9. Our Vision: Declarative Mining
- The analyst must have a high-level vision of the knowledge discovery system, without worrying about the details of the computational engine, in the very same way a database designer does not have to worry about query optimization.
- She just needs to declaratively specify in the inductive query what the desired patterns should look like and which conditions they must satisfy (a set of constraints).
- It is up to the query optimizer to compose all constraints and to produce the most efficient mining strategy (≈ execution plan) for the given inductive query.
10. Inductive Rule
- INDUCTIVE RULE: a conjunction of sentences about the desired patterns.
- H ← B1, ..., Bn
- H is a relation representing the induced pattern.
- The sentences B1, ..., Bn are taken from a restricted class of sentences.
- The set of all allowed sentences is just some "syntactic sugar" on top of an algorithm (the inductive engine).
- Each sentence can be defined over some relations.
- Having a well-defined and restricted set of admitted sentences allows us to write highly optimized algorithms to compute inductive rules with any conjunction of sentences.
- In particular, we focus on Frequent Pattern Queries.
11. Why Frequent Pattern Queries?
- The frequency of a pattern is the most important interestingness measure.
- Frequency has the right granularity to be a primitive in a DMQL:
  - it is a simple low-level concept
  - its computation is complex and time-consuming
  - many different kinds of patterns are based on frequency:
    - frequent itemsets
    - frequent sequences (or sequential patterns)
    - frequent episodes
    - frequent substructures in graph data
  - it can be used to define a whole class of data mining tasks:
    - Association Rules, Correlation Rules, Causality Rules, Ratio Rules, ...
    - Iceberg Queries and Iceberg Rules, Partial Periodicity, Emerging Patterns, ...
    - Classification, Clustering
12. Research Path Followed
- Goal: to define a language for FPQs expressive enough to capture most interesting inductive queries, yet simple enough to be highly optimized.
- FPQ Definition → identification of all the basic components of an FPQ.
- Syntax → syntactic sugar to express all the basic components of an FPQ.
- Safety → not all inductive rules derivable from the provided grammar are meaningful.
- Formal Semantics → by showing that there exists a unique mapping from each safe FPQ (inductive rule) of our language to a Datalog program (set of deductive rules) with user-defined aggregates (Manco's framework). Thanks to this mapping we can define the formal declarative semantics of an inductive rule as the iterated stable model of the corresponding Datalog program.
- Expressiveness → by means of a suite of examples of interesting complex queries.
13. Inductive Query Example
Compute simple association rules, having exactly 2 items in the head and at least 3 items in the body, creating transactions by grouping tuples by day and customer, having support greater than 1000 and confidence greater than 0.4, and spending at least 50 in toys (total sum of the prices of the items of type toy involved in the rule).

Inductive rule:
interesting_set(Set, Sup, Card, T) ←
    Sup = freq(Set, X), X = ⟨I, ⟨D, C⟩⟩, sales(D, C, I, Q),
    Sup ≥ 1000, Card = card(Set),
    J ⊆ Set, T = sum(P, product(J, P, toy)).

Deductive (LDL) rule:
interesting_rules(L, R, Sup, Conf) ←
    interesting_set(Set, Sup, Card, T), Card ≥ 5, T ≥ 50,
    interesting_set(R, S1, 2, T1), subset(R, Set),
    difference(Set, R, L), Conf = Sup / S1, Conf ≥ 0.4.
14. Algorithms for Constrained Frequent Pattern Mining
15. Why Constraints?
- Frequent pattern mining usually produces too many solution patterns. This situation is harmful for two reasons:
  - Performance: mining is usually inefficient or, often, simply unfeasible.
  - Identification: finding the fragments of interesting knowledge, blurred within a huge quantity of small, mostly useless patterns, is a hard task.
- Constraints are the solution to both these problems:
  - they can be pushed into the frequent pattern computation, exploiting them to prune the search space, thus reducing time and resource requirements;
  - they give the user guidance over the mining process and a way of focusing on the interesting knowledge.
- With constraints we obtain fewer, more interesting patterns. Indeed, constraints are the way we define what is interesting.
16. Constrained Frequent Itemset Mining Problem
- Notation:
  - We denote the frequency constraint by Cfreq, without explicitly indicating the dataset and the min_sup.
  - Given a constraint C, let Th(C) = {X | C(X)} denote the set of all itemsets X that satisfy C.
- The frequent itemset mining problem requires computing Th(Cfreq).
- The constrained frequent itemset mining problem requires computing Th(Cfreq) ∩ Th(C).
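To make the problem statement concrete, here is a naive generate-and-test rendering in Python (a specification sketch under our own names: tdb is a list of transactions represented as Python sets, c is any Boolean constraint on itemsets; none of this is from the original slides):

    from itertools import combinations

    def support(itemset, tdb):
        # Number of transactions containing every item of the itemset.
        return sum(1 for t in tdb if itemset <= t)

    def th_freq_and_c(tdb, min_sup, c):
        # Naive Th(Cfreq) intersect Th(C): enumerate every itemset over
        # the item alphabet and test both constraints. Exponential: a
        # specification of the problem, not a usable algorithm.
        items = sorted(set().union(*tdb))
        solutions = {}
        for k in range(1, len(items) + 1):
            for xs in combinations(items, k):
                x = frozenset(xs)
                sup = support(x, tdb)
                if sup >= min_sup and c(x):
                    solutions[x] = sup  # exact support (cf. slide 4)
        return solutions

For instance, th_freq_and_c(tdb, 4, lambda x: sum(price[i] for i in x) >= 45) would match the setting of slide 40, with price a hypothetical item-price table.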
17. Problem Definition: Anti-monotone Constraints
- Frequency is an anti-monotone constraint:
  - "Apriori trick": if an itemset X does not satisfy Cfreq, then no superset of X can satisfy Cfreq.
- Other examples of anti-monotone constraints:
  - sum(X.prices) ≤ 20 euro
  - |X| ≤ 5
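In code, the Apriori trick is a subset check against the previous level (a minimal sketch, not any particular implementation):

    from itertools import combinations

    def prune_by_antimonotonicity(candidates, previous_level):
        # A k-itemset can satisfy an anti-monotone constraint (e.g.
        # frequency) only if all of its (k-1)-subsets did;
        # previous_level holds the (k-1)-itemsets that survived.
        kept = []
        for x in candidates:
            if all(frozenset(s) in previous_level
                   for s in combinations(x, len(x) - 1)):
                kept.append(x)
        return kept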
18. Problem Definition: Monotone Constraints
- A constraint CM is monotone when, if an itemset X satisfies CM, then every superset of X satisfies CM as well (e.g. sum(X.prices) ≥ 45, as in the ExAnte example later).
19. Our Problem
To compute the itemsets which satisfy a conjunction of anti-monotone and monotone constraints.
- Why monotone constraints?
  - They are the most useful for discovering local high-value patterns (for instance, very expensive or very large itemsets, which can be found only with a very small min_sup).
  - We have known how to exploit the other kinds of constraints (anti-monotone, succinct) since '98 [Ng et al. SIGMOD98], while for monotone constraints the situation is more complex.
20. Search Space Characterization
21. Problem Characterization
22. Tradeoff between AM and M Pruning
23. The Two Extremes: Priority to AM
- Strategy GT (Generate and Test):
  - Apriori level-wise computation, followed by the test of CM.
  - Maximum possible anti-monotone pruning; no monotone pruning.
24. The Two Extremes: Priority to M
- Strategy MCP (Monotone Constraint Pushing):
  - First finds B(CM), the monotone border. Then only itemsets over the border are generated as candidates to be tested for frequency.
  - Candidate generation function generate_over [Boulicaut and Jeudy 2000]: add one item to each solution of the previous level.
  - Anti-monotone pruning is only partially possible.
  - Maximum possible monotone pruning; little anti-monotone pruning.
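A minimal sketch of the generate_over idea described above (our own rendering, not Boulicaut and Jeudy's code):

    def generate_over(prev_solutions, items):
        # Candidates over the monotone border: each solution of the
        # previous level, extended by one item not already in it.
        candidates = set()
        for x in prev_solutions:
            for i in items:
                if i not in x:
                    candidates.add(x | frozenset([i]))
        return candidates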
25. Strategies Analysis (w.r.t. Frequency Tests Performed)
[Figure: the search space split into regions (solutions; not prunable; AM prunable; M prunable; AM and M prunable) around the borders B-(Cfreq) and B(CM), comparing the frequency tests performed by GT, MCP and an ideal strategy.]
Neither of the two strategies outperforms the other on every input problem.
26. Adaptive Constraint Pushing
27. Adaptive Constraint Pushing
- ACP explores a portion of Th(¬CM) looking for infrequent itemsets (the negative border of frequency).
- Each infrequent itemset found (region 5) will help in AM-pruning the search space in Th(CM) (in particular, region 3).
- Each infrequent itemset lost (region 5) will induce the exploration of a portion of region 6.
- Each itemset found frequent (region 4) is just a useless frequency test performed.
- Two questions:
  - Which candidates form a good portion? Those itemsets which have a higher probability of being found infrequent.
  - How large should this portion be? This is what the adaptivity of ACP is about.
28. Adaptivity Parameter
- We introduce a parameter α ∈ [0,1] which represents the fraction of candidates under the monotone border to be chosen among all possible candidates under the monotone border.
- It is initialized after the first scan of TDB using all the available information, for example:
  - the number of transactions in TDB
  - the total number of 1-itemsets (singleton items)
  - the number of frequent 1-itemsets
  - the number of 1-itemsets satisfying the monotone constraint
  - etc.
- It is updated level by level with the newly collected knowledge.
- By updating α, ACP adapts its behaviour to the given input TDB and constraints.
- Extreme cases:
  - if α = 0 constantly, then ACP ≡ MCP
  - if α = 1 constantly, then ACP ≡ GT
29. The Algorithm
- Notation at iteration k (k-itemsets):
  - Pk: set of itemsets whose proper subsets do not satisfy CM and have not been found infrequent
  - Bk: subset of Pk containing the itemsets which satisfy CM (positive monotone border)
  - Ek: Pk \ Bk
  - CkO: candidates over the monotone border
  - CkU: candidates under the monotone border
  - Rk: solutions
  - Nk: itemsets under the monotone border found infrequent
  - N: union of all Nk
30. The Algorithm
[Flowchart, in the notation of slide 29: Rk-1 feeds generate_over to produce CkO; Pk is split by the test "does it satisfy CM?" into Bk and Ek; the α-selection draws CkU from Ek; frequent candidates flow into the solutions Rk, infrequent ones from CkU into N; generate_apriori builds Pk+1.]
31. The Algorithm
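The pseudocode that filled this slide is not in the text export. Below is a hedged reconstruction of the level-wise skeleton from the notation of slides 29-30, reusing the support and generate_over sketches above; the α update rule, the support estimate and all helper names are our own assumptions, not the original code:

    def acp(tdb, min_sup, c_m, items, alpha=0.5):
        # Adaptive Constraint Pushing, level-wise skeleton (a sketch).
        tdb = [frozenset(t) for t in tdb]
        sup1 = {i: support(frozenset([i]), tdb) for i in items}
        est = lambda x: min(sup1[i] for i in x)       # crude estimate
        r, n = {}, set()                              # solutions / N
        r_prev = set()
        p_k = {frozenset([i]) for i in items}         # P1: singletons
        while p_k or r_prev:
            b_k = {x for x in p_k if c_m(x)}          # Bk: satisfy CM
            e_k = p_k - b_k                           # Ek = Pk \ Bk
            c_over = {x for x in generate_over(r_prev, items) | b_k
                      if not any(s <= x for s in n)}  # CkO, AM-pruned by N
            ranked = sorted(e_k, key=est)             # alpha-selection
            c_under = set(ranked[:int(alpha * len(ranked))])   # CkU
            sups = {x: support(x, tdb) for x in c_over | c_under}
            r_k = {x: s for x, s in sups.items()
                   if s >= min_sup and c_m(x)}        # Rk: solutions
            n_k = {x for x in c_under if sups[x] < min_sup}    # Nk
            n |= n_k
            r.update(r_k)
            if c_under:                               # adapt alpha (slide 32)
                focus = len(n_k) / len(c_under)       # alpha-focus
                alpha = min(1.0, alpha + 0.1) if focus > 0.9 \
                        else max(0.0, alpha - 0.1)
            p_k = {x | frozenset([i])                 # Pk+1: extend survivors
                   for x in e_k - n_k for i in items if i not in x}
            r_prev = set(r_k)
        return r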
32. α-Selection and Adaptivity
- How are candidates selected by α?
  - Among all itemsets in Ek, the α-portion with the lowest estimated support is selected to enter CkU.
  - Support is estimated using the real support values of the singleton items belonging to the itemset, balancing between complete independence (product of the values) and maximal correlation (minimum value).
- How does α adapt itself?
  - According to the performance of the α-selection at the previous iteration:
  - α-focus = |Nk| / |CkU|
  - An α-focus very close to 1 → very good selection → probably α is selecting too few candidates → it risks losing some infrequent itemsets → the α value is raised accordingly.
  - A low α-focus → poor selection → α is selecting too many candidates → the α value is reduced accordingly.
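The estimate described in the first bullet can be written as a simple interpolation between the two extremes; this is a sketch under our own naming (estimate_support, lam), not the authors' formula:

    def estimate_support(itemset, rel_sup, n_transactions, lam=0.5):
        # Interpolate between complete independence (product of the
        # singletons' relative supports) and maximal correlation (their
        # minimum); lam is a hypothetical balancing knob in [0,1].
        sups = [rel_sup[i] for i in itemset]
        independent = 1.0
        for s in sups:
            independent *= s
        return n_transactions * (lam * independent + (1 - lam) * min(sups))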
33. Experimental Results
34. ExAnte (a Preprocessing Algorithm)
35. AM vs. M
- State of the art before ExAnte: when dealing with a conjunction of AM and M constraints, we face a tradeoff between AM and M pruning.
- The tradeoff: pushing M constraints into the computation can help prune the search space, but at the same time can reduce the AM pruning opportunities.
- Our observation: this is true only if we focus exclusively on the search space of itemsets. Reasoning on both the search space and the input TDB together, we can find the real synergy of AM and M pruning.
- The real synergy: do not exploit M constraints directly to prune the search space, but use them to reduce the data, which in turn induces a much stronger pruning of the search space.
- The real synergy of AM and M pruning lies in data reduction.
36. ExAnte: μ-Reduction
- Definition (μ-reduction): given a transaction database TDB and a monotone constraint CM, we define the μ-reduction of TDB as the dataset resulting from pruning the transactions that do not satisfy CM.
- Example: CM ≡ sum(X.price) ≥ 55
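A minimal sketch of μ-reduction (our own naming; transactions are Python sets of items and c_m is a Boolean predicate on itemsets):

    def mu_reduce(tdb, c_m):
        # mu-reduction: drop every transaction that does not satisfy the
        # monotone constraint CM; such a transaction can support no
        # solution itemset (see the ExAnte property on the next slide).
        return [t for t in tdb if c_m(t)]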
37. ExAnte Property
- If a transaction does not satisfy CM then, by monotonicity, none of its subsets does either; hence the transaction supports no solution itemset, and μ-reducing TDB leaves the support of every solution unchanged.
38. ExAnte: α-Reduction
- α-reducing a transaction means deleting from the transaction the infrequent singleton items (or, more generally, the singleton items which do not satisfy a given anti-monotone constraint).
- α-reducing a database of transactions means α-reducing all the transactions in the database.
- α-reducing a database is correct, i.e. it does not change the support of solution itemsets (trivial by anti-monotonicity).
39. A Fix-Point Computation
[Diagram: a cycle over TDB. α-reduction yields shorter transactions → fewer transactions satisfy CM, so μ-reduction leaves fewer transactions in TDB → fewer frequent 1-itemsets → further α-reduction → ... until a fix-point is reached.]
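Putting the two reductions together, ExAnte preprocessing is a short fix-point loop over the mu_reduce and alpha_reduce sketches above:

    def exante(tdb, min_sup, c_m):
        # ExAnte preprocessing: alternate mu-reduction and alpha-reduction
        # until nothing changes; by the ExAnte property the reduced TDB
        # preserves every solution itemset with its exact support.
        tdb = [frozenset(t) for t in tdb]
        while True:
            reduced = alpha_reduce(mu_reduce(tdb, c_m), min_sup)
            reduced = [t for t in reduced if t]   # drop emptied transactions
            if reduced == tdb:                    # fix-point reached
                return tdb
            tdb = reduced

For the example of the next slide one would call exante(tdb, 4, lambda t: sum(price[i] for i in t) >= 45), with price a hypothetical item-price table.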
40. ExAnte Preprocessing Example
- min_sup = 4
- CM ≡ sum(X.price) ≥ 45
[Worked example, animated in the original slides: a small transaction database is repeatedly reduced, with transaction price sums and item supports recomputed at each step until the fix-point; the table did not survive the export.]
41. Experimental Results: Data Reduction
42. Experimental Results: Items Reduction
43. Experimental Results: Search Space Reduction
44. Experimental Results: Runtime Comparison
45. Further Exploiting the ExAnte Property
- ExAMiner:
  - ExAnte Miner (in contrast with ExAnte the preprocessor)
  - a miner which exploits anti-monotone and monotone constraints together
  - basic idea: generalize the ExAnte data reduction to all levels of a level-wise, Apriori-like computation
  - performs better on sparse datasets
  - breadth-first
- FP-bonsai: the art of growing and pruning small FP-trees
  - basic idea: embed the ExAnte data reduction in the FP-growth computation
  - performs very well on both dense and sparse datasets
  - depth-first
- The ExAnte property also works in other pattern domains (sequences, trees, graphs).
46. P3D Project (http://www-kdd.isti.cnr.it/p3d/)
An ISTI-C.N.R. internal curiosity-driven project:
- Pisa KDD Laboratory
- High Performance Computing Laboratory (DCI people: Salvatore Orlando, Raffaele Perego and others)
47. Activities
- Patternist: devising a knowledge discovery support environment focused on frequent pattern discovery, offering the repertoire of algorithms studied and implemented by the researchers participating in the project in the last few years.
- PDQL: devising a highly expressive query language for frequent pattern discovery.
- PPDM: devising privacy-preserving methods for frequent pattern discovery from sources that typically contain personal sensitive data.
- Applications: devising some benchmarking test beds in the domain of biological data, developed within the above environment.
- Other kinds of patterns: closed frequent itemsets, sequential patterns, graph-based frequent patterns, ...
48. What about constraint-based frequent itemset mining @ FIMI'04?