A Perspective on Inductive Databases



1
A Perspective on Inductive Databases

Luc De Raedt
Institut für Informatik
Albert-Ludwigs Universität Freiburg
Germany
deraedt_at_informatik.uni-freiburg.de

2
Joint work with Lee Sau Dan, Johannes Fischer,
Christoph Helma, Manfred Jaeger, Stefan Kramer,
Heikki Mannila and cInQ
3
Three Parts

Introduction to Inductive Databases
Inductive Database Systems
MolFea: Mining features in molecules
(MineSeqLog: Mining logical sequences)
Inductive databases
Integrate data mining with databases
Querying for patterns using constraints

4
Inductive databases

Data mining
search for interesting and understandable
patterns in data
The state of the art in data mining resembles that of
databases in the early days
A theory of data mining is lacking
View by Imielinski and Mannila (CACM 96)
Make first class citizens out of patterns
Query not only the data but also the patterns
Tightly integrate data mining and databases

5
Inductive Querying for Active Mining

The topic of the European cInQ project
Consortium on Inductive Querying
Active Mining has many aspects
One important aspect is
interactive mining
interaction with the user
user-support for mining

6
Inter-active Mining

The need to actively mine / analyze scientific
databases in biology, chemistry
Understandable patterns needed
Scientist wants control of mining process
Constraint based mining
Constraints specify patterns of interest
E.g. find all patterns that occur in at least 30% of
the actives and at most 3% of the inactives and
contain a benzene ring
Mining becomes a querying process
"There is no such thing as real discovery, just a
matter of the expressive power of the query
languages" (Imielinski and Mannila, CACM 96)

7
Part II

Examples of (simple) inductive database systems
MolFea: the Molecular Feature Miner
MineSeqLog: Mining logical sequences

8
Molecular Feature Mining

What ?
Find fragments (substructures) of interest in
sets of molecules
Why ?
Discover new knowledge
Use in predictive models
SAR (Structure Activity Relationship)

9
Molecules and Fragments

2D-structure
essentially graphs
Fragments
substructures
We use linear fragments
Sequences of atoms and bonds
Linear fragments
o, c, cl, n, s, ... denote elements
- ... single bond ... double bond ...
triple bond ... aromatic bond
(hydrogens implicit)
Smarts encoding

O-cccc-Cl
10
Smiles encoding

Smiles
Compact encoding of molecular structure
Used by computational chemists
Supported by many tools (e.g. Daylight)
Very compact !
Efficient matching

11
Smiles encoding
12
Constraint-based Data Mining

What ?
Use constraints to specify which
fragments/patterns are interesting
E.g. Frequency and syntax
Why ?
Declarative Querying
Interactive Process
Inductive database idea

13
Constraints on Fragments

Possibility to specify constraints on the
fragments of interest.
Regarding their generality, regarding their
frequency in the data
Various constraint solvers have been implemented

14
Constraint-based data mining

Generality
One fragment is more general than another one if
it is a substructure of the other one
Notation: g ≤ s (g is more general than s, i.e.
g will match a graph/string whenever s does)
Graphs: subgraph relationship
Strings: substring / subsequence relationship
E.g. aabbcc is more general than ddaabbccee
(substring)
E.g. abc is more general than aabbcc
(subsequence)
(Item)sets: subset relation, e.g. {a,b} ⊆ {a,b,c}
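
The three generality relations listed on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not taken from the talk, and graphs are left out because subgraph matching needs a dedicated library.

```python
def is_substring(g, s):
    """g is more general than s if g occurs contiguously in s."""
    return g in s


def is_subsequence(g, s):
    """g is more general than s if the symbols of g occur in s in order."""
    remaining = iter(s)
    # 'symbol in remaining' advances the iterator until the symbol is found
    return all(symbol in remaining for symbol in g)


def is_subset(g, s):
    """For itemsets, g is more general than s if g is a subset of s."""
    return set(g) <= set(s)


print(is_substring("aabbcc", "ddaabbccee"))   # substring example from the slide
print(is_subsequence("abc", "aabbcc"))        # subsequence example from the slide
print(is_subset({"a", "b"}, {"a", "b", "c"}))  # itemset example from the slide
```

Note that abc is a subsequence but not a substring of aabbcc, so the choice of relation changes which patterns count as more general.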

15
Search Space for Strings
16
Primitives

Generality in MolFea: symmetry!
g is equivalent to s (syntactic variants) only
when they are a reversal of one another
E.g. 'C-O-S' and 'S-O-C' denote the same
substructure
g is more general than s if and only if g is a
subsequence of s or g is a subsequence of the
reversal of s
E.g. 'Cl-O-S' ≤ 'Cl-O-S-ccc'
E.g. 'O-Cl' ≤ 'Cl-O-S'
Frequency of a fragment f on a data set D
The percentage of data points in D that f occurs
in
E.g. let f be aa and let D = {abaa, acc, caa}
freq(f, D) = 2/3 ≈ 0.66
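
A minimal sketch of these two primitives, under simplifying assumptions that are not part of the talk: a fragment is split into tokens at '-' (so only single bonds are handled), and matching against data is passed in as a function rather than delegated to a chemistry toolkit. All names are hypothetical.

```python
def is_subsequence(g, s):
    """True if the items of g occur in s in the same order."""
    remaining = iter(s)
    return all(item in remaining for item in g)


def tokens(fragment):
    """Assumed tokenisation: split a linear fragment at single bonds."""
    return fragment.split("-")


def more_general(g, s):
    """g <= s: g is a subsequence of s or of the reversal of s."""
    tg, ts = tokens(g), tokens(s)
    return is_subsequence(tg, ts) or is_subsequence(tg, ts[::-1])


def equivalent(g, s):
    """Syntactic variants: identical, or reversals of one another."""
    return tokens(g) == tokens(s) or tokens(g) == tokens(s)[::-1]


def freq(f, data, matches):
    """Relative frequency: fraction of data points that f matches."""
    return sum(1 for d in data if matches(f, d)) / len(data)


print(more_general("Cl-O-S", "Cl-O-S-ccc"))
print(more_general("O-Cl", "Cl-O-S"))      # holds only via the reversal
print(equivalent("C-O-S", "S-O-C"))
print(freq("aa", ["abaa", "acc", "caa"], is_subsequence))
```

The last call reproduces the slide's example: aa occurs in abaa and caa but not in acc, giving 2/3.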

17
Primitive Constraints

f ≤ P, P ≤ f, not (f ≤ P) and not (P ≤ f)
f ... unknown target fragment, P ... a specific
fragment, e.g. 'abbaa' ≤ f
freq(f, D): relative frequency of a fragment f on
a data set D
freq(f, D1) ≥ t, freq(f, D2) ≤ t
t ... positive real number between 0 and 1; D1, D2 ... data sets
e.g. freq(f, Pos) ≥ 0.20

18
Example query

Let E1 = {aabbcc, abbc, bb}
Let E2 = {abc, bc, cc}
freq(f, E1) ≥ 2 and freq(f, E2) ≤ 0 and a < f
Solutions: abb and abbc
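
The example can be checked by brute force. The sketch below assumes that freq is an absolute count in this example (the thresholds 2 and 0 suggest so) and that generality is the substring relation; the helper names are hypothetical. It simply enumerates candidates and is not the solver used in the talk.

```python
def substrings(s):
    """All non-empty contiguous substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def count(f, data):
    """Absolute frequency: number of strings in data that contain f."""
    return sum(1 for d in data if f in d)


E1 = ["aabbcc", "abbc", "bb"]
E2 = ["abc", "bc", "cc"]

# Any solution must occur in E1, so substrings of E1 are enough as candidates.
candidates = set().union(*(substrings(d) for d in E1))

solutions = sorted(
    f for f in candidates
    if count(f, E1) >= 2          # minimum frequency on E1
    and count(f, E2) <= 0         # maximum frequency on E2
    and "a" in f and f != "a"     # a is strictly more general than f
)
print(solutions)
```

Running it yields abb and abbc, the two solutions given on the slide: ab is excluded because it occurs in abc.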

19
Example Queries

('N-O' ≤ f) ∧ (freq(f, Act) ≥ 0.1) ∧ (freq(f,
Inact) ≤ 0.01)
not ('F' ≤ f) ∧ not ('Cl' ≤ f) ∧ not ('Br' ≤ f) ∧
not ('I' ≤ f) ∧ (freq(f, Act) ≥ 0.05) ∧
(freq(f, Inact) ≤ 0.02)
Queries are conjunctions of primitive constraints

20
Representing Solutions

Traditional min. frequency constraint
Let c be freq(f, Act) ≥ x
c satisfies the anti-monotonicity property
If we have fragments g ≤ s,
then if s is a solution, g is a solution as
well
Imposes a lower border S = max(Sol) on the space of
solutions
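
To make the border idea concrete, the following sketch computes S as the set of maximally specific solutions of a minimum frequency constraint. It reuses the data set E1 from the example query slide, assumes substring generality and absolute counts, and uses hypothetical helper names.

```python
def substrings(s):
    """All non-empty contiguous substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def count(f, data):
    """Number of strings in data that contain f."""
    return sum(1 for d in data if f in d)


def maximal(patterns):
    """Patterns that are not a substring of any other pattern in the set."""
    return {p for p in patterns if not any(p != q and p in q for q in patterns)}


E1 = ["aabbcc", "abbc", "bb"]
universe = set().union(*(substrings(d) for d in E1))

sol = {f for f in universe if count(f, E1) >= 2}   # anti-monotonic constraint
S = maximal(sol)                                    # lower border

# Anti-monotonicity: every generalisation of a solution is a solution too,
# so Sol is exactly the set of substrings of the border elements.
closed_downward = all(g in sol for s in sol for g in substrings(s))
print(S, closed_downward)
```

Here a single border element, abbc, stands for all nine solutions.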

21
A String Example

freq(f, D) ≥ 2 where D

22
Another String Example

Let f ≤ ABD

23
Representing Solutions

Traditional max frequency constraint
Let c be freq(f, Act) < x
c satisfies the monotonicity property
If we have fragments g ≤ s,
then if g is a solution, s is a solution as
well
Imposes an upper border G = min(Sol) on the space
of solutions
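
The dual border can be sketched the same way. The example below is an assumption-laden illustration: it takes the maximum frequency constraint count(f, E2) < 1 on the data set E2 from the example query slide, restricts the pattern space to substrings of E1 so that it stays finite, and uses hypothetical helper names.

```python
def substrings(s):
    """All non-empty contiguous substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def count(f, data):
    """Number of strings in data that contain f."""
    return sum(1 for d in data if f in d)


def minimal(patterns):
    """Patterns that contain no other pattern of the set as a substring."""
    return {p for p in patterns if not any(p != q and q in p for q in patterns)}


E1 = ["aabbcc", "abbc", "bb"]
E2 = ["abc", "bc", "cc"]
universe = set().union(*(substrings(d) for d in E1))

sol = {f for f in universe if count(f, E2) < 1}   # monotonic constraint
G = minimal(sol)                                   # upper border

# Monotonicity: every specialisation (within the universe) of a border
# element is again a solution.
closed_upward = all(f in sol for f in universe for g in G if g in f)
print(sorted(G), closed_upward)
```

The border G consists of aa, bb and bcc; every longer pattern in the universe that contains one of them also satisfies the constraint.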

24
A String Example

Consider B ? f and freq(f,D) ? 2 with D

25
Constraints

Anti-monotonic
In ML

Monotonic
In ML

26
Mitchell's Version Space

Consider now a conjunctive query
We want to compute

27
Mitchell's Version Spaces
Is more general
Solutions
28
Some problems

There exist conjunctive queries q such that
Sol(q) is not boundary set representable; these
queries are not safe
Boundary sets may be infinite
Or may not be complete

29
Computing Borders

Borders completely characterize the set of
solutions for safe queries
If solution set is finite, then query is safe
Combination of well-known algorithms to compute
border wrt
Level wise algorithm by Agrawal et al., Mannila
and Toivonen
Mitchell's and Mellish's version space algorithms
In our level wise version space algorithm
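
As a rough illustration of the levelwise idea, here is an Apriori-style search for frequent substrings. It is a sketch under stated assumptions (substring generality, absolute counts, hypothetical function name), not the level wise version space algorithm of the talk, which additionally handles monotonic constraints.

```python
def levelwise_frequent(data, min_count):
    """Return all substrings occurring in at least min_count strings of data."""
    alphabet = sorted(set("".join(data)))

    def count(f):
        return sum(1 for d in data if f in d)

    # Level 1: frequent single symbols.
    level = {a for a in alphabet if count(a) >= min_count}
    frequent = set(level)
    while level:
        # Extend each frequent string by one symbol; keep a candidate only
        # if its suffix of the same length is frequent as well (pruning
        # justified by anti-monotonicity; the prefix is frequent by construction).
        candidates = {
            x + c for x in level for c in alphabet if (x + c)[1:] in level
        }
        level = {f for f in candidates if count(f) >= min_count}
        frequent |= level
    return frequent


print(sorted(levelwise_frequent(["aabbcc", "abbc", "bb"], 2)))
```

Only candidates whose immediate generalisations are frequent are ever counted against the data, which is what keeps the number of database passes low.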

30

Levelwise Version Spaces
Minimum frequency (or anti-monotonic constraint)
Is more general
31
Dual computation
min frequency
G
Is more general
S
Swap the roles of frequent and infrequent fragments
Expand infrequent ones; discard frequent ones
during search
32
Level Wise Version Space Algorithm
max frequency (or monotonic constraint)
G
Is more general
G
S
33
The HIV Data Set

Developmental Therapeutics Program's AIDS
Antiviral Screen Database (http://dtp.nci.nih.gov)
One of the largest public domain databases of
this type
Measures protection of human CEM cells from HIV-1
infection using a soluble formazan assay
We retained 41768 compounds (after pre-processing
the whole data set of 43382 ones)
40282 Confirmed Inactive
1069 Confirmed Moderately Active
417 Confirmed Active

34
Experimental Setup

Discover patterns that are statistically
significantly over-represented in the active
compounds and under-represented in the inactive
ones
Minimum frequency in actives: 3%, i.e. 13
compounds
Maximum frequency on inactives computed using χ²
(0.999) and size of classes
For CM: 8; for CI: 516
Matching Smiles and Smarts using the Daylight tool!

35
Levelwise Version Space Algorithm
max frequency
G
Is more general
G
S
36
(No Transcript)
37
Discovered Fragments (Actives vs. Moderately
Actives)
38
Discovered Fragments (Actives vs. Inactives)
39
AZT (Azidothymidine)

The majority of these fragments are derivatives
of AZT.
Gives insight into the structural requirements
for anti-HIV activity.
A rediscovery that proves the principle
Post-processing
Combine fragments ?

40
Use of Fragments SAR

Use as fingerprints/descriptors for SAR model
building
Feed data into your favorite data
mining/statistical package
Neural Nets
Decision Trees
(Logistic) Regression
Support Vector Machines
Bayesian Methods
Principal Component Analysis

41
Use of Fragments SAR

Several experiments reported on problems from
predictive toxicology, cf. Kramer and De Raedt,
ICML 01
Best results in combination with SVMs
2 year rodent carcinogenicity assay (NTP) 70
500 compounds
Mutagenicity (Ames Test) 80 800 compounds
Method has proven its use on several benchmark
problems

42
Ongoing Work: MolFea

Work with branched fragments instead of linear
sequences
conceptually easy, computationally more expensive
Use abstractions, e.g. H-bond donor/acceptor,
lipophilic center
Deriving 3D fragments
Annotate fragments with 3D information
Initial implementation works
Goal: mining for pharmacophores
Integrate MolFea in existing chemical databases
with GUI for interactive exploration
Various activities on the solver side

43
Mining SeqLog

SeqLog: a simple, vanilla Datalog-like language
for structured sequences
MineSeqLog supports the same primitives as MolFea

44
Principles of SeqLog
45
SeqLog for Mining

It is possible to define a notion of
Substring
Subsequence
Resolution and Fixpoint
For SeqLog Programs

46
MineSeqLog

Apply the idea of inductive databases to SeqLog,
i.e. use constraints such as
Minimum and maximum frequency
Generality
Related to subsequence / substring matching
Background knowledge
To specify patterns (sequences) of interest

47
Other idea

Use SeqLog as a dedicated representation language
for data mining (a la Inductive Logic
Programming)
Many interesting open (?) questions
Distance based learning methods
Require distance measure on sequences
Operations on sequences required to do learning
STRONG relation to corresponding operations on
strings ! E.g. maximal or longest common
subsequence/substrings
Hidden Markov Models (Kersting et al., PSB 03)

48
Part III

Inductive database principles

49
An analogy with databases

Why is the relational model so successful?
A general purpose query language with nice
properties
simple theoretical foundations
declarative semantics
closure principle
The same is needed for KDD applications
The ultimate goal of IDBs is to find the
equivalent of Codd's relational database model
for use in data mining

50
Inductive database principles

What is an inductive database ?
A set of data sets
A set of pattern sets
IDB languages
A query language that generates data sets
An inductive query language that generates
pattern sets
Closure principle !
A set and logic oriented view
Not a universal framework, though quite general

51
Boolean Inductive Queries
52
Manipulation

create data set D as query
create view data set D as query
create pattern set P as query
create pattern view P as query
Insert / Delete / Update statements

53
Illustration

create data set D4 as {aa, ab, bb}
create pattern view P2 as freq(f, D4) ≥ 2
At this point P2 = {a, b}
update data set D4 insert abc
Update P2 too: P2 = P2 ∪ {ab}
Incremental data mining!
Insert ab into pattern view P2
Pattern view update problem
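
The illustration can be replayed with a small sketch of a pattern view that is maintained incrementally when its data set grows. The class and method names are hypothetical, patterns are substrings, and freq is taken as an absolute count; the talk does not prescribe this design.

```python
def substrings(s):
    """All non-empty contiguous substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


class PatternView:
    """Patterns f with count(f, data) >= min_count, kept in sync with the data."""

    def __init__(self, data, min_count):
        self.data = list(data)
        self.min_count = min_count
        self.patterns = {
            f for d in self.data for f in substrings(d)
            if self._count(f) >= min_count
        }

    def _count(self, f):
        return sum(1 for d in self.data if f in d)

    def insert(self, s):
        """Insert a string; only substrings of s can become newly frequent."""
        self.data.append(s)
        added = {
            f for f in substrings(s)
            if f not in self.patterns and self._count(f) >= self.min_count
        }
        self.patterns |= added
        return added


P2 = PatternView(["aa", "ab", "bb"], min_count=2)
print(sorted(P2.patterns))
print(P2.insert("abc"))
print(sorted(P2.patterns))
```

Insertions into the data are the easy direction, because a minimum frequency view can only grow. Inserting a pattern such as ab directly into the view is the harder pattern view update problem named on the slide, since it would require changing the underlying data.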

54
Query evaluation

How to evaluate boolean inductive queries ?
Observe
MolFea conjunction of anti-monotonic and
monotonic constraints
can be answered using level wise version space
algorithm
solutions form a version space, can be
represented by border sets.

55
Query Evaluation
56
Query Optimisation
57
Properties of inductive queries
58
How many version spaces do we need ?
59
Operations on solution spaces

Logical operations on primitives have a set
oriented counter part
An analogy with relational algebra
A query consists of relational operations
Operations can be used to optimize query
answering process.
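
The correspondence between logical connectives and set operations can be checked on a finite pattern space. The sketch below assumes substring patterns over the data set E1 from the example query slide and absolute counts; sol and the predicate names are hypothetical.

```python
def substrings(s):
    """All non-empty contiguous substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def count(f, data):
    """Number of strings in data that contain f."""
    return sum(1 for d in data if f in d)


E1 = ["aabbcc", "abbc", "bb"]
E2 = ["abc", "bc", "cc"]
universe = frozenset().union(*(substrings(d) for d in E1))


def sol(pred):
    """Solution set of a primitive or compound query over the universe."""
    return frozenset(f for f in universe if pred(f))


def p1(f):
    return count(f, E1) >= 2      # anti-monotonic primitive


def p2(f):
    return count(f, E2) == 0      # monotonic primitive


# Conjunction, disjunction and negation map to intersection, union, complement.
conj_ok = sol(lambda f: p1(f) and p2(f)) == sol(p1) & sol(p2)
disj_ok = sol(lambda f: p1(f) or p2(f)) == sol(p1) | sol(p2)
neg_ok = sol(lambda f: not p1(f)) == universe - sol(p1)
print(conj_ok, disj_ok, neg_ok)
print(sorted(sol(p1) & sol(p2)))
```

Because the operands are plain sets, an optimiser is free to evaluate the cheaper primitive first and intersect, in the same spirit as reordering joins in relational algebra.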

60
Operations on solution spaces

Two approaches
Develop data structures that support operations
Develop operations that work on border sets
Cf. Employ operations by Haym Hirsh, Gunter et
al. for our purposes
Combine the two approaches
Version Space Trees

61
Version Space Intersections
62
Version space intersections
63
Version space union
64
Union on Borders
65
(No Transcript)
66
Version space tree
67
Version space trees

Interesting properties
Membership testing very efficient
Size of VSTree at most
Easy to go from VSTree to G and S, and vice versa
Can be constructed in two phases
Descend (Apriori-tries), Ascend
Combines advantages of suffix trees with version
spaces
Operations on version space trees: for now, finite
trees only.

68
Reasoning
69
Memory organisation

Consider
q1: freq(f, D) > m
q2: freq(f, D ∪ M) > m (Sol(q1) ⊆ Sol(q2))
q3: freq(f, D) > m OR freq(f, M) > m (Sol(q3) ⊆ Sol(q2))
Scenarios
q1 answered and stored; q2 asked
q2 answered and stored; q1 asked
Keep track of subset relations among pattern sets
/ data sets
Keep track of relations among patterns
(generality structure) within given pattern set
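
The subset relations among these queries, and how a stored answer can be reused, can be sketched as follows. The data sets D and M are made up for illustration, freq is assumed to be an absolute count, D ∪ M is treated as list concatenation, and all names are hypothetical.

```python
def substrings(s):
    """All non-empty contiguous substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def count(f, data):
    """Number of strings in data that contain f."""
    return sum(1 for d in data if f in d)


D = ["aab", "ab", "abb"]   # hypothetical data set
M = ["abb", "bb"]          # hypothetical additional data
m = 1
universe = set().union(*(substrings(d) for d in D + M))


def q1(f):
    return count(f, D) > m


def q2(f):
    return count(f, D + M) > m


def q3(f):
    return count(f, D) > m or count(f, M) > m


sol1 = {f for f in universe if q1(f)}
sol2 = {f for f in universe if q2(f)}
sol3 = {f for f in universe if q3(f)}

# Scenario "q2 answered and stored; q1 asked": since Sol(q1) is a subset
# of Sol(q2), only the stored patterns need to be re-checked against D.
q1_from_stored_q2 = {f for f in sol2 if q1(f)}
print(sorted(sol1), sorted(sol3), sorted(sol2))
print(q1_from_stored_q2 == sol1)
```

In the opposite scenario the stored answer to q1 is a guaranteed part of the answer to q2, so only patterns outside it have to be counted on D ∪ M.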

70
A set and logic oriented view of inductive
databases

Key assumption
Inductive queries are logical expressions over
monotonic and anti-monotonic primitives.
Perspective
Reasoning about query answering and optimisation
(first) elements of a theory given
Border set (Version space) representations useful
Operations on version spaces
A lot of opportunities for further work

71
Ongoing work