Title: A Perspective on Inductive Databases
1A Perspective on Inductive Databases
- Luc De Raedt
- Institut für Informatik
- Albert-Ludwigs Universität Freiburg
- Germany
- deraedt_at_informatik.uni-freiburg.de
2Joint work with Lee Sau Dan, Johannes Fischer,
Christoph Helma, Manfred Jaeger, Stefan Kramer,
Heikki Mannila, and cInQ
3Three Parts
- Introduction to Inductive Databases
- Inductive Database Systems
- MolFea: Mining features in molecules
- (MineSeqLog: Mining logical sequences)
- Inductive databases
- Integrate data mining with databases
- Querying for patterns using constraints
4Inductive databases
- Data mining
- search for interesting and understandable patterns in data
- State of the art in data mining is comparable to databases in the early days
- A theory of data mining is lacking
- View by Imielinski and Mannila (CACM 96)
- Make first class citizens out of patterns
- Query not only the data but also the patterns
- Tightly integrate data mining and databases
5Inductive Querying for Active Mining
- The topic of the European cInQ project
- Consortium on Inductive Querying
- Active Mining has many aspects
- One important aspect is
- interactive mining
- interaction with the user
- user-support for mining
6Inter-active Mining
- The need to actively mine / analyze scientific databases in biology, chemistry
- Understandable patterns needed
- Scientist wants control of mining process
- Constraint based mining
- Constraints specify patterns of interest
- E.g. find all patterns that occur in at least 30% of the actives and at most 3% of the inactives and contain a benzene ring
- Mining becomes a querying process
- "There is no such thing as real discovery, just a matter of the expressive power of the query languages" (Imielinski and Mannila, CACM 96)
7Part II
- Examples of (simple) inductive database systems
- MolFea: the Molecular Feature Miner
- MineSeqLog: Mining logical sequences
8Molecular Feature Mining
- What ?
- Find fragments (substructures) of interest in sets of molecules
- Why ?
- Discover new knowledge
- Use in predictive models
- SAR (Structure Activity Relationship)
9Molecules and Fragments
- 2D-structure
- essentially Graphs
- Fragments
- substructures
- We use linear fragments
- Sequence of atoms and bonds
- Linear fragments
- o, c, cl, n, s,... denote elements
- - ... single bond; further symbols denote double, triple, and aromatic bonds
- (hydrogens implicit)
- Smarts encoding
O-cccc-Cl
10Smiles encoding
- Smiles
- Compact encoding of molecular structure
- Used by computational chemists
- Supported by many tools (e.g. Daylight)
- Very compact !
- Efficient matching
11Smiles encoding
12Constraint-based Data Mining
- What ?
- Use constraints to specify which fragments/patterns are interesting
- E.g. frequency and syntax
- Why ?
- Declarative Querying
- Interactive Process
- Inductive database idea
13Constraints on Fragments
- Possibility to specify constraints on the fragments of interest
- Regarding their generality, regarding their frequency in the data
- Various constraint solvers have been implemented
14Constraint-based data mining
- Generality
- One fragment is more general than another one if it is a substructure of the other one
- Notation: g ≤ s (g is more general than s, i.e. g will match a graph/string whenever s does)
- Graphs: subgraph relationship
- Strings: substring / subsequence relationship
- E.g. aabbcc is more general than ddaabbccee (substring)
- E.g. abc is more general than aabbcc (subsequence)
- (Item)sets: subset relation, e.g. {a,b} ⊆ {a,b,c}
15Search Space for Strings
16Primitives
- Generality in MolFea: symmetry!
- g is equivalent to s (syntactic variants) only when they are a reversal of one another
- E.g. 'C-O-S' and 'S-O-C' denote the same substructure
- g is more general than s if and only if g is a subsequence of s or g is a subsequence of the reversal of s
- E.g. 'Cl-O-S' ≤ 'Cl-O-S-ccc'
- E.g. 'O-Cl' ≤ 'Cl-O-S'
- Frequency of a fragment f on a data set D
- The percentage of data points in D that f occurs in
- E.g. let f be aa and let D = {abaa, acc, caa}; then freq(f,D) = 2/3 ≈ 0.67
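The two primitives above can be sketched in a few lines of Python. This is only an illustration, not MolFea's implementation: the tokenisation (element symbols plus any non-letter character as a bond symbol) and the reading of "subsequence" as a contiguous match are assumptions, and the function names are made up for this sketch.

```python
import re

def tokens(fragment: str) -> tuple:
    # Assumed tokenisation: element symbols such as Cl, O, c, and any
    # non-letter character (e.g. '-') as a bond symbol.
    return tuple(re.findall(r"[A-Z][a-z]?|[a-z]|[^A-Za-z]", fragment))

def occurs_in(g: tuple, s: tuple) -> bool:
    # True if the token sequence g occurs contiguously in s.
    n = len(g)
    return any(s[i:i + n] == g for i in range(len(s) - n + 1))

def more_general(g: str, s: str) -> bool:
    # g <= s: g matches s read in either direction (the symmetry above).
    tg, ts = tokens(g), tokens(s)
    return occurs_in(tg, ts) or occurs_in(tg, ts[::-1])

def freq(f: str, data) -> float:
    # Fraction of data points in which fragment f occurs.
    return sum(more_general(f, d) for d in data) / len(data)

print(more_general("O-Cl", "Cl-O-S"))
print(freq("aa", ["abaa", "acc", "caa"]))
```

Reversing the token sequence rather than the raw string keeps two-letter element symbols such as Cl intact.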
17Primitive Constraints
- f ≤ P, P ≤ f, not (f ≤ P) and not (P ≤ f)
- f ... unknown target fragment, P ... a specific fragment, e.g. abbaa ≤ f
- freq(f, D): relative frequency of a fragment f on a data set D
- freq(f, D1) ≥ t, freq(f, D2) ≤ t
- t ... positive real number between 0 and 1; D1, D2 ... data sets
- e.g. freq(f, Pos) ≥ 0.20
18Example query
- Let E1 = {aabbcc, abbc, bb}
- Let E2 = {abc, bc, cc}
- freq(f,E1) ≥ 2 and freq(f,E2) = 0 and a < f (frequencies here are absolute counts)
- Solutions: abb and abbc
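The example query is small enough to check by brute force. The sketch below assumes plain substring matching (no reversal symmetry), absolute counts for freq, and that the third constraint means f properly contains the fragment a; the helper names are illustrative only.

```python
def substrings(s: str) -> set:
    # All non-empty contiguous substrings of s.
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def count(f: str, data) -> int:
    # Number of strings in data that contain f.
    return sum(f in d for d in data)

E1 = ["aabbcc", "abbc", "bb"]
E2 = ["abc", "bc", "cc"]

# Any solution must occur in E1, so substrings of E1 suffice as candidates.
candidates = set().union(*(substrings(s) for s in E1))

solutions = sorted(
    f for f in candidates
    if count(f, E1) >= 2        # minimum frequency on E1
    and count(f, E2) == 0       # no occurrence in E2
    and "a" in f and f != "a"   # a is strictly more general than f
)
print(solutions)  # ['abb', 'abbc']
```

Enumerating every candidate only works for toy data; the border-based algorithms on the following slides avoid it.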
19Example Queries
- ('N-O' ≤ f) ∧ (freq(f, Act) ≥ 0.1) ∧ (freq(f, Inact) ≤ 0.01)
- not('F' ≤ f) ∧ not('Cl' ≤ f) ∧ not('Br' ≤ f) ∧ not('I' ≤ f) ∧ (freq(f, Act) ≥ 0.05) ∧ (freq(f, Inact) ≤ 0.02)
- Queries are conjunctions of primitive constraints
20Representing Solutions
- Traditional min. frequency constraint
- Let c be freq(f, Act) ≥ x
- c satisfies the anti-monotonicity property
- If we have fragments g ≤ s,
- then if s is a solution, g is a solution as well
- Imposes a lower border S = max(Sol) on the space of solutions
21A String Example
22Another String Example
23Representing Solutions
- Traditional max frequency constraint
- Let c be freq(f, Act) < x
- c satisfies the monotonicity property
- If we have fragments g ≤ s,
- then if g is a solution, s is a solution as well
- Imposes an upper border G = min(Sol) on the space of solutions
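To make the two borders concrete, here is a small sketch that extracts G (the most general solutions) and S (the most specific solutions) from an explicitly given solution set and then tests membership using only the borders. It assumes plain substring generality and a finite, convex solution set; the set {abb, abbc} is just a toy input.

```python
def more_general(g: str, s: str) -> bool:
    # g <= s under the substring ordering (assumed here).
    return g in s

def borders(sol: set):
    # G: solutions with no strictly more general solution.
    G = {f for f in sol
         if not any(h != f and more_general(h, f) for h in sol)}
    # S: solutions with no strictly more specific solution.
    S = {f for f in sol
         if not any(h != f and more_general(f, h) for h in sol)}
    return G, S

def in_version_space(f: str, G: set, S: set) -> bool:
    # f is a solution iff it lies between some g in G and some s in S.
    return (any(more_general(g, f) for g in G)
            and any(more_general(f, s) for s in S))

G, S = borders({"abb", "abbc"})
print(G, S)
```

The borders are usually far smaller than the solution set, which is why they are attractive as a representation.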
24A String Example
- Consider B ≤ f and freq(f,D) ≥ 2 with D
25Constraints
26Mitchell's Version Space
- Consider now a conjunctive query
- We want to compute
27Mitchell's Version Spaces
(Figure: version space diagram ordered by "is more general", with the region of solutions marked)
28Some problems
- There exist conjunctive queries q such that Sol(q) is not boundary set representable; these queries are not safe
- Boundary sets may be infinite
- Or may not be complete
29Computing Borders
- Borders completely characterize the set of solutions for safe queries
- If the solution set is finite, then the query is safe
- Combination of well-known algorithms to compute the borders
- Levelwise algorithm by Agrawal et al., Mannila and Toivonen
- Mitchell's and Mellish's version space algorithms
- In our levelwise version space algorithm
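As a rough illustration of the levelwise part, the sketch below handles a single anti-monotonic constraint (minimum frequency) over strings: it grows frequent substrings one symbol at a time, prunes candidates whose prefix or suffix is infrequent, and reads off the S border at the end. It is written in the spirit of the levelwise (Apriori-style) search and is not the authors' levelwise version space algorithm; substring matching and absolute counts are assumed.

```python
def support(f: str, data) -> int:
    # Number of strings in data that contain f.
    return sum(f in d for d in data)

def levelwise_frequent(data, min_count: int) -> set:
    # Level 1: single symbols that are frequent.
    level = {c for s in data for c in s}
    level = {f for f in level if support(f, data) >= min_count}
    frequent = set()
    while level:
        frequent |= level
        # A string of length k+1 can only be frequent if its prefix and
        # suffix of length k are both frequent (anti-monotonicity).
        candidates = {p + q[-1] for p in level for q in level
                      if p[1:] == q[:-1]}
        level = {c for c in candidates if support(c, data) >= min_count}
    return frequent

def specific_border(frequent: set) -> set:
    # S: frequent strings that are not contained in a longer frequent string.
    return {f for f in frequent
            if not any(f != h and f in h for h in frequent)}

E1 = ["aabbcc", "abbc", "bb"]
frequent = levelwise_frequent(E1, 2)
print(sorted(frequent, key=len), specific_border(frequent))
```

The dual search for a monotonic constraint would expand infrequent strings instead and collect the G border.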
30 Levelwise Version Spaces
(Figure: search under a minimum frequency (or anti-monotonic) constraint, ordered by "is more general")
31Dual computation
(Figure: borders G and S ordered by "is more general"; label: min frequency)
- Swap the roles of frequent and infrequent fragments
- Expand infrequent ones, discard frequent ones during search
32Level Wise Version Space Algorithm
(Figure: borders G and S under a max frequency (or monotonic) constraint, ordered by "is more general")
33The HIV Data Set
- Developmental Therapeutics Program's AIDS Antiviral Screen Database (http://dtp.nci.nih.gov)
- One of the largest public domain databases of this type
- Measures protection of human CEM cells from HIV-1 infection using a soluble formazan assay
- We retained 41768 compounds (after pre-processing the whole data set of 43382)
- 40282 Confirmed Inactive
- 1069 Confirmed Moderately Active
- 417 Confirmed Active
34Experimental Setup
- Discover patterns that are statistically significantly over-represented in the active compounds and under-represented in the inactive ones
- Minimum frequency in actives: 3%, i.e. 13 compounds
- Maximum frequency on inactives computed using χ² (0.999) and size of classes
- For CM: 8; for CI: 516
- Matching Smiles and Smarts using the Daylight tool!
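The thresholds on this slide can be reproduced with a short calculation, under assumptions the slides do not spell out: a Pearson χ² statistic on a 2x2 contingency table without continuity correction, one degree of freedom, the tabulated critical value 10.828 for the 0.999 level, and a fragment that occurs in exactly the minimum number of actives. The function names are illustrative.

```python
import math

CRITICAL = 10.828  # tabulated chi-square value, 1 degree of freedom, 0.999 level

def chi2(a: int, n_a: int, c: int, n_c: int) -> float:
    # Pearson chi-square for the 2x2 table (fragment present/absent x class).
    b, d = n_a - a, n_c - c
    n = n_a + n_c
    return n * (a * d - b * c) ** 2 / (n_a * n_c * (a + c) * (b + d))

def max_count(a: int, n_a: int, n_c: int, critical: float = CRITICAL):
    # Largest count c in the other class for which the fragment is still
    # significantly over-represented in the first class.
    best = None
    for c in range(n_c + 1):
        if c * n_a >= a * n_c:      # no longer over-represented
            break
        if chi2(a, n_a, c, n_c) >= critical:
            best = c
    return best

n_active, n_moderate, n_inactive = 417, 1069, 40282
min_active = math.ceil(0.03 * n_active)
print(min_active)
print(max_count(min_active, n_active, n_moderate))
print(max_count(min_active, n_active, n_inactive))
```

Under these assumptions the calculation yields 13, 8 and 516, matching the figures on the slide.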
35Levelwise Version Space Algorithm
(Figure: borders G and S under a max frequency constraint, ordered by "is more general")
36(No Transcript)
37Discovered Fragments (Actives vs. Moderately Actives)
38Discovered Fragments (Actives vs. Inactives)
39AZT (Azidothymidine)
- The majority of these fragments are derivatives of AZT
- Gives insight into the structural requirements for anti-HIV activity
- A rediscovery that proves the principle
- Post-processing
- Combine fragments?
40Use of Fragments SAR
- Use as fingerprints/descriptors for SAR model building
- Feed data into your favorite data mining/statistical package
- Neural Nets
- Decision Trees
- (Logistic) Regression
- Support Vector Machines
- Bayesian Methods
- Principal Component Analysis
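Using fragments as descriptors amounts to turning each compound into a binary vector with one position per fragment. A minimal sketch, with toy strings standing in for molecules and plain substring matching standing in for Smarts matching (both are assumptions made for illustration):

```python
def fingerprint(molecule: str, fragments) -> list:
    # One bit per fragment: 1 if the fragment occurs in the molecule.
    return [int(f in molecule) for f in fragments]

fragments = ["abb", "bc"]                     # hypothetical mined fragments
molecules = ["aabbcc", "abbc", "bb", "abc"]   # toy stand-ins for compounds

# Rows are compounds, columns are fragments; this matrix is what would be
# handed to a learner such as an SVM or a decision tree.
X = [fingerprint(m, fragments) for m in molecules]
for m, row in zip(molecules, X):
    print(m, row)
```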
41Use of Fragments SAR
- Several experiments reported on problems from predictive toxicology, cf. Kramer and De Raedt, ICML 01
- Best results in combination with SVMs
- 2-year rodent carcinogenicity assay (NTP): 70%, 500 compounds
- Mutagenicity (Ames test): 80%, 800 compounds
- Method has proven its use in several benchmark problems
42Ongoing Work: MolFea
- Work with branched fragments instead of linear sequences
- conceptually easy, computationally more expensive
- Use abstractions, e.g. H-bond donor/acceptor, lipophilic center
- Deriving 3D fragments
- Annotate fragments with 3D information
- Initial implementation works
- Goal: mining for pharmacophores
- Integrate MolFea in existing chemical databases with GUI for interactive exploration
- Various activities on the solver side
43Mining SeqLog
- SeqLog: a simple, vanilla, Datalog-like language for structured sequences
- MineSeqLog supports the same primitives as MolFea
44Principles of SeqLog
45SeqLog for Mining
- It is possible to define a notion of
- Substring
- Subsequence
- Resolution and Fixpoint
- For SeqLog Programs
46MineSeqLog
- Apply the idea of inductive databases to SeqLog, i.e. use constraints such as
- Minimum and maximum frequency
- Generality
- Related to subsequence / substring matching
- Background knowledge
- To specify patterns (sequences) of interest
47Other idea
- Use SeqLog as a dedicated representation language for data mining (à la Inductive Logic Programming)
- Many interesting open (?) questions
- Distance based learning methods
- Require distance measure on sequences
- Operations on sequences required to do learning
- STRONG relation to corresponding operations on strings! E.g. maximal or longest common subsequence/substrings
- Hidden Markov Models (Kersting et al., PSB 03)
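For flat strings, the longest common substring mentioned above has a standard dynamic-programming solution, sketched here. It covers ordinary strings only; a SeqLog counterpart would have to compare logical atoms up to substitution, which this sketch does not attempt.

```python
def longest_common_substring(x: str, y: str) -> str:
    # prev[j] holds the length of the longest common suffix of
    # x[:i-1] and y[:j]; cur[j] the same for x[:i] and y[:j].
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]

print(longest_common_substring("aabbcc", "abbc"))
```

A distance between two sequences could then be derived from the length of the shared part, for instance by comparing it with the lengths of the inputs.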
48Part III
- Inductive database principles
49An analogy with databases
- Why is the relational model so successful?
- A general purpose query language with nice properties
- simple theoretical foundations
- declarative semantics
- closure principle
- The same is needed for KDD applications
- The ultimate goal of IDBs is to find the equivalent of Codd's relational database model for use in data mining
50Inductive database principles
- What is an inductive database ?
- A set of data sets
- A set of pattern sets
- IDB languages
- A query language that generates data sets
- An inductive query language that generates pattern sets
- Closure principle!
- A set and logic oriented view
- Not a universal framework, though quite general
51Boolean Inductive Queries
52Manipulation
- create data set D as query
- create view data set D as query
- create pattern set P as query
- create pattern view P as query
- Insert / Delete / Update statements
53Illustration
- create data set D4 as {aa, ab, bb}
- create pattern view P2 as freq(f,D4) ≥ 2
- At this point P2 = {a, b}
- update data set D4 insert abc
- Update P2 too: P2 = P2 ∪ {ab}
- Incremental data mining!
- Insert ab into pattern view P2
- Pattern view update problem
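The illustration can be replayed in Python. The sketch assumes substring patterns and absolute counts, and the function names are hypothetical; it shows both a full recomputation of the pattern view and an incremental update that only examines substrings of the inserted string, which suffices for a minimum-frequency view because an insertion can only raise counts.

```python
def substrings(s: str) -> set:
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def count(f: str, data) -> int:
    return sum(f in d for d in data)

def pattern_view(data, min_count: int) -> set:
    # Full evaluation of freq(f, data) >= min_count.
    candidates = set().union(*(substrings(s) for s in data))
    return {f for f in candidates if count(f, data) >= min_count}

def insert(data, view: set, new: str, min_count: int) -> set:
    # Incremental update: only substrings of the new string can enter the view.
    data.append(new)
    return view | {f for f in substrings(new) if count(f, data) >= min_count}

D4 = ["aa", "ab", "bb"]
P2 = pattern_view(D4, 2)
print(sorted(P2))
P2_updated = insert(D4, P2, "abc", 2)
print(sorted(P2_updated))
```

Deletions, or views with a maximum-frequency constraint, would need a different update rule, since patterns can then leave the view.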
54Query evaluation
- How to evaluate boolean inductive queries ?
- Observe
- MolFea: conjunction of anti-monotonic and monotonic constraints
- can be answered using the levelwise version space algorithm
- solutions form a version space, can be represented by border sets
55Query Evaluation
56Query Optimisation
57Properties of inductive queries
58How many version spaces do we need ?
59Operations on solution spaces
- Logical operations on primitives have a set oriented counterpart
- An analogy with relational algebra
- A query consists of relational operations
- Operations can be used to optimize the query answering process
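The correspondence between logical and set operations can be checked on a toy example. The sketch below works over an explicit, finite universe of candidate strings, which is an assumption made purely so that the solution sets can be enumerated; the constraints are borrowed from the earlier string example.

```python
def substrings(s: str) -> set:
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def count(f: str, data) -> int:
    return sum(f in d for d in data)

E1 = ["aabbcc", "abbc", "bb"]
E2 = ["abc", "bc", "cc"]
universe = set().union(*(substrings(s) for s in E1 + E2))

def sol(constraint) -> set:
    # Solution set of a constraint over the finite universe.
    return {f for f in universe if constraint(f)}

def min_freq(f): return count(f, E1) >= 2   # anti-monotonic
def max_freq(f): return count(f, E2) == 0   # monotonic
def has_a(f): return "a" in f               # monotonic

conjunction = sol(lambda f: min_freq(f) and max_freq(f) and has_a(f))
disjunction = sol(lambda f: min_freq(f) or max_freq(f))
negation = sol(lambda f: not min_freq(f))

print(conjunction == sol(min_freq) & sol(max_freq) & sol(has_a))
print(disjunction == sol(min_freq) | sol(max_freq))
print(negation == universe - sol(min_freq))
```

An optimizer can exploit this by evaluating each primitive once and combining stored solution sets, or their borders, instead of re-scanning the data.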
60Operations on solution spaces
- Two approaches
- Develop data structures that support operations
- Develop operations that work on border sets
- Employ operations by Haym Hirsh, Gunter et al. for our purposes
- Combine the two approaches
- Version Space Trees
61Version Space Intersections
62Version space intersections
63Version space union
64Union on Borders
65(No Transcript)
66Version space tree
67Version space trees
- Interesting properties
- Membership testing very efficient
- Size of VSTree at most
- Easy to go from VSTree to G and S, and vice versa
- Can be constructed in two phases
- Descend (Apriori-tries), Ascend
- Combines advantages of suffix trees with version spaces
- Operations on version space trees: for now finite trees only
68Reasoning
69Memory organisation
- Consider
- q1: freq(f,D) > m
- q2: freq(f,D ∪ M) > m (q1 ⊆ q2)
- q3: freq(f,D) > m OR freq(f,M) > m (q3 ⊆ q2)
- Scenarios
- q1 answered and stored, q2 asked
- q2 answered and stored, q1 asked
- Keep track of subset relations among pattern sets / data sets
- Keep track of relations among patterns (generality structure) within given pattern set
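Both scenarios can be sketched as follows, assuming absolute counts (with relative frequencies the inclusion between q1 and q2 need not hold), substring patterns, and D ∪ M modelled as list concatenation. The data and names are illustrative.

```python
def substrings(s: str) -> set:
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def count(f: str, data) -> int:
    return sum(f in d for d in data)

D = ["aabbcc", "abbc", "bb"]
M = ["abc", "bc", "cc"]
m = 1
universe = set().union(*(substrings(s) for s in D + M))

# q1 answered and stored.
sol_q1 = {f for f in universe if count(f, D) > m}

# Scenario 1: q2 asked. Every stored solution of q1 also solves q2,
# so only the remaining candidates need a frequency check.
sol_q2 = sol_q1 | {f for f in universe - sol_q1 if count(f, D + M) > m}

# Scenario 2: q2 stored, q1 asked. Solutions of q1 can only be found
# among the stored solutions of q2.
sol_q1_from_q2 = {f for f in sol_q2 if count(f, D) > m}

# q3 is handled like q1: its solutions also lie inside those of q2.
sol_q3 = sol_q1 | {f for f in universe if count(f, M) > m}

print(sol_q1 <= sol_q2, sol_q3 <= sol_q2, sol_q1_from_q2 == sol_q1)
```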
70A set and logic oriented view of inductive
databases
- Key assumption
- Inductive queries are logical expressions over monotonic and anti-monotonic primitives
- Perspective
- Reasoning about query answering and optimisation
- (first) elements of a theory given
- Border set (Version space) representations useful
- Operations on version spaces
- A lot of opportunities for further work
71Ongoing work
- String version space data structure
- Operations on string version spaces
- Efficient computation of string version space
- Elaborate theory and implementation
72Where to go from here ?
- Other forms of primitives ?
- E.g. accuracy of rule / hypotheses is larger than x
- E.g. average cost of transaction is larger than x
- Neither monotone nor anti-monotone
- Optimization primitives ?
- Find item sets with maximum frequency
- Find rule with maximum accuracy
73Where to go from here ?
- Other forms of tasks ?
- Clustering (some initial works exist)
- Formulate constraints on no. of desired clusters, and cluster membership
- Prediction
- Some approaches to decision tree learning exist
- Other forms of algorithms ?
- Instead of all solutions, find best or plausible solutions
- Approximation/heuristic algorithms
- Cf. constraint programming
74Conclusions
- Inductive databases and constraint based mining
- MolFea
- MineSeqLog
- Solving inductive queries
- Very general framework for query formulation
- Problems of query evaluation and optimisation raised
- Many remaining open problems and opportunities for research
75Thanks