A Perspective on Inductive Databases

1
A Perspective on Inductive Databases
  • Luc De Raedt
  • Institut für Informatik
  • Albert-Ludwigs Universität Freiburg
  • Germany
  • deraedt_at_informatik.uni-freiburg.de

2
Joint work with Lee Sau Dan, Johannes Fischer,
Christoph Helma, Manfred Jaeger, Stefan Kramer,
Heikki Mannila and cInQ
3
Three Parts
  • Introduction to Inductive Databases
  • Inductive Database Systems
  • MolFea: Mining features in molecules
  • (MineSeqLog: Mining logical sequences)
  • Inductive databases
  • Integrate data mining with databases
  • Querying for patterns using constraints

4
Inductive databases
  • Data mining
  • search for interesting and understandable
    patterns in data
  • State of the art in data mining resembles
    databases in the early days
  • A theory of data mining is lacking
  • View by Imielinski and Mannila (CACM 96)
  • Make first class citizens out of patterns
  • Query not only the data but also the patterns
  • Tightly integrate data mining and databases

5
Inductive Querying for Active Mining
  • The topic of the European cInQ project
  • Consortium on Inductive Querying
  • Active Mining has many aspects
  • One important aspect is
  • interactive mining
  • interaction with the user
  • user-support for mining

6
Inter-active Mining
  • The need to actively mine / analyze scientific
    databases in biology, chemistry
  • Understandable patterns needed
  • Scientist wants control of mining process
  • Constraint based mining
  • Constraints specify patterns of interest
  • E.g. find all patterns that occur in at least
    30% of the actives and at most 3% of the
    inactives and contain a benzene ring
  • Mining becomes a querying process
  • "There is no such thing as real discovery, just a
    matter of the expressive power of the query
    languages" (Imielinski and Mannila, CACM 96)

7
Part II
  • Examples of (simple) inductive database systems
  • MolFea: the Molecular Feature Miner
  • MineSeqLog: Mining logical sequences

8
Molecular Feature Mining
  • What ?
  • Find fragments (substructures) of interest in
    sets of molecules
  • Why ?
  • Discover new knowledge
  • Use in predictive models
  • SAR (Structure Activity Relationship)

9
Molecules and Fragments
  • 2D-structure
  • essentially Graphs
  • Fragments
  • substructures
  • We use linear fragments
  • Sequence of atoms and bonds
  • Linear fragments
  • o, c, cl, n, s,... denote elements
  • "-" denotes a single bond; further symbols
    denote double, triple and aromatic bonds
  • (hydrogens implicit)
  • Smarts encoding

O-cccc-Cl
10
Smiles encoding
  • Smiles
  • Compact encoding of molecular structure
  • Used by computational chemists
  • Supported by many tools (e.g. Daylight)
  • Very compact !
  • Efficient matching

11
Smiles encoding
12
Constraint-based Data Mining
  • What ?
  • Use constraints to specify which
    fragments/patterns are interesting
  • E.g. Frequency and syntax
  • Why ?
  • Declarative Querying
  • Interactive Process
  • Inductive database idea

13
Constraints on Fragments
  • Possibility to specify constraints on the
    fragments of interest.
  • Regarding their generality, regarding their
    frequency in the data
  • Various constraint solvers have been implemented

14
Constraint-based data mining
  • Generality
  • One fragment is more general than another one if
    it is a substructure of the other one
  • Notation: g ≤ s (g is more general than s, i.e.
    g will match a graph/string whenever s does)
  • Graphs: subgraph relationship
  • Strings: substring / subsequence relationship
  • E.g. aabbcc is more general than ddaabbccee
    (substring)
  • E.g. abc is more general than aabbcc
    (subsequence)
  • (Item)sets: subset relation, e.g. {a,b} ⊆
    {a,b,c}
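
The three generality relations listed above can be sketched in a few lines of Python. This is only an illustration; the function names are hypothetical and not taken from the presentation.

```python
def is_substring(g: str, s: str) -> bool:
    """g is more general than s under the substring relation."""
    return g in s


def is_subsequence(g: str, s: str) -> bool:
    """g is more general than s under the subsequence relation."""
    remaining = iter(s)
    # 'ch in remaining' consumes the iterator, so symbols must appear in order
    return all(ch in remaining for ch in g)


def is_subset(g: set, s: set) -> bool:
    """g is more general than s under the subset relation on item sets."""
    return g <= s


print(is_substring("aabbcc", "ddaabbccee"))       # True
print(is_subsequence("abc", "aabbcc"))            # True
print(is_substring("abc", "aabbcc"))              # False
print(is_subset({"a", "b"}, {"a", "b", "c"}))     # True
```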

15
Search Space for Strings
16
Primitives
  • Generality in MolFea: symmetry!
  • g is equivalent to s (syntactic variants) only
    when they are a reversal of one another
  • E.g. 'C-O-S' and 'S-O-C' denote the same
    substructure
  • g is more general than s if and only if g is a
    subsequence of s or g is a subsequence of the
    reversal of s
  • E.g. 'Cl-O-S' ≤ 'Cl-O-S-ccc'
  • E.g. 'O-Cl' ≤ 'Cl-O-S'
  • Frequency of a fragment f on a data set D:
  • The percentage of data points in D that f occurs
    in
  • E.g. let f be aa and let D = {abaa, acc, caa};
    then freq(f,D) = 2/3 ≈ 0.66
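
A minimal sketch of these two primitives follows. It assumes that a fragment is tokenised into atoms and bond symbols (with Cl and Br as two-letter atoms) and that matching is done by plain subsequence tests; the real MolFea system matches fragments against molecules with chemistry tooling, so the tokeniser and function names here are illustrative assumptions only.

```python
import re

# Assumed tokenisation: two-letter halogens first, then single letters,
# then any other non-space symbol (bond characters such as '-').
TOKEN = re.compile(r"Cl|Br|[A-Za-z]|[^A-Za-z\s]")


def tokens(fragment: str) -> list:
    return TOKEN.findall(fragment)


def is_subsequence(g, s) -> bool:
    remaining = iter(s)
    return all(t in remaining for t in g)


def more_general(g: str, s: str) -> bool:
    """g <= s: g is a subsequence of s or of the reversal of s."""
    gt, st = tokens(g), tokens(s)
    return is_subsequence(gt, st) or is_subsequence(gt, st[::-1])


def equivalent(g: str, s: str) -> bool:
    """Syntactic variants: identical or reversals of one another."""
    return tokens(g) == tokens(s) or tokens(g) == tokens(s)[::-1]


def freq(f: str, data) -> float:
    """Relative frequency: fraction of data points that f occurs in."""
    return sum(more_general(f, d) for d in data) / len(data)


print(more_general("Cl-O-S", "Cl-O-S-ccc"))   # True
print(more_general("O-Cl", "Cl-O-S"))         # True (via the reversal)
print(equivalent("C-O-S", "S-O-C"))           # True
print(freq("aa", ["abaa", "acc", "caa"]))     # 0.666...
```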

17
Primitive Constraints
  • f ≤ P, P ≤ f, not (f ≤ P) and not (P ≤ f), where
    f is the unknown target fragment and P a
    specific fragment, e.g. abbaa ≤ f
  • freq(f, D): relative frequency of a fragment f
    on a data set D
  • freq(f, D1) ≥ t, freq(f, D2) ≤ t, where t is a
    positive real number between 0 and 1 and D1, D2
    are data sets, e.g. freq(f, Pos) ≥ 0.20

18
Example query
  • Let E1 = {aabbcc, abbc, bb}
  • Let E2 = {abc, bc, cc}
  • freq(f,E1) ≥ 2 and freq(f,E2) = 0 and a < f
  • Solutions: abb and abbc
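
The example can be checked by brute force. The sketch below reads the query as "f occurs in at least 2 strings of E1, in none of E2, and contains a", with frequency taken as an absolute count and matching as substring containment; those readings are assumptions about the slide's notation.

```python
def count(f: str, data) -> int:
    """Number of strings in data that contain f as a substring."""
    return sum(f in d for d in data)


def substrings(s: str) -> set:
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


E1 = ["aabbcc", "abbc", "bb"]
E2 = ["abc", "bc", "cc"]

# Any solution must occur in E1, so substrings of E1 are the only candidates.
candidates = set().union(*(substrings(d) for d in E1))

solutions = sorted(
    f for f in candidates
    if count(f, E1) >= 2 and count(f, E2) == 0 and "a" in f
)
print(solutions)  # ['abb', 'abbc']
```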

19
Example Queries
  • ('N-O' ≤ f) ∧ (freq(f, Act) ≥ 0.1) ∧ (freq(f,
    Inact) ≤ 0.01)
  • not('F' ≤ f) ∧ not('Cl' ≤ f) ∧ not('Br' ≤ f) ∧
    not('I' ≤ f) ∧ (freq(f, Act) ≥ 0.05) ∧
    (freq(f, Inact) ≤ 0.02)
  • Queries are conjunctions of primitive constraints

20
Representing Solutions
  • Traditional min. frequency constraint
  • Let c be freq(f, Act) ≥ x
  • c satisfies the anti-monotonicity property:
  • If we have fragments g ≤ s,
  • then if s is a solution, g is a solution as
    well
  • Imposes a lower border S = max(Sol) on the space
    of solutions

21
A String Example
  • freq(f,D) ≥ 2 where D

22
Another String Example
  • Let f ≤ ABD

23
Representing Solutions
  • Traditional max. frequency constraint
  • Let c be freq(f, Act) < x
  • c satisfies the monotonicity property:
  • If we have fragments g ≤ s,
  • then if g is a solution, s is a solution as
    well
  • Imposes an upper border G = min(Sol) on the
    space of solutions
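
To make the two borders concrete, the sketch below computes the most specific solutions (S) and the most general solutions (G) of a finite solution set under substring generality. The two example sets are illustrative assumptions: the first is the set of substrings occurring in at least two of aabbcc, abbc, bb, the second is a two-element solution set.

```python
def more_general(g: str, s: str) -> bool:
    """g <= s under the substring relation."""
    return g in s


def s_border(sol: set) -> set:
    """Maximally specific solutions: no strictly more specific solution exists."""
    return {f for f in sol if not any(f != h and more_general(f, h) for h in sol)}


def g_border(sol: set) -> set:
    """Maximally general solutions: no strictly more general solution exists."""
    return {f for f in sol if not any(f != h and more_general(h, f) for h in sol)}


frequent = {"a", "b", "c", "ab", "bb", "bc", "abb", "bbc", "abbc"}
print(sorted(s_border(frequent)))  # ['abbc']
print(sorted(g_border(frequent)))  # ['a', 'b', 'c']

sol = {"abb", "abbc"}
print(sorted(g_border(sol)), sorted(s_border(sol)))  # ['abb'] ['abbc']
```

For an anti-monotonic constraint such as minimum frequency, S alone is enough to reconstruct the solutions: every fragment more general than some element of S is a solution.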

24
A String Example
  • Consider B ? f and freq(f,D) ? 2 with D

25
Constraints
  • Anti-monotonic
  • In ML
  • Monotonic
  • In ML

26
Mitchell's Version Space
  • Consider now a conjunctive query
  • We want to compute

27
Mitchell's Version Spaces
Is more general
Solutions
28
Some problems
  • There exist conjunctive queries q such that
    Sol(q) is not boundary-set representable; these
    queries are not safe
  • Boundary sets may be infinite
  • Or may not be complete

29
Computing Borders
  • Borders completely characterize the set of
    solutions for safe queries
  • If solution set is finite, then query is safe
  • Combination of well-known algorithms to compute
    border wrt
  • Level wise algorithm by Agrawal et al., Mannila
    and Toivonen
  • Mitchell's and Mellish's version space algorithms
  • In our level wise version space algorithm
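
As a rough illustration of the level wise idea for a minimum frequency constraint, here is an Apriori-style sketch over strings with substring generality and absolute counts. It is not the authors' level wise version space algorithm; the candidate-generation rule and the names are assumptions made for the example.

```python
def count(f: str, data) -> int:
    return sum(f in d for d in data)


def levelwise_frequent(data, min_count: int) -> list:
    """Return all substrings occurring in at least min_count strings,
    found level by level (level k holds the frequent strings of length k)."""
    alphabet = sorted(set("".join(data)))
    level = [a for a in alphabet if count(a, data) >= min_count]
    frequent = []
    while level:
        frequent.extend(level)
        known = set(level)
        # Extend each frequent string by one symbol. The prefix is frequent
        # by construction; prune unless the suffix is frequent as well
        # (anti-monotonicity of the minimum frequency constraint).
        candidates = {p + a for p in level for a in alphabet
                      if (p + a)[1:] in known}
        level = sorted(c for c in candidates if count(c, data) >= min_count)
    return frequent


print(levelwise_frequent(["aabbcc", "abbc", "bb"], 2))
# ['a', 'b', 'c', 'ab', 'bb', 'bc', 'abb', 'bbc', 'abbc']
```

The last non-empty level and any frequent strings that cannot be extended give the S border; here that is abbc.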

30

Levelwise Version Spaces
Minimum frequency (or anti-monotonic constraint)
Is more general
31
Dual computation
  • Swap the roles of frequent and infrequent
    fragments
  • Expand infrequent ones, discard frequent ones
    during search
  • Figure labels: min frequency, G, S, "is more
    general"
32
Level Wise Version Space Algorithm
max frequency (or monotonic constraint)
G
Is more general
G
S
33
The HIV Data Set
  • Developmental Therapeutics Program's AIDS
    Antiviral Screen Database
    (http://dtp.nci.nih.gov)
  • One of the largest public domain databases of
    this type
  • Measures protection of human CEM cells from HIV-1
    infection using a soluble formazan assay
  • We retained 41768 compounds (after pre-processing
    the whole data set of 43382 ones)
  • 40282 Confirmed Inactive
  • 1069 Confirmed Moderately Active
  • 417 Confirmed Active

34
Experimental Setup
  • Discover patterns that are statistically
    significantly over-represented in the active
    compounds and under-represented in the inactive
    ones
  • Minimum frequency in actives: 3%, i.e. 13
    compounds
  • Maximum frequency on inactives computed using χ²
    (0.999) and size of classes
  • For CM: 8; for CI: 516
  • Matching Smiles and Smarts using Daylight Tool !
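
The slide does not spell out how the thresholds were derived. The sketch below shows one reading that appears consistent with the reported numbers: a Pearson χ² test on a 2×2 table (fragment present/absent versus class), without continuity correction, with the standard critical value 10.828 for one degree of freedom at the 0.999 level. The function names and the choice of test variant are assumptions.

```python
import math

CHI2_CRIT_0999 = 10.828  # chi-square critical value, 1 d.o.f., 0.999 level


def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def max_other_count(min_active: int, n_active: int, n_other: int,
                    crit: float = CHI2_CRIT_0999) -> int:
    """Largest count x in the other class for which a fragment occurring in
    min_active actives is still significantly over-represented in actives."""
    best = None
    for x in range(n_other + 1):
        over_represented = min_active * n_other > x * n_active
        stat = chi2_2x2(min_active, n_active - min_active, x, n_other - x)
        if not over_represented or stat < crit:
            break
        best = x
    return best


min_active = math.ceil(0.03 * 417)             # 3% of the 417 actives
print(min_active)                              # 13
print(max_other_count(min_active, 417, 1069))  # 8   (moderately active, CM)
print(max_other_count(min_active, 417, 40282)) # 516 (inactive, CI)
```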

35
Levelwise Version Space Algorithm
max frequency
G
Is more general
G
S
36
(No Transcript)
37
Discovered Fragments (Actives vs. Moderately
Actives)
38
Discovered Fragments (Actives vs. Inactives)
39
AZT (Azidothymidine)
  • The majority of these fragments are derivatives
    of AZT.
  • Gives insight into the structural requirements
    for anti-HIV activity.
  • A rediscovery that proves the principle
  • Post-processing
  • Combine fragments ?

40
Use of Fragments: SAR
  • Use as fingerprints/descriptors for SAR model
    building
  • Feed data into your favorite data
    mining/statistical package
  • Neural Nets
  • Decision Trees
  • (Logistic) Regression
  • Support Vector Machines
  • Bayesian Methods
  • Principal Component Analysis

41
Use of Fragments: SAR
  • Several experiments reported on problems from
    predictive toxicology, cf. Kramer and De Raedt,
    ICML 01
  • Best results in combination with SVMs
  • 2-year rodent carcinogenicity assay (NTP): 70%,
    500 compounds
  • Mutagenicity (Ames Test): 80%, 800 compounds
  • Method has proven its use in several benchmark
    problems

42
Ongoing Work: MolFea
  • Work with branched fragments instead of linear
    sequences
  • conceptually easy, computationally more expensive
  • Use abstractions, e.g. H-bond donor/acceptor,
    lipophilic center
  • Deriving 3D fragments
  • Annotate fragments with 3D information
  • Initial implementation works
  • Goal: mining for pharmacophores
  • Integrate MolFea in existing chemical databases
    with GUI for interactive exploration
  • Various activities on the solver side

43
Mining SeqLog
  • SeqLog: a simple, vanilla Datalog-like language
    for structured sequences
  • MineSeqLog supports same primitives as MolFea

44
Principles of SeqLog
45
SeqLog for Mining
  • It is possible to define a notion of
  • Substring
  • Subsequence
  • Resolution and Fixpoint
  • For SeqLog Programs

46
MineSeqLog
  • Apply the idea of inductive databases to SeqLog,
    i.e. use constraints such as
  • Minimum and maximum frequency
  • Generality
  • Related to subsequence / substring matching
  • Background knowledge
  • To specify patterns (sequences) of interest

47
Other idea
  • Use SeqLog as a dedicated representation language
    for data mining (a la Inductive Logic
    Programming)
  • Many interesting open (?) questions
  • Distance based learning methods
  • Require distance measure on sequences
  • Operations on sequences required to do learning
  • STRONG relation to corresponding operations on
    strings ! E.g. maximal or longest common
    subsequence/substrings
  • Hidden Markov Models (Kersting et al., PSB 03)

48
Part III
  • Inductive database principles

49
An analogy with databases
  • Why is the relational model so successful?
  • A general purpose query language with nice
    properties
  • simple theoretical foundations
  • declarative semantics
  • closure principle
  • The same is needed for KDD applications
  • The ultimate goal of IDBs is to find the
    equivalent of Codd's relational database model
    for use in data mining

50
Inductive database principles
  • What is an inductive database ?
  • A set of data sets
  • A set of pattern sets
  • IDB languages
  • A query language that generates data sets
  • An inductive query language that generates
    pattern sets
  • Closure principle !
  • A set and logic oriented view
  • Not a universal framework, though quite general

51
Boolean Inductive Queries
52
Manipulation
  • create data set D as query
  • create view data set D as query
  • create pattern set P as query
  • create pattern view P as query
  • Insert / Delete / Update statements

53
Illustration
  • create data set D4 as {aa, ab, bb}
  • create pattern view P2 as freq(f,D4) ≥ 2
  • At this point P2 = {a, b}
  • update data set D4 insert abc
  • Update P2 too: P2 = P2 ∪ {ab}
  • Incremental data mining !
  • Insert ab into pattern view P2
  • Pattern view update problem
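
The illustration can be replayed with a small sketch of an incrementally maintained pattern view. It assumes absolute counts, substring matching and a minimum frequency of 2, so an insertion can only add patterns and only substrings of the inserted string need re-checking; the class and method names are hypothetical.

```python
def substrings(s: str) -> set:
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


class PatternView:
    """Pattern set defined by a minimum count over a data set, kept up to
    date when strings are inserted into the data set."""

    def __init__(self, data, min_count: int):
        self.data = list(data)
        self.min_count = min_count
        self.patterns = {f for d in self.data for f in substrings(d)
                         if self._count(f) >= min_count}

    def _count(self, f: str) -> int:
        return sum(f in d for d in self.data)

    def insert(self, s: str) -> set:
        """Insert s into the data set and return the newly added patterns."""
        self.data.append(s)
        # Only substrings of s can have gained support.
        added = {f for f in substrings(s)
                 if f not in self.patterns and self._count(f) >= self.min_count}
        self.patterns |= added
        return added


p2 = PatternView(["aa", "ab", "bb"], min_count=2)
print(sorted(p2.patterns))        # ['a', 'b']
print(sorted(p2.insert("abc")))   # ['ab']
print(sorted(p2.patterns))        # ['a', 'ab', 'b']
```

Deletions and maximum frequency constraints are harder, since patterns may then have to be removed from the view.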

54
Query evaluation
  • How to evaluate boolean inductive queries ?
  • Observe
  • MolFea: a conjunction of anti-monotonic and
    monotonic constraints
  • can be answered using the level wise version
    space algorithm
  • solutions form a version space, which can be
    represented by border sets.
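
Once the borders are known, membership in the solution set can be tested without going back to the data. A minimal sketch, assuming substring generality and hypothetical border sets G and S:

```python
def more_general(g: str, s: str) -> bool:
    """g <= s under the substring relation."""
    return g in s


def in_version_space(f: str, G: set, S: set) -> bool:
    """f is a solution iff it lies between the borders: some element of G
    is more general than f and f is more general than some element of S."""
    return (any(more_general(g, f) for g in G)
            and any(more_general(f, s) for s in S))


G = {"abb"}    # most general solutions (assumed)
S = {"abbc"}   # most specific solutions (assumed)

print(in_version_space("abb", G, S))    # True
print(in_version_space("abbc", G, S))   # True
print(in_version_space("bb", G, S))     # False: more general than all of G
print(in_version_space("abbcc", G, S))  # False: more specific than all of S
```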

55
Query Evaluation
56
Query Optimisation
57
Properties of inductive queries
58
How many version spaces do we need ?
59
Operations on solution spaces
  • Logical operations on primitives have a set
    oriented counter part
  • An analogy with relational algebra
  • A query consists of relational operations
  • Operations can be used to optimize the query
    answering process.

60
Operations on solution spaces
  • Two approaches
  • Develop data structures that support operations
  • Develop operations that work on border sets
  • Employ operations by Haym Hirsh, Gunter et al.
    for our purposes
  • Combine the two approaches
  • Version Space Trees
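
As a sketch of an operation that works directly on border sets, the code below intersects two version spaces over item sets, where generality is the subset relation. Item sets are used because unions and intersections make the borders easy to combine; this is an illustrative assumption, not the string or version space tree operations of the presentation.

```python
from itertools import product


def minimal(sets: set) -> set:
    return {x for x in sets if not any(y < x for y in sets)}


def maximal(sets: set) -> set:
    return {x for x in sets if not any(x < y for y in sets)}


def intersect(vs1, vs2):
    """Each version space is a pair (G, S) of sets of frozensets and contains
    every f with g <= f <= s for some g in G and s in S."""
    (G1, S1), (G2, S2) = vs1, vs2
    lows = {g1 | g2 for g1, g2 in product(G1, G2)}    # must contain both g's
    highs = {s1 & s2 for s1, s2 in product(S1, S2)}   # must fit in both s's
    G = minimal({g for g in lows if any(g <= s for s in highs)})
    S = maximal({s for s in highs if any(g <= s for g in lows)})
    return G, S


fs = frozenset
vs1 = ({fs("a")}, {fs("abc")})   # {a}, {a,b}, {a,c}, {a,b,c}
vs2 = ({fs("b")}, {fs("abd")})   # {b}, {a,b}, {b,d}, {a,b,d}

G, S = intersect(vs1, vs2)
print(G == {fs("ab")}, S == {fs("ab")})            # True True
print(intersect(vs1, ({fs("d")}, {fs("d")})))      # (set(), set())
```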

61
Version Space Intersections
62
Version space intersections
63
Version space union
64
Union on Borders
65
(No Transcript)
66
Version space tree
67
Version space trees
  • Interesting properties
  • Membership testing very efficient
  • Size of VSTree at most
  • Easy to go from VSTree to G and S, and vice versa
  • Can be constructed in two phases
  • Descend (Apriori-tries), Ascend
  • Combines advantages of suffix trees with version
    spaces
  • Operations on version space trees: for now
    finite trees only.

68
Reasoning
69
Memory organisation
  • Consider
  • q1: freq(f,D) > m
  • q2: freq(f,D ∪ M) > m (q1 ⊆ q2)
  • q3: freq(f,D) > m OR freq(f,M) > m (q3 ⊆ q2)
  • Scenarios
  • q1 answered and stored q2 asked
  • q2 answered and stored q1 asked
  • Keep track of subset relations among pattern sets
    / data sets
  • Keep track of relations among patterns
    (generality structure) within given pattern set
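
The relations between q1, q2 and q3 can be checked on toy data, and they show how a stored answer can be reused. The sketch assumes absolute counts, substring matching and D ∪ M as a concatenation of the two data sets; with relative frequencies the containments need not hold. The data and names are illustrative.

```python
def count(f: str, data) -> int:
    return sum(f in d for d in data)


def substrings(s: str) -> set:
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


D = ["aabbcc", "abbc", "bb"]
M = ["abc", "bc", "cc"]
m = 1
universe = set().union(*(substrings(d) for d in D + M))

q1 = {f for f in universe if count(f, D) > m}
q2 = {f for f in universe if count(f, D + M) > m}
q3 = {f for f in universe if count(f, D) > m or count(f, M) > m}

print(q1 <= q2, q3 <= q2)   # True True

# Scenario "q2 answered and stored, q1 asked": every solution of q1 is
# already in the stored answer, so only those patterns are re-checked on D.
q1_from_q2 = {f for f in q2 if count(f, D) > m}
print(q1_from_q2 == q1)     # True
```

In the opposite scenario (q1 stored, q2 asked) the stored patterns are known solutions of q2, and only the remaining candidates have to be counted.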

70
A set and logic oriented view of inductive
databases
  • Key assumption
  • Inductive queries are logical expressions over
    monotonic and anti-monotonic primitives.
  • Perspective
  • Reasoning about query answering and optimisation
  • (first) elements of a theory given
  • Border set (Version space) representations useful
  • Operations on version spaces
  • A lot of opportunities for further work

71
Ongoing work
  • String version space data structure
  • Operations on string version spaces
  • Efficient computation of string version space
  • Elaborate theory and implementation

72
Where to go from here ?
  • Other forms of primitives ?
  • E.g. accuracy of rule / hypotheses is larger than
    x
  • E.g. average cost of transaction is larger than x
  • Neither monotone nor anti-monotone
  • Optimization primitives ?
  • Find item sets with maximum frequency
  • Find rule with maximum accuracy

73
Where to go from here ?
  • Other forms of tasks ?
  • Clustering (some initial works exist)
  • Formulate constraints on no. of desired clusters,
    and cluster membership
  • Prediction
  • Some approaches to decision tree learning exist
  • Other forms of algorithms ?
  • Instead of all solutions find best or
    plausible solutions
  • Approximation/heuristic algorithms
  • Cf. constraint programming

74
Conclusions
  • Inductive databases and constraint based mining
  • MolFea
  • MineSeqLog
  • Solving inductive queries
  • Very general framework for query formulation
  • Problems of query evaluation and optimisation
    raised
  • Many remaining open problems and opportunities
    for research

75
Thanks