Title: Connectionist Knowledge Representation and Reasoning (Part I)
1. Connectionist Knowledge Representation and Reasoning (Part I)
SCREECH
- Neural Networks and Structured Knowledge
Fodor, Pylyshyn: "What's deeply wrong with Connectionist architecture is this: Because it acknowledges neither syntactic nor semantic structure in mental representations, it perforce treats them not as a generated set but as a list." (Connectionism and Cognitive Architecture, 1988)
Our claim: state-of-the-art connectionist architectures do adequately deal with structures!
2. Tutorial Outline (Part I): Neural networks and structured knowledge
- Feedforward networks
  - The good old days: KBANN and co.
  - Useful: neurofuzzy systems, data mining pipeline
  - State of the art: structure kernels
- Recurrent networks
  - The basics: partially recurrent networks
  - Lots of theory: principled capacity and limitations
  - To do: challenges
- Recursive data structures
  - The general idea: recursive distributed representations
  - One breakthrough: recursive networks
  - Going on: towards more complex structures
4. The good old days: KBANN and co.
Feedforward neural network f_w: R^n → R^o (inputs x, outputs y, hidden neurons):
- black box
- distributed representation
- connection to rules for symbolic I/O?
5. The good old days: KBANN and co.
- Knowledge Based Artificial Neural Networks [Towell/Shavlik, AIJ 94]
  - start with a network which represents known rules
  - train using additional data
  - extract a set of symbolic rules after training
6. The good old days: KBANN and co.
7. The good old days: KBANN and co.
Train the rule-initialized network on data: use some form of backpropagation, adding a penalty to the error, e.g. for changing the weights (a minimal sketch of the initialization follows below).
- The initial network biases the training result, but
- there is no guarantee that the initial rules are preserved
- there is no guarantee that the hidden neurons maintain their semantics
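The rule-to-network step can be illustrated with a minimal sketch (a simplified, assumed construction with hypothetical helper names, not Towell/Shavlik's original code): each propositional rule becomes a neuron whose large weights and bias implement the rule's antecedents, and backpropagation then refines these weights on data.

```python
# Minimal KBANN-style initialization sketch: a rule "y :- a, b, not c" becomes
# a neuron with weight +W for positive antecedents, -W for negated ones, and a
# bias chosen so the neuron only fires when all antecedents hold.
import numpy as np

W = 4.0  # large initial weight expressing a "certain" dependency

def rule_to_neuron(n_inputs, positive, negative):
    """Return (weights, bias) implementing an AND over the given literals."""
    w = np.zeros(n_inputs)
    w[positive] = W
    w[negative] = -W
    # fire (sigmoid > 0.5) only if all positive literals are on and negatives off
    bias = -W * (len(positive) - 0.5)
    return w, bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# rule: out :- x0, x1, not x2   (inputs x0..x3)
w, b = rule_to_neuron(4, positive=[0, 1], negative=[2])
x = np.array([1.0, 1.0, 0.0, 1.0])
print(sigmoid(w @ x + b))   # close to 1: the rule fires
```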
8. The good old days: KBANN and co.
From network to (complete) rules:
- There is no exact direct correspondence between a neuron and a single rule, although each neuron (and the overall mapping) can be approximated by a set of rules arbitrarily well.
- It is NP-complete to find a minimum logical description for a trained network [Golea, AISB'96].
- Therefore, a couple of different rule extraction algorithms have been proposed, and this is still a topic of ongoing research.
9. The good old days: KBANN and co.
From network to (complete) rules: decompositional approach vs. pedagogical approach.
10. The good old days: KBANN and co.
- Decompositional approaches
  - subset algorithm, MofN algorithm: describe single neurons by sets of active predecessors [Craven/Shavlik, 94] (see the sketch after this list)
  - local activation functions (RBF-like) allow an approximate direct description of single neurons [Andrews/Geva, 96]
  - MLP2LN biases the weights towards 0/-1/1 during training and can then extract exact rules [Duch et al., 01]
  - prototype-based networks can be decomposed along relevant input dimensions by decision tree nodes [Hammer et al., 02]
- Observation
  - usually some variation of if-then rules is achieved
  - small rule sets are only achieved if further constraints guarantee that single weights/neurons have a meaning
  - tradeoff between accuracy and size of the description
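A toy illustration of the M-of-N style of description (heavily simplified; the actual MofN algorithm first clusters the incoming weights): a neuron is summarized as "if at least M of these N antecedents are active, the neuron is active", and the summary is checked against the neuron's behavior. All values below are made up for illustration.

```python
# Toy illustration of an M-of-N style description of a single neuron.
import numpy as np

def mofn_describes(weights, bias, antecedents, m, inputs):
    """Check whether 'at least m of the antecedents are on' agrees with
    the neuron's decision on the given binary input."""
    neuron_on = (weights @ inputs + bias) > 0
    rule_on = inputs[antecedents].sum() >= m
    return neuron_on == rule_on

w = np.array([2.0, 2.1, 1.9, 0.1])   # three strong, one negligible weight
b = -3.8                              # neuron needs roughly two strong inputs
rule = dict(antecedents=[0, 1, 2], m=2)   # "2 of {x0, x1, x2}"
X = [np.array(bits) for bits in
     [(1, 1, 0, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 0, 1, 1)]]
print(all(mofn_describes(w, b, rule["antecedents"], rule["m"], x) for x in X))
```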
11. The good old days: KBANN and co.
- Pedagogical approaches
  - extraction of conjunctive rules by extensive search [Saito/Nakano, 88]
  - interval propagation [Gallant, 93; Thrun, 95]
  - extraction by minimum separation [Tickle/Diederich, 94]
  - extraction of decision trees [Craven/Shavlik, 94]
  - evolutionary approaches [Markovska, 05]
- Observation
  - usually some variation of if-then rules is achieved
  - essentially symbolic rule induction with a little (or a bit more) help from a neural network
12. The good old days: KBANN and co.
- What is this good for?
  - Nobody uses FNNs these days?
  - Insertion of prior knowledge might be valuable, but efficient training algorithms make it possible to substitute this by additional training data (generated via rules)?
  - Validation of the network output might be valuable, but there exist alternative (good) guarantees from statistical learning theory?
  - If-then rules are not very interesting, since there exist good symbolic learners for learning propositional classification rules?
  - However: propositional rule insertion/extraction is often an essential part of more complex rule insertion/extraction mechanisms.
  - It demonstrates a key problem, different modes of representation, in a very nice way.
  - Some people, e.g. in the medical domain, also want an explanation for a classification.
  - There are at least two application domains where if-then rules are very interesting and not so easy to learn: fuzzy control and unsupervised data mining.
14. Useful: neurofuzzy systems
[Control loop: input → process → observation → control]
Fuzzy control: if (observation ∈ FM_I) then (control ∈ FM_O)
15. Useful: neurofuzzy systems
Fuzzy control: if (observation ∈ FM_I) then (control ∈ FM_O)
Neurofuzzy control.
Benefit: the form of the fuzzy rules (i.e. the neural architecture) and the shape of the fuzzy sets (i.e. the neural weights) can be learned from data!
16. Useful: neurofuzzy systems
- NEFCON implements Mamdani control [Nauck/Klawonn/Kruse, 94]
- ANFIS implements Takagi-Sugeno control [Jang, 93] (see the sketch after this list)
- and many others
- Learning
  - of rules: evolutionary algorithms or clustering
  - of fuzzy set parameters: reinforcement learning or some form of Hebbian learning
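A minimal sketch of the kind of computation ANFIS tunes (a zero-order Takagi-Sugeno rule base with Gaussian membership functions); the membership centers/widths and the rule consequents play the role of the neural weights. All numerical values are made up for illustration.

```python
# Minimal zero-order Takagi-Sugeno inference step:
#   if x is LOW  then control = 0.9
#   if x is HIGH then control = -0.7
import numpy as np

def gauss(x, c, s):
    return np.exp(-0.5 * ((x - c) / s) ** 2)

centers = np.array([-1.0, 1.0])      # fuzzy set centers (trainable)
widths = np.array([1.0, 1.0])        # fuzzy set widths (trainable)
consequents = np.array([0.9, -0.7])  # rule outputs (trainable)

def control(x):
    mu = gauss(x, centers, widths)         # rule firing strengths
    return (mu @ consequents) / mu.sum()    # weighted-average defuzzification

print(control(-1.0), control(0.0), control(1.0))
```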
17. Useful: data mining pipeline
- Task: describe given inputs (no class information) by if-then rules
- Data mining with emergent SOM, clustering, and rule extraction [Ultsch, 91]
19. State of the art: structure kernels
Idea: for complex data with structure information (sets, sequences, tree structures, graph structures), just compute pairwise similarities, i.e. a kernel k(x,x'), on the data.
20. State of the art: structure kernels
- Closure properties of kernels [Haussler, Watkins]
- Principled problems for complex structures: computing informative graph kernels is at least as hard as graph isomorphism [Gärtner]
- Several promising proposals; taxonomy [Gärtner], ranging from syntax to semantics: count common substructures, derived from a probabilistic model, derived from local transformations
21. State of the art: structure kernels
- Count common substructures, e.g. 2-mers (efficient computation: dynamic programming, suffix trees; a sketch follows below):

             GA  AG  AT
    GAGAGA    3   2   0
    GAT       1   0   1

    k(GAGAGA, GAT) = 3·1 + 2·0 + 0·1 = 3

- locality improved kernel [Sonnenburg et al.], bag of words [Joachims], string kernel [Lodhi et al.], spectrum kernel [Leslie et al.], word-sequence kernel [Cancedda et al.]
- convolution kernels for language [Collins/Duffy, Kashima/Koyanagi, Suzuki et al.], kernels for relational learning [Zelenko et al., Cumby/Roth, Gärtner et al.]
- graph kernels based on paths or subtrees [Gärtner et al., Kashima et al.], kernels for Prolog trees based on similar symbols [Passerini/Frasconi/De Raedt]
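A minimal sketch of a 2-mer count ("spectrum"-style) kernel reproducing the toy example above; real implementations use suffix trees or dynamic programming rather than explicit dictionaries.

```python
# Count common 2-mers and take the inner product of the count vectors.
from collections import Counter

def kmer_counts(s, k=2):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=2):
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[w] * ct[w] for w in cs)   # inner product of count vectors

print(kmer_counts("GAGAGA"))                 # GA:3, AG:2
print(kmer_counts("GAT"))                    # GA:1, AT:1
print(spectrum_kernel("GAGAGA", "GAT"))      # 3*1 + 2*0 + 0*1 = 3
```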
22. State of the art: structure kernels
- Derived from a probabilistic model: describe the data by a probabilistic model P(x), then compare characteristics of P(x).
- Fisher kernel [Jaakkola et al., Karchin et al., Pavlidis et al., Smith/Gales, Sonnenburg et al., Siolas et al.], tangent vector of log odds [Tsuda et al.], marginalized kernels [Tsuda et al., Kashima et al.], kernel of Gaussian models [Moreno et al., Kondor/Jebara]
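For reference, the Fisher kernel construction in its standard form: each example is mapped to the gradient of the log-likelihood of the fitted model (its "Fisher score"), and these scores are compared via the Fisher information metric.

```latex
U_x = \nabla_\theta \log P(x \mid \theta), \qquad
k(x, x') = U_x^{\top} \, \mathcal{I}^{-1} \, U_{x'}, \qquad
\mathcal{I} = \mathbb{E}_x\!\left[ U_x U_x^{\top} \right]
```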
23. State of the art: structure kernels
- Derived from local transformations: a local neighborhood ("is similar to") defines a generator H; expand it to a global kernel.
- diffusion kernel [Kondor/Lafferty, Lafferty/Lebanon, Vert/Kanehisa]
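In the Kondor/Lafferty construction, the local generator H (e.g. the negative graph Laplacian) is expanded into a global kernel via the matrix exponential:

```latex
K_\beta = e^{\beta H} = \lim_{n \to \infty} \Bigl( I + \tfrac{\beta}{n} H \Bigr)^{n},
\qquad
H_{ij} = \begin{cases} 1 & i \sim j \\ -d_i & i = j \\ 0 & \text{otherwise} \end{cases}
```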
24. State of the art: structure kernels
- Intelligent preprocessing (kernel extraction) allows an adequate integration of semantic/syntactic structure information.
- This can be combined with state-of-the-art neural methods such as the SVM.
- Very promising results for
  - classification of documents, text [Duffy, Leslie, Lodhi, …]
  - detecting remote homologies for genomic sequences and further problems in genome analysis [Haussler, Sonnenburg, Vert, …]
  - quantitative structure-activity relationships in chemistry [Baldi et al.]
25. Conclusions: feedforward networks
- propositional rule insertion and extraction are possible (to some extent)
- useful for neurofuzzy systems, data mining
- structure-based kernel extraction followed by learning with SVM yields state-of-the-art results
- but this is a sequential rather than a fully integrated neuro-symbolic approach
- FNNs themselves are restricted to flat data which can be processed in one shot; no recurrence
27. The basics: partially recurrent networks
[Elman, Finding structure in time, CogSci 90] — a very natural architecture for processing speech/temporal signals/control/robotics:
- can process time series of arbitrary length
- interesting for speech processing, see e.g. [Kremer, 02]
- training using a variation of backpropagation, see e.g. [Pearlmutter, 95]
State update: x_{t+1} = f(x_t, I_t) (a minimal sketch follows below)
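A minimal NumPy sketch of the Elman-style recurrence x_{t+1} = f(x_t, I_t); the weight matrices here are random stand-ins for what a variant of backpropagation (through time) would learn, and the dimensions are arbitrary.

```python
# Elman-style recurrent step: the context layer feeds the previous hidden
# state back in, so sequences of arbitrary length can be processed.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 5, 2
W_in = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_out = rng.normal(scale=0.5, size=(n_out, n_hidden))

def step(x, inp):
    """One recurrent update: new state from old state and current input."""
    return np.tanh(W_in @ inp + W_rec @ x)

def run(sequence):
    x = np.zeros(n_hidden)                # initial context
    for inp in sequence:                  # arbitrary-length time series
        x = step(x, inp)
    return W_out @ x                      # read-out from the final state

print(run([rng.normal(size=n_in) for _ in range(7)]))
```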
29. Lots of theory: principled capacity and limitations
- RNNs and finite automata [Omlin/Giles, 96]: the recurrent dynamics (input, state → state, output) mimic the transition function of a DFA.
30. Lots of theory: principled capacity and limitations
With a unary (one-hot) input and a unary state representation, implement (approximately) the Boolean formula corresponding to the state transition within a two-layer network.
⇒ RNNs can exactly simulate finite automata (construction sketched below).
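A small sketch of this construction under simplifying assumptions (hard threshold units instead of steep sigmoids, hypothetical variable names): states and symbols are one-hot vectors, the first layer ANDs each (state, symbol) pair, and the second layer ORs the pairs that lead to a given next state.

```python
# Unary-representation construction: a DFA step as a two-layer threshold net.
# Example automaton: parity of the number of 'b' symbols.
import numpy as np

states, symbols = ["even", "odd"], ["a", "b"]
delta = {("even", "a"): "even", ("even", "b"): "odd",
         ("odd", "a"): "odd", ("odd", "b"): "even"}

n_q, n_s = len(states), len(symbols)
W1 = np.zeros((n_q * n_s, n_q + n_s)); b1 = -1.5 * np.ones(n_q * n_s)  # AND layer
W2 = np.zeros((n_q, n_q * n_s)); b2 = -0.5 * np.ones(n_q)              # OR layer
for i, q in enumerate(states):
    for j, s in enumerate(symbols):
        k = i * n_s + j
        W1[k, i] = 1.0          # current state is q
        W1[k, n_q + j] = 1.0    # current symbol is s
        W2[states.index(delta[(q, s)]), k] = 1.0   # pair (q, s) leads here

def heav(x):
    return (x > 0).astype(float)

def run(word):
    state = np.eye(n_q)[0]                       # start in "even"
    for ch in word:
        inp = np.eye(n_s)[symbols.index(ch)]
        state = heav(W2 @ heav(W1 @ np.concatenate([state, inp]) + b1) + b2)
    return states[int(np.argmax(state))]

print(run("abba"), run("ab"))   # -> even, odd
```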
31. Lots of theory: principled capacity and limitations
Conversely, for a trained RNN with unary input but, in general, a distributed state representation: cluster the states into disjoint subsets corresponding to automaton states and observe their behavior ⇒ an approximate description.
⇒ approximate extraction of automata rules is possible
32. Lots of theory: principled capacity and limitations
- The principled capacity of RNNs can be characterized exactly:
  - RNNs with arbitrary weights = non-uniform Boolean circuits (super-Turing capability) [Siegelmann/Sontag]
  - RNNs with rational weights = Turing machines [Siegelmann/Sontag]
  - RNNs with limited noise = finite state automata [Omlin/Giles, Maass/Orponen]
  - RNNs with small weights or Gaussian noise = finite memory models [Hammer/Tino, Maass/Sontag]
33. Lots of theory: principled capacity and limitations
- However, learning might be difficult:
  - gradient-based learning schemes face the problem of long-term dependencies [Bengio/Frasconi]
  - RNNs are not PAC-learnable (infinite VC-dimension); only distribution-dependent bounds can be derived [Hammer]
  - there exist only a few general guarantees for the long-term behavior of RNNs, e.g. stability [Suykens, Steil, …]
[Figure: training error for increasingly long sequences ta, tatata, tatatatata, …]
34. Lots of theory: principled capacity and limitations
- RNNs
  - naturally process time series
  - incorporate plausible regularization such as a bias towards finite memory models
  - have sufficient power for interesting dynamics (context-free, context-sensitive, arbitrary attractors and chaotic behavior)
- but
  - training is difficult
  - there are only limited guarantees for the long-term behavior and generalization ability
⇒ symbolic description/knowledge can provide solutions
35. Lots of theory: principled capacity and limitations
Recurrent symbolic system vs. RNN: what is the correspondence? E.g. an attractor/repeller pair for counting → a^n b^n c^n.
- RNN: real numbers; iterated function systems give rise to fractals/attractors/chaos; implicit memory.
- Symbolic system: discrete states; crisp Boolean function on the states; explicit memory.
37. To do: challenges
- Training RNNs
  - search for appropriate regularizations inspired by a focus on specific functionalities: architecture (e.g. local), weights (e.g. bounded), activation function (e.g. linear), cost term (e.g. additional penalties) [Hochreiter, Boden, Steil, Kremer]
  - insertion of prior knowledge: finite automata and beyond (e.g. context-free/sensitive languages, specific dynamical patterns/attractors) [Omlin, Croog, …]
- Long-term behavior
  - enforce appropriate constraints while training
  - investigate the dynamics of RNNs: rule extraction, investigation of attractors, relating dynamics and symbolic processing [Omlin, Pasemann, Haschke, Rodriguez, Tino, …]
38. To do: challenges
- Some further issues
  - processing spatial data (e.g. sequences x1 … x10 with context from both directions): bicausal networks [Pollastri et al.], contextual RCC [Micheli et al.]
  - unsupervised processing: TKM [Chappell/Taylor], RecSOM [Voegtlin], SOMSD [Sperduti et al.], MSOM [Hammer et al.], general formulation [Hammer et al.]
39. Conclusions: recurrent networks
- the capacity of RNNs is well understood and promising, e.g. for natural language processing, control, …
- the recurrence of symbolic systems has a natural counterpart in the recurrence of RNNs
- training and generalization face problems which could be solved by hybrid systems
- discrete dynamics with explicit memory versus real-valued iterated function systems
- sequences are nice, but not enough
41. The general idea: recursive distributed representations
- How to turn tree structures/acyclic graphs into a connectionist representation?
42. The general idea: recursive distributed representations
Recursion! Given a network f: R^i × R^c × R^c → R^c (input, context, context → context), this yields the encoding f_enc with
  f_enc(ξ) = 0 for the empty tree ξ,
  f_enc(a(l,r)) = f(a, f_enc(l), f_enc(r)).
43. The general idea: recursive distributed representations
Encoding f_enc: (R^n)^{2*} → R^c with
  f_enc(ξ) = 0,  f_enc(a(l,r)) = f(a, f_enc(l), f_enc(r))
Decoding h_dec: R^o → (R^n)^{2*} with
  h_dec(0) = ξ,  h_dec(x) = h_0(x)( h_dec(h_1(x)), h_dec(h_2(x)) )
based on networks f: R^{n+2c} → R^c, g: R^c → R^o, h: R^o → R^{n+2o}.
44. The general idea: recursive distributed representations
- recursive distributed description [Hinton, 90]
  - general idea without concrete implementation
- tensor construction [Smolensky, 90]
  - encoding/decoding given by (a,b,c) ↦ a⊗b⊗c
  - increasing dimensionality
- Holographic reduced representation [Plate, 95] (see the sketch after this list)
  - circular correlation/convolution
  - fixed encoding/decoding with fixed dimensionality (but potential loss of information)
  - necessity of chunking or clean-up for decoding
- Binary spatter codes [Kanerva, 96]
  - binary operations, fixed dimensionality, potential loss
  - necessity of chunking or clean-up for decoding
- RAAM [Pollack, 90], LRAAM [Sperduti, 94]
  - trainable networks, trained for the identity, fixed dimensionality
  - encoding optimized for the given training set
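A minimal sketch of Plate's holographic reduced representations (role/filler names and dimensionality chosen here for illustration): binding by circular convolution, approximate unbinding by circular correlation, followed by the clean-up step against the known item vectors mentioned above.

```python
# Holographic reduced representations: bind with circular convolution,
# unbind approximately with circular correlation, then clean up.
import numpy as np

rng = np.random.default_rng(1)
d = 1024
items = {name: rng.normal(scale=1/np.sqrt(d), size=d)
         for name in ["left", "right", "A", "B"]}

def bind(a, b):      # circular convolution via FFT
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):    # circular correlation (approximate inverse)
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(c)))

def cleanup(x):      # nearest known item
    return max(items, key=lambda n: items[n] @ x)

# encode the pair (A, B) as a single fixed-size vector
tree = bind(items["left"], items["A"]) + bind(items["right"], items["B"])
print(cleanup(unbind(tree, items["left"])),    # -> A (with high probability)
      cleanup(unbind(tree, items["right"])))   # -> B
```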
45. The general idea: recursive distributed representations
- Nevertheless, results are not promising.
- Theorem [Hammer]
  - There exists a fixed-size neural network which can uniquely encode tree structures of arbitrary depth with discrete labels.
  - For every code, decoding of all trees up to height T requires O(2^T) neurons for sigmoidal networks.
  - ⇒ encoding seems possible, but no fixed-size architecture exists for decoding
47. One breakthrough: recursive networks
- Recursive networks [Goller/Küchler, 96]
  - do not use decoding
  - combine encoding and mapping
  - train this combination directly for the given task with backpropagation through structure
  - ⇒ an efficient, data- and problem-adapted encoding is learned (a minimal sketch of the forward pass follows below)
[Figure: tree → encoding → transformation → output y]
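A minimal sketch of the recursive-network forward pass under the definitions above (random weights as stand-ins, hypothetical tree encoding as nested tuples): the same encoder f is applied bottom-up over a binary tree and the root code is passed through the output layer g; in the actual approach both are trained jointly with backpropagation through structure.

```python
# Recursive network forward pass: f_enc(empty) = 0,
# f_enc(a(l,r)) = tanh(W_f [a; f_enc(l); f_enc(r)]), output = W_g f_enc(root).
import numpy as np

rng = np.random.default_rng(2)
n_label, n_code, n_out = 4, 8, 1
W_f = rng.normal(scale=0.3, size=(n_code, n_label + 2 * n_code))
W_g = rng.normal(scale=0.3, size=(n_out, n_code))

def f_enc(tree):
    """tree = None (empty) or (label_vector, left_subtree, right_subtree)."""
    if tree is None:
        return np.zeros(n_code)                   # f_enc(empty tree) = 0
    label, left, right = tree
    children = np.concatenate([label, f_enc(left), f_enc(right)])
    return np.tanh(W_f @ children)                # f(a, f_enc(l), f_enc(r))

def predict(tree):
    return W_g @ f_enc(tree)                      # g(f_enc(tree))

a, b = np.eye(n_label)[0], np.eye(n_label)[1]
tree = (a, (b, None, None), (a, (b, None, None), None))
print(predict(tree))
```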
48. One breakthrough: recursive networks
- Applications
  - term classification [Goller, Küchler, 1996]
  - automated theorem proving [Goller, 1997]
  - learning tree automata [Küchler, 1998]
  - QSAR/QSPR problems [Schmitt, Goller, 1998; Bianucci, Micheli, Sperduti, Starita, 2000; Vullo, Frasconi, 2003]
  - logo recognition, image processing [Costa, Frasconi, Soda, 1999; Bianchini et al., 2005]
  - natural language parsing [Costa, Frasconi, Sturt, Lombardo, Soda, 2000, 2005]
  - document classification [Diligenti, Frasconi, Gori, 2001]
  - fingerprint classification [Yao, Marcialis, Roli, Frasconi, Pontil, 2001]
  - prediction of contact maps [Baldi, Frasconi, Pollastri, Vullo, 2002]
  - protein secondary structure prediction [Frasconi et al., 2005]
49. One breakthrough: recursive networks
Desired: approximation completeness — for every (reasonable) function f and ε > 0 there exists a RecNN which approximates f up to ε (with an appropriate distance measure).
- Approximation properties can be measured in several ways: given f, ε, a probability P, data points x_i, find f_w such that
  - P(x : |f(x) − f_w(x)| > ε) is small (L1 norm), or
  - |f(x) − f_w(x)| < ε for all x (max norm), or
  - f(x_i) = f_w(x_i) for all x_i (interpolation of points)
50. One breakthrough: recursive networks
- Approximation properties for RecNNs and tree-structured data:
  - capable of approximating every continuous function in the max-norm for restricted height, and every measurable function in the L1-norm (σ squashing) [Hammer]
  - capable of interpolating every set f(x_1), …, f(x_m) with O(m²) neurons (σ squashing, C² in a neighborhood of some t with σ(t) ≠ 0) [Hammer]
  - can approximate every tree automaton for arbitrarily large inputs [Küchler]
  - but cannot approximate every f: {1,2}* → {0,1} (for realistic σ) [Hammer]
- fairly good results: 3 to 1
52. Going on: towards more complex structures
- More general trees: arbitrary number of non-positioned children, with edge labels:
  f_enc(t) = f( (1/|ch|) · Σ_ch w_label(edge) · f_enc(ch), label )
- approximation complete for appropriate edge labels [Bianchini et al., 2005]
53. Going on: towards more complex structures
[Baldi, Frasconi, …, 2002]
54. Going on: towards more complex structures
Contextual cascade correlation [Micheli, Sperduti, 03]: approximation complete (under a mild structural restriction) even for structural transduction [Hammer, Micheli, Sperduti, 05].
55. Going on: towards more complex structures
[Neighbor-based contextual processing; Micheli, 05]
56. Conclusions: recursive networks
- Very promising neural architectures for direct processing of tree structures
- Successful applications and mathematical background
- Connections to symbolic mechanisms (tree automata)
- Extensions to more complex structures (graphs) are under development
- Only a few approaches achieve structured outputs so far
58. Conclusions (Part I)
- Overview literature
  - FNN and rules: Duch, Setiono, Zurada, Computational intelligence methods for understanding of data, Proc. of the IEEE 92(5):771-805, 2004
  - Structure kernels: Gärtner, Lloyd, Flach, Kernels and distances for structured data, Machine Learning 57, 2004 (a new overview is forthcoming)
  - RNNs: Hammer, Steil, Perspectives on learning with recurrent networks, in Verleysen (ed.), ESANN'2002, D-side publications, 357-368, 2002
  - RNNs and rules: Jacobsson, Rule extraction from recurrent neural networks: a taxonomy and review, Neural Computation 17:1223-1263, 2005
  - Recursive representations: Hammer, Perspectives on Learning Symbolic Data with Connectionistic Systems, in Kühn, Menzel, Menzel, Ratsch, Richter, Stamatescu (eds.), Adaptivity and Learning, 141-160, Springer, 2003
  - Recursive networks: Frasconi, Gori, Sperduti, A General Framework for Adaptive Processing of Data Structures, IEEE Transactions on Neural Networks 9(5):768-786, 1998
  - Neural networks and structures: Hammer, Jain, Neural methods for non-standard data, in Verleysen (ed.), ESANN'2004, D-side publications, 281-292, 2004
59. Conclusions (Part I)
- There exist networks which can directly deal with structures (sequences, trees, graphs) with good success: kernel machines, recurrent and recursive networks
- Efficient training algorithms and theoretical foundations exist
- (Loose) connections to symbolic processing have been established and indicate benefits
- Now: towards strong connections
⇒ PART II: Logic and neural networks