Title: Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data
1. Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data
- Sauro Menchetti, Fabrizio Costa, Paolo Frasconi
- Department of Systems and Computer Science
- Università di Firenze, Italy
- http://www.dsi.unifi.it/neural/
- Massimiliano Pontil
- Department of Computer Science
- University College London, UK
2. Structured Data
- In many applications it is useful to represent the objects of the domain by structured data (trees, graphs, ...)
- Structured representations better capture the important relationships between the sub-parts that compose an object
3. Natural Language Parse Trees
[Figure: parse tree of the example sentence "He was previously vice president.", with constituents S, NP, VP, ADVP and POS tags PRP, VBD, RB, NN]
4. Structural Genomics: Protein Contact Maps
5. Document Processing: XY-Trees
6. Predictive Toxicology, QSAR: Chemical Compounds as Graphs
7. Ranking vs. Preference
[Figure: a ranking orders all five alternatives (1, 2, 3, 4, 5); a preference only identifies the best one]
8. Preference on Structured Data
9. Classification, Regression and Ranking
- Supervised learning task
- f: X → Y
- Classification
- Y is a finite unordered set
- Regression
- Y is a metric space (reals)
- Ranking and Preference
- Y is a finite ordered set
- Y is a non-metric space
10. Learning on Structured Data
- Learning algorithms on discrete structures often derive from vector-based methods
- Both kernel machines and RNNs are suitable for learning on structured domains
11. Kernels vs. RNNs
- Kernel Machines
- Very high-dimensional feature space
- How to choose the kernel?
- prior knowledge, fixed representation
- Minimize a convex functional (SVM)
- Recursive Neural Networks
- Low-dimensional space
- Task-driven representation: depends on the specific learning task
- Learn an implicit encoding of the relevant information
- Problem of local minima
12. A Kernel for Labeled Trees
- Feature space
- Set of all tree fragments (subtrees), with the only constraint that a father cannot be separated from his children
- Fn(t) = number of occurrences of tree fragment n in t
- "Bag of something"
- A tree is represented by
- F(t) = (F1(t), F2(t), F3(t), ...)
- K(t,s) = F(t) · F(s) is computed efficiently by dynamic programming (Collins & Duffy, NIPS 2001); see the sketch below
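A minimal Python sketch of this kernel, following the Collins & Duffy recursion with a decay factor; the Tree class and the production() helper are illustrative assumptions, not the authors' implementation:

```python
# Sketch of the Collins & Duffy tree kernel: K(t,s) counts the tree
# fragments common to t and s via the recursion C(n1,n2).

class Tree:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def production(n):
    # A node's production: its label plus the ordered labels of its children
    return (n.label, tuple(c.label for c in n.children))

def nodes(t):
    yield t
    for c in t.children:
        yield from nodes(c)

def tree_kernel(t, s, decay=1.0):
    memo = {}

    def C(n1, n2):
        # C(n1,n2) = number of common fragments rooted at n1 and n2,
        # where a father is never separated from his children
        if production(n1) != production(n2):
            return 0.0
        key = (id(n1), id(n2))
        if key not in memo:
            result = decay
            for c1, c2 in zip(n1.children, n2.children):
                result *= 1.0 + C(c1, c2)
            memo[key] = result
        return memo[key]

    # Dynamic programming over all pairs of nodes
    return sum(C(n1, n2) for n1 in nodes(t) for n2 in nodes(s))

t = Tree("S", [Tree("NP", [Tree("He")]), Tree("VP", [Tree("was")])])
print(tree_kernel(t, t))  # 15.0 with decay=1.0
```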
13. Recursive Neural Networks
- Composition of two adaptive functions
- f: transition function
- o: output function
- The f and o functions are implemented by feedforward NNs
- Both the RNN parameters and the representation vectors are found by maximizing the likelihood of the training data (see the sketch below)
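A minimal sketch of such a network, assuming a fixed maximum out-degree and one-hot node labels; the sizes, the Node class and the linear readout are illustrative, and training by likelihood maximization is not reproduced here:

```python
import numpy as np

STATE, LABELS, MAX_CHILDREN = 10, 5, 2   # illustrative sizes
rng = np.random.default_rng(0)
W_f = rng.normal(scale=0.1, size=(STATE, LABELS + MAX_CHILDREN * STATE))
w_o = rng.normal(scale=0.1, size=STATE)

class Node:
    # Hypothetical labeled-tree node with a one-hot label encoding
    def __init__(self, label_id, children=()):
        self.label = np.eye(LABELS)[label_id]
        self.children = list(children)

def f(node, child_states):
    # Transition function: combine the node label with the children's states
    child_states += [np.zeros(STATE)] * (MAX_CHILDREN - len(child_states))
    return np.tanh(W_f @ np.concatenate([node.label, *child_states]))

def encode(node):
    # Network unfolding: apply f bottom-up; the root state encodes the tree
    return f(node, [encode(c) for c in node.children])

def U(tree):
    # Output function o: here a linear readout of the root state
    return float(w_o @ encode(tree))

print(U(Node(0, [Node(1), Node(2, [Node(3)])])))
```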
14. Recursive Neural Networks
[Figure: network unfolding over a labeled tree, prediction phase through the output network, and error-correction phase]
15. Preference Models
- Kernel Preference Model
- Binary classification of pairwise differences between instances
- RNN Preference Model
- Probabilistic model to find the best alternative
- Both models use a utility function to evaluate the importance of an element
16. Utility Function Approach
- Models the importance of an object
- Utility function U: X → R
- x > z ⇔ U(x) > U(z)
- If U is linear
- U(x) > U(z) ⇔ w^T x > w^T z
- U can also be modelled by a neural network
- Ranking and preference problems
- Learn U and then sort by U(x) (see the sketch below)
- Example: U(x) = 11, U(z) = 3
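A minimal sketch of the linear case, with illustrative vectors chosen so that U(x) = 11 and U(z) = 3 as in the example above:

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])                    # illustrative weight vector
alternatives = {"x": np.array([5.0, 0.0, 2.0]),   # U(x) = w^T x = 11
                "z": np.array([1.0, -1.0, 0.0])}  # U(z) = w^T z = 3

U = {name: float(w @ v) for name, v in alternatives.items()}
ranking = sorted(U, key=U.get, reverse=True)   # sort by decreasing utility
print(U, ranking)  # {'x': 11.0, 'z': 3.0} ['x', 'z']
```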
17. Kernel Preference Model
- x1 = best of (x1, ..., xr)
- Create a set of pairs between x1 and x2, ..., xr
- Set of constraints if U is linear
- U(x1) > U(xj) ⇔ w^T x1 > w^T xj ⇔ w^T (x1 - xj) > 0 for j = 2, ..., r
- x1 - xj can be seen as a positive example
- Binary classification of differences between instances (see the sketch below)
- With x → F(x), the process can be easily kernelized
- Note: this model does not take all the alternatives into consideration together, but only two by two
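A minimal sketch of the pair construction, assuming explicit feature vectors; with a kernel, x would be replaced by F(x):

```python
import numpy as np

def preference_to_pairs(alternatives):
    # x1 is the best alternative; each difference x1 - xj becomes a
    # positive example (and xj - x1 a negative one) for a binary classifier
    best, rest = alternatives[0], alternatives[1:]
    examples = []
    for xj in rest:
        examples.append((best - xj, +1))   # want w^T (x1 - xj) > 0
        examples.append((xj - best, -1))
    return examples

forest = [np.array([3.0, 1.0]), np.array([2.0, 0.0]), np.array([1.0, 2.0])]
print(preference_to_pairs(forest))
```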
18. RNN Preference Model
- Set of alternatives (x1, x2, ..., xr)
- U modelled by a recursive neural network architecture
- Compute U(xi) = o(f(xi)) for i = 1, ..., r
- The error (yi - oi) is backpropagated through the whole network
- Note: the softmax function compares all the alternatives together at once (see the sketch below)
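A minimal sketch of the softmax comparison, with illustrative utility values standing in for U(xi) = o(f(xi)):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())   # subtract the max for numerical stability
    return e / e.sum()

utilities = np.array([2.0, 0.5, -1.0])   # U(x1), U(x2), U(x3)
p = softmax(utilities)                   # probability that each xi is best
loss = -np.log(p[0])                     # negative log-likelihood, x1 correct
grad = p - np.eye(len(p))[0]             # per-alternative error -(yi - oi)
print(p, loss, grad)
```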
19. Learning Problems
- First Pass Attachment
- Modelling of a psycholinguistic phenomenon
- Reranking Task
- Reranking the parse trees output by a statistical parser
20. First Pass Attachment (FPA)
- The grammar introduces some ambiguities
- A set of alternatives for each word, but only one is correct
- The first pass attachment can be modelled as a preference problem
21. Heuristics for Prediction Enhancement
- Specializing the FPA prediction for each class of words
- Group the words into 10 classes (verbs, articles, ...)
- Learn a different classifier for each class of words (see the sketch below)
- Removing nodes from the parse tree that aren't important for choosing between different alternatives
- Tree reduction
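A minimal sketch of this modularization, where word_class() and train_classifier() are hypothetical helpers standing in for the 10-class partition and the per-class learner:

```python
from collections import defaultdict

def train_modular(examples, word_class, train_classifier):
    # Partition the FPA examples by the class of the word being attached
    groups = defaultdict(list)
    for word, forest, best in examples:
        groups[word_class(word)].append((forest, best))
    # Train one specialized classifier per word class
    return {cls: train_classifier(data) for cls, data in groups.items()}

def predict_modular(models, word_class, word, forest):
    # Route each prediction to the classifier of the word's class
    return models[word_class(word)].predict(forest)
```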
22. Experimental Setup
- Wall Street Journal (WSJ) section of the Penn Treebank
- Realistic corpus of natural language
- 40,000 sentences, 1 million words
- Average sentence length: 25 words
- Standard benchmark in computational linguistics
- Training on sections 2-21, test on section 23, validation on section 24
23. Voted Perceptron (VP)
- FPA on the WSJ: 100 million trees for training
- Voted Perceptron instead of SVM (Freund & Schapire, Machine Learning 1999)
- Online algorithm for binary classification of instances, based on the perceptron algorithm (simple and efficient)
- Prediction value: weighted vote over all training weight vectors (see the sketch below)
- Performance comparable to maximal-margin classifiers (SVM)
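A minimal sketch of the voted perceptron with a linear kernel; the tree-kernel version would replace the dot products with K(t,s), and the data below are illustrative:

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=1):
    # Keep every intermediate weight vector together with its survival
    # count c (how many examples it classified before the next mistake)
    w, c, history = np.zeros(X.shape[1]), 0, []
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:          # mistake: perceptron update
                history.append((w.copy(), c))
                w, c = w + label * x, 1
            else:
                c += 1
    history.append((w, c))
    return history

def predict(history, x):
    # Weighted vote of all training weight vectors
    return np.sign(sum(c * np.sign(w @ x) for w, c in history))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
print(predict(train_voted_perceptron(X, y, epochs=10), np.array([1.5, 1.0])))
```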
24. Kernel VP vs. RNNs
25. Kernel VP vs. RNNs
26. Kernel VP vs. RNNs: Modularization
27. Small Datasets, No Modularization
28. Complexity Comparison
- VP does not scale linearly with the number of training examples, as the RNNs do
- Computational cost
- Small datasets
- 5 splits of 100 sentences: a week @ 2 GHz CPU
- CPU(VP) >> CPU(RNN)
- Large datasets (all 40,000 sentences)
- VP took over 2 months to complete an epoch @ 2 GHz CPU
- RNN learns in 1-2 epochs: 3 days @ 2 GHz CPU
- VP is smooth with respect to training iterations
29. Reranking Task
- Reranking problem: rerank the parse trees generated by a statistical parser
- Same problem setting as FPA (preference on forests)
- 1 forest/sentence vs. 1 forest/word (less computational cost involved)
30. Evaluation: Parseval Measures
- Standard evaluation measures
- Labeled Precision (LP)
- Labeled Recall (LR)
- Crossing Brackets (CBs)
- Compare a parse from a parser with a hand-annotated parse of the same sentence (see the sketch below)
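A minimal sketch of LP and LR, representing each parse as a set of labeled brackets (label, start, end); the bracket sets below are illustrative:

```python
def parseval(predicted, gold):
    # Labeled Precision / Recall over labeled constituent brackets
    matched = len(predicted & gold)
    return matched / len(predicted), matched / len(gold)

pred = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 4)}
gold = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 3, 5)}
print(parseval(pred, gold))  # (LP, LR) = (0.75, 0.75)
```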
31. Reranking Task
32. Why Do RNNs Outperform Kernel VP?
- Hypothesis 1
- Kernel function: feature space not focused on the specific learning task
- Hypothesis 2
- Kernel preference model worse than the RNN preference model
33. Linear VP on RNN Representation
- Checking hypothesis 1
- Train VP on the RNN representation
- The tree kernel is replaced by a linear kernel
- The state-vector representation of parse trees generated by the RNN is used as input to VP
- Linear VP is trained on the RNN state vectors
34. Linear VP on RNN Representation
35. Conclusions
- RNNs show better generalization properties
- also on small datasets
- at a smaller computational cost
- The problem is
- neither the kernel function
- nor the VP algorithm
- Reason: the linear VP on RNN representation experiment
- The problem is
- the preference model!
- Reason: the kernel preference model does not take all the alternatives into consideration together, but only two by two, as opposed to the RNN model
36. Acknowledgements
- Thanks to
- Alessio Ceroni, Alessandro Vullo
- Andrea Passerini, Giovanni Soda