Title: Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data
1. Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data
- Sauro Menchetti, Fabrizio Costa, Paolo Frasconi
- Department of Systems and Computer Science
- Università di Firenze, Italy
- http://www.dsi.unifi.it/neural/
- Massimiliano Pontil
- Department of Computer Science
- University College London, UK
2. Structured Data
- In many applications it is useful to represent the objects of the domain by structured data (trees, graphs, ...)
- Structured representations better capture the important relationships between the sub-parts that compose an object
3. Natural Language Parse Trees
[Figure: parse tree of the example sentence "He was previously vice president.", with constituents S, NP, VP, ADVP and POS tags PRP, VBD, RB, NN]
4. Structural Genomics: Protein Contact Maps
5. Document Processing: XY-Trees
6. Predictive Toxicology, QSAR: Chemical Compounds as Graphs
7. Ranking vs. Preference
[Figure: a ranking orders all five alternatives (1, 2, 3, 4, 5); a preference only identifies the best one]
8. Preference on Structured Data
9. Classification, Regression and Ranking
- Supervised learning task
- f: X → Y
- Classification
- Y is a finite unordered set
- Regression
- Y is a metric space (reals)
- Ranking and Preference
- Y is a finite ordered set
- Y is a non-metric space
10. Learning on Structured Data
- Learning algorithms on discrete structures often derive from vector-based methods
- Both kernel machines and RNNs are suitable for learning on structured domains
11. Kernels vs. RNNs
- Kernel Machines
- Very high-dimensional feature space
- How to choose the kernel?
- prior knowledge, fixed representation
- Minimize a convex functional (SVM)
- Recursive Neural Networks
- Low-dimensional space
- Task-driven representation: depends on the specific learning task
- Learn an implicit encoding of the relevant information
- Problem of local minima
12. A Kernel for Labeled Trees
- Feature space
- Set of all tree fragments (subtrees), with the only constraint that a father cannot be separated from his children
- Fn(t) = number of occurrences of tree fragment n in t
- "Bag of something"
- A tree is represented by
- F(t) = (F1(t), F2(t), F3(t), ...)
- K(t,s) = F(t) · F(s) is computed efficiently by dynamic programming (Collins & Duffy, NIPS 2001); see the sketch below
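A minimal Python sketch of this kernel, following the Collins & Duffy recursion with a decay factor; the Tree class and the production() helper are illustrative assumptions, not the authors' implementation:

```python
# Sketch of the Collins & Duffy tree kernel: K(t,s) counts the tree
# fragments common to t and s via the recursion C(n1,n2).

class Tree:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def production(n):
    # A node's production: its label plus the ordered labels of its children
    return (n.label, tuple(c.label for c in n.children))

def nodes(t):
    yield t
    for c in t.children:
        yield from nodes(c)

def tree_kernel(t, s, decay=1.0):
    memo = {}

    def C(n1, n2):
        # C(n1,n2) = number of common fragments rooted at n1 and n2,
        # where a father is never separated from his children
        if production(n1) != production(n2):
            return 0.0
        key = (id(n1), id(n2))
        if key not in memo:
            result = decay
            for c1, c2 in zip(n1.children, n2.children):
                result *= 1.0 + C(c1, c2)
            memo[key] = result
        return memo[key]

    # Dynamic programming over all pairs of nodes
    return sum(C(n1, n2) for n1 in nodes(t) for n2 in nodes(s))

t = Tree("S", [Tree("NP", [Tree("He")]), Tree("VP", [Tree("was")])])
print(tree_kernel(t, t))  # 15.0 with decay=1.0
```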
13. Recursive Neural Networks
- Composition of two adaptive functions
- f: transition function
- o: output function
- The f and o functions are implemented by feedforward NNs
- Both the RNN parameters and the representation vectors are found by maximizing the likelihood of the training data (see the sketch below)
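A minimal sketch of such a network, assuming a fixed maximum out-degree and one-hot node labels; the sizes, the Node class and the linear readout are illustrative, and training by likelihood maximization is not reproduced here:

```python
import numpy as np

STATE, LABELS, MAX_CHILDREN = 10, 5, 2   # illustrative sizes
rng = np.random.default_rng(0)
W_f = rng.normal(scale=0.1, size=(STATE, LABELS + MAX_CHILDREN * STATE))
w_o = rng.normal(scale=0.1, size=STATE)

class Node:
    # Hypothetical labeled-tree node with a one-hot label encoding
    def __init__(self, label_id, children=()):
        self.label = np.eye(LABELS)[label_id]
        self.children = list(children)

def f(node, child_states):
    # Transition function: combine the node label with the children's states
    child_states += [np.zeros(STATE)] * (MAX_CHILDREN - len(child_states))
    return np.tanh(W_f @ np.concatenate([node.label, *child_states]))

def encode(node):
    # Network unfolding: apply f bottom-up; the root state encodes the tree
    return f(node, [encode(c) for c in node.children])

def U(tree):
    # Output function o: here a linear readout of the root state
    return float(w_o @ encode(tree))

print(U(Node(0, [Node(1), Node(2, [Node(3)])])))
```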
14. Recursive Neural Networks
[Figure: network unfolding over a labeled tree, prediction phase through the output network, and error-correction phase]
15. Preference Models
- Kernel Preference Model
- Binary classification of pairwise differences between instances
- RNN Preference Model
- Probabilistic model to find the best alternative
- Both models use a utility function to evaluate the importance of an element
16. Utility Function Approach
- Models the importance of an object
- Utility function U: X → R
- x > z ⇔ U(x) > U(z)
- If U is linear
- U(x) > U(z) ⇔ w^T x > w^T z
- U can also be modelled by a neural network
- Ranking and preference problems
- Learn U and then sort by U(x) (see the sketch below)
- Example: U(x) = 11, U(z) = 3
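A minimal sketch of the linear case, with illustrative vectors chosen so that U(x) = 11 and U(z) = 3 as in the example above:

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])                    # illustrative weight vector
alternatives = {"x": np.array([5.0, 0.0, 2.0]),   # U(x) = w^T x = 11
                "z": np.array([1.0, -1.0, 0.0])}  # U(z) = w^T z = 3

U = {name: float(w @ v) for name, v in alternatives.items()}
ranking = sorted(U, key=U.get, reverse=True)   # sort by decreasing utility
print(U, ranking)  # {'x': 11.0, 'z': 3.0} ['x', 'z']
```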
17. Kernel Preference Model
- x1 = best of (x1, ..., xr)
- Create a set of pairs between x1 and x2, ..., xr
- Set of constraints if U is linear
- U(x1) > U(xj) ⇔ w^T x1 > w^T xj ⇔ w^T (x1 - xj) > 0 for j = 2, ..., r
- x1 - xj can be seen as a positive example
- Binary classification of differences between instances (see the sketch below)
- With x → F(x), the process can be easily kernelized
- Note: this model does not take all the alternatives into consideration together, but only two by two
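A minimal sketch of the pair construction, assuming explicit feature vectors; with a kernel, x would be replaced by F(x):

```python
import numpy as np

def preference_to_pairs(alternatives):
    # x1 is the best alternative; each difference x1 - xj becomes a
    # positive example (and xj - x1 a negative one) for a binary classifier
    best, rest = alternatives[0], alternatives[1:]
    examples = []
    for xj in rest:
        examples.append((best - xj, +1))   # want w^T (x1 - xj) > 0
        examples.append((xj - best, -1))
    return examples

forest = [np.array([3.0, 1.0]), np.array([2.0, 0.0]), np.array([1.0, 2.0])]
print(preference_to_pairs(forest))
```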
18. RNN Preference Model
- Set of alternatives (x1, x2, ..., xr)
- U modelled by a recursive neural network architecture
- Compute U(xi) = o(f(xi)) for i = 1, ..., r
- The error (yi - oi) is backpropagated through the whole network
- Note: the softmax function compares all the alternatives together at once (see the sketch below)
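A minimal sketch of the softmax comparison, with illustrative utility values standing in for U(xi) = o(f(xi)):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())   # subtract the max for numerical stability
    return e / e.sum()

utilities = np.array([2.0, 0.5, -1.0])   # U(x1), U(x2), U(x3)
p = softmax(utilities)                   # probability that each xi is best
loss = -np.log(p[0])                     # negative log-likelihood, x1 correct
grad = p - np.eye(len(p))[0]             # per-alternative error -(yi - oi)
print(p, loss, grad)
```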
19. Learning Problems
- First Pass Attachment
- Modelling of a psycholinguistic phenomenon
- Reranking Task
- Reranking the parse trees output by a statistical parser
20. First Pass Attachment (FPA)
- The grammar introduces some ambiguities
- A set of alternatives for each word, but only one is correct
- The first pass attachment can be modelled as a preference problem
21. Heuristics for Prediction Enhancement
- Specializing the FPA prediction for each class of words
- Group the words into 10 classes (verbs, articles, ...)
- Learn a different classifier for each class of words (see the sketch below)
- Removing nodes from the parse tree that aren't important for choosing between different alternatives
- Tree reduction
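A minimal sketch of this modularization, where word_class() and train_classifier() are hypothetical helpers standing in for the 10-class partition and the per-class learner:

```python
from collections import defaultdict

def train_modular(examples, word_class, train_classifier):
    # Partition the FPA examples by the class of the word being attached
    groups = defaultdict(list)
    for word, forest, best in examples:
        groups[word_class(word)].append((forest, best))
    # Train one specialized classifier per word class
    return {cls: train_classifier(data) for cls, data in groups.items()}

def predict_modular(models, word_class, word, forest):
    # Route each prediction to the classifier of the word's class
    return models[word_class(word)].predict(forest)
```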
22. Experimental Setup
- Wall Street Journal (WSJ) section of the Penn Treebank
- Realistic corpus of natural language
- 40,000 sentences, 1 million words
- Average sentence length: 25 words
- Standard benchmark in computational linguistics
- Training on sections 2-21, test on section 23, validation on section 24
23. Voted Perceptron (VP)
- FPA on the WSJ: 100 million trees for training
- Voted Perceptron instead of SVM (Freund & Schapire, Machine Learning 1999)
- Online algorithm for binary classification of instances, based on the perceptron algorithm (simple and efficient)
- Prediction value: weighted vote over all training weight vectors (see the sketch below)
- Performance comparable to maximal-margin classifiers (SVM)
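A minimal sketch of the voted perceptron with a linear kernel; the tree-kernel version would replace the dot products with K(t,s), and the data below are illustrative:

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=1):
    # Keep every intermediate weight vector together with its survival
    # count c (how many examples it classified before the next mistake)
    w, c, history = np.zeros(X.shape[1]), 0, []
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:          # mistake: perceptron update
                history.append((w.copy(), c))
                w, c = w + label * x, 1
            else:
                c += 1
    history.append((w, c))
    return history

def predict(history, x):
    # Weighted vote of all training weight vectors
    return np.sign(sum(c * np.sign(w @ x) for w, c in history))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
print(predict(train_voted_perceptron(X, y, epochs=10), np.array([1.5, 1.0])))
```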
24. Kernel VP vs. RNNs
25. Kernel VP vs. RNNs
26. Kernel VP vs. RNNs: Modularization
27. Small Datasets, No Modularization
28. Complexity Comparison
- VP does not scale linearly with the number of training examples, as the RNNs do
- Computational cost
- Small datasets
- 5 splits of 100 sentences: a week @ 2 GHz CPU
- CPU(VP) >> CPU(RNN)
- Large datasets (all 40,000 sentences)
- VP took over 2 months to complete an epoch @ 2 GHz CPU
- RNN learns in 1-2 epochs: 3 days @ 2 GHz CPU
- VP is smooth with respect to training iterations
29. Reranking Task
- Reranking problem: rerank the parse trees generated by a statistical parser
- Same problem setting as FPA (preference on forests)
- 1 forest/sentence vs. 1 forest/word (less computational cost involved)
30. Evaluation: Parseval Measures
- Standard evaluation measures
- Labeled Precision (LP)
- Labeled Recall (LR)
- Crossing Brackets (CBs)
- Compare a parse from a parser with a hand-annotated parse of the same sentence (see the sketch below)
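A minimal sketch of LP and LR, representing each parse as a set of labeled brackets (label, start, end); the bracket sets below are illustrative:

```python
def parseval(predicted, gold):
    # Labeled Precision / Recall over labeled constituent brackets
    matched = len(predicted & gold)
    return matched / len(predicted), matched / len(gold)

pred = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 4)}
gold = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 3, 5)}
print(parseval(pred, gold))  # (LP, LR) = (0.75, 0.75)
```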
31. Reranking Task
32. Why Do RNNs Outperform Kernel VP?
- Hypothesis 1
- Kernel function: feature space not focused on the specific learning task
- Hypothesis 2
- Kernel preference model worse than the RNN preference model
33. Linear VP on RNN Representation
- Checking hypothesis 1
- Train VP on the RNN representation
- The tree kernel is replaced by a linear kernel
- The state-vector representation of parse trees generated by the RNN is used as input to VP
- Linear VP is trained on the RNN state vectors
34. Linear VP on RNN Representation
35. Conclusions
- RNNs show better generalization properties
- also on small datasets
- at a smaller computational cost
- The problem is
- neither the kernel function
- nor the VP algorithm
- Reason: the linear VP on RNN representation experiment
- The problem is
- the preference model!
- Reason: the kernel preference model does not take all the alternatives into consideration together, but only two by two, as opposed to the RNN model
36. Acknowledgements
- Thanks to
- Alessio Ceroni, Alessandro Vullo
- Andrea Passerini, Giovanni Soda