Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data

1
Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data
  • Sauro Menchetti, Fabrizio Costa, Paolo Frasconi
  • Department of Systems and Computer Science
  • Università di Firenze, Italy
  • http://www.dsi.unifi.it/neural/
  • Massimiliano Pontil
  • Department of Computer Science
  • University College London, UK

2
Structured Data
  • In many applications it is useful to represent the
    objects of the domain as structured data (trees,
    graphs, ...)
  • Structured representations better capture important
    relationships between the sub-parts that compose an
    object

3
Natural Language Parse Trees
[Figure: example natural language parse tree, with constituent nodes S, NP, VP, ADVP and part-of-speech tags PRP, VBD, RB, NN over the sentence "He was previously vice president."]
4
Structural Genomics: Protein Contact Maps
5
Document Processing: XY-Trees
6
Predictive Toxicology, QSAR: Chemical Compounds as Graphs
7
Ranking vs. Preference
[Figure: a ranking orders all alternatives (1, 2, 3, 4, 5); a preference only distinguishes the best alternative from the rest.]
8
Preference on Structured Data
9
Classification, Regression and Ranking
  • Supervised learning task
  • f : X → Y
  • Classification
  • Y is a finite unordered set
  • Regression
  • Y is a metric space (e.g. the reals)
  • Ranking and Preference
  • Y is a finite ordered set
  • Y is a non-metric space

10
Learning on Structured Data
  • Learning algorithms on discrete structures often
    derive from vector-based methods
  • Both Kernel Machines and RNNs are suitable for
    learning on structured domains

11
Kernels vs. RNNs
  • Kernel Machines
  • Very high-dimensional feature space
  • How to choose the kernel?
  • prior knowledge, fixed representation
  • Minimize a convex functional (SVM)
  • Recursive Neural Networks
  • Low-dimensional space
  • Task-driven representation depends on the
    specific learning task
  • Learn an implicit encoding of relevant
    information
  • Problem of local minima

12
A Kernel for Labeled Trees
  • Feature Space
  • Set of all tree fragments (subtrees), with the
    only constraint that a parent cannot be
    separated from its children
  • F_n(t) = number of occurrences of tree fragment n in t
  • A "bag of tree fragments" representation
  • A tree is represented by
  • F(t) = (F_1(t), F_2(t), F_3(t), ...)
  • K(t,s) = F(t)·F(s) is computed efficiently by
    dynamic programming (Collins & Duffy, NIPS 2001);
    a minimal sketch follows below

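Below is a minimal Python sketch of the subtree-counting recursion in the spirit of Collins & Duffy (NIPS 2001). The Tree class, its fields, and the toy example are illustrative assumptions; only the recursion C(n1, n2) and the final sum over node pairs follow the kernel described above.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Tree:
    """Illustrative labeled ordered tree (label plus a tuple of children)."""
    label: str
    children: tuple = ()

    def production(self):
        # A node's production: its label followed by its children's labels.
        return (self.label,) + tuple(c.label for c in self.children)

    def nodes(self):
        yield self
        for child in self.children:
            yield from child.nodes()


def is_preterminal(n: Tree) -> bool:
    return len(n.children) == 1 and not n.children[0].children


@lru_cache(maxsize=None)  # dynamic programming over node pairs
def common_fragments(n1: Tree, n2: Tree) -> int:
    """C(n1, n2): number of tree fragments rooted at both n1 and n2."""
    if not n1.children or not n2.children:
        return 0                              # bare words are not fragments
    if n1.production() != n2.production():
        return 0
    if is_preterminal(n1):                    # same tag over the same word
        return 1
    result = 1                                # same production: combine children
    for c1, c2 in zip(n1.children, n2.children):
        result *= 1 + common_fragments(c1, c2)
    return result


def tree_kernel(t: Tree, s: Tree) -> int:
    """K(t, s) = F(t)·F(s): sum of C(n1, n2) over all node pairs."""
    return sum(common_fragments(n1, n2) for n1 in t.nodes() for n2 in s.nodes())


# Toy usage: parse trees for "he ran" and "she ran" share the VP subtree.
t = Tree("S", (Tree("NP", (Tree("PRP", (Tree("he"),)),)),
               Tree("VP", (Tree("VBD", (Tree("ran"),)),))))
s = Tree("S", (Tree("NP", (Tree("PRP", (Tree("she"),)),)),
               Tree("VP", (Tree("VBD", (Tree("ran"),)),))))
print(tree_kernel(t, s))
```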
13
Recursive Neural Networks
  • Composition of two adaptive functions
  • f: transition function
  • o: output function
  • f and o are implemented by feedforward NNs (a
    minimal sketch of the forward pass follows below)
  • Both the RNN parameters and the representation
    vectors are found by maximizing the likelihood of
    the training data

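The sketch below shows the forward (prediction) pass of a recursive network over a labeled tree: f maps a node's label and its children's states to a state vector, and o scores the root state. The dimensions, the fixed maximum out-degree, the tanh layer, and the toy label vocabulary are illustrative assumptions; training by backpropagation through structure is omitted.

```python
import numpy as np

STATE_DIM, LABEL_DIM, MAX_CHILDREN = 8, 4, 2
rng = np.random.default_rng(0)

# Toy label vocabulary, label embeddings, and randomly initialised weights.
labels = {"S": 0, "NP": 1, "VP": 2, "PRP": 3, "VBD": 4}
E = rng.normal(size=(len(labels), LABEL_DIM))       # one embedding per label
W_f = rng.normal(size=(STATE_DIM, LABEL_DIM + MAX_CHILDREN * STATE_DIM))
b_f = np.zeros(STATE_DIM)
w_o = rng.normal(size=STATE_DIM)                    # linear output layer o


def f(label, child_states):
    """Transition function: one tanh layer over the label and children states."""
    pad = [np.zeros(STATE_DIM)] * (MAX_CHILDREN - len(child_states))
    x = np.concatenate([E[labels[label]], *child_states, *pad])
    return np.tanh(W_f @ x + b_f)


def encode(tree):
    """Recursively apply f bottom-up; a tree is given as (label, [children])."""
    label, children = tree
    return f(label, [encode(c) for c in children])


def utility(tree):
    """U(x) = o(f(x)): a scalar score of one alternative."""
    return float(w_o @ encode(tree))


# Toy parse tree for "he ran": S -> NP(PRP) VP(VBD).
tree = ("S", [("NP", [("PRP", [])]), ("VP", [("VBD", [])])])
print(utility(tree))
```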
14
Recursive Neural Networks
[Figure: the network is unfolded over a labeled tree (nodes A, B, C, D, E); the prediction phase computes the output network at the root, followed by an error-correction phase.]
15
Preference Models
  • Kernel Preference Model
  • Binary classification of pairwise differences
    between instances
  • RNNs Preference Model
  • Probabilistic model to find the best alternative
  • Both models use a utility function to evaluate
    the importance of an element

16
Utility Function Approach
  • Modelling the importance of an object
  • Utility function U : X → R
  • x ≻ z ⟺ U(x) > U(z)
  • If U is linear
  • U(x) > U(z) ⟺ w^T x > w^T z
  • U can also be modelled by a neural network
  • Ranking and preference problems
  • Learn U and then sort by U(x) (a minimal sketch
    follows below)

[Figure: example utility values U(x) = 11 and U(z) = 3, so x is preferred to z.]
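A minimal sketch of the utility-function approach with a linear U(x) = w·x. The weight vector and feature vectors are invented for illustration (chosen so that U(x) = 11 and U(z) = 3 as in the figure); the point is only that learning U induces a ranking by sorting on U.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])                 # assumed learned weight vector
alternatives = {
    "x": np.array([5.0, 0.0, 2.0]),            # U(x) = 11.0
    "z": np.array([1.0, 0.0, 2.0]),            # U(z) = 3.0
}

def U(v: np.ndarray) -> float:
    return float(w @ v)

# x is preferred to z iff U(x) > U(z); a full ranking is obtained by sorting on U.
ranked = sorted(alternatives, key=lambda name: U(alternatives[name]), reverse=True)
print(ranked, [U(alternatives[n]) for n in ranked])
```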
17
Kernel Preference Model
  • x1 = best of (x1, ..., xr)
  • Create a set of pairs between x1 and x2, ..., xr
  • Set of constraints if U is linear
  • U(x1) > U(xj) ⟺ w^T x1 > w^T xj ⟺ w^T (x1 - xj) > 0 for j = 2, ..., r
  • x1 - xj can be seen as a positive example
  • Binary classification of differences between
    instances
  • x → F(x): the process can be easily kernelized
  • Note: this model does not take all the alternatives
    into consideration together, but only two at a time
    (a minimal sketch follows below)

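A minimal sketch of the pair-generation step: the best alternative x1 of a forest is paired with every other alternative xj, and the feature difference F(x1) - F(xj) becomes a positive example for a binary classifier. The explicit feature map phi below is an illustrative stand-in for the tree-kernel feature space F.

```python
import numpy as np

def preference_to_examples(forest, best_index, phi):
    """Turn a forest of alternatives (x1 = best) into binary training examples."""
    best = phi(forest[best_index])
    examples = []
    for j, alternative in enumerate(forest):
        if j == best_index:
            continue
        # w·(F(x1) - F(xj)) > 0 is required, so F(x1) - F(xj) is a positive example.
        examples.append((best - phi(alternative), +1))
    return examples

# Toy usage with assumed 3-dimensional feature vectors for three alternatives.
phi = lambda x: np.asarray(x, dtype=float)
forest = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]
for diff, label in preference_to_examples(forest, best_index=0, phi=phi):
    print(diff, label)
```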
18
RNNs Preference Model
  • Set of alternatives (x1, x2, ..., xr)
  • U modelled by a recursive neural network
    architecture
  • Compute U(xi) = o(f(xi)) for i = 1, ..., r
  • Softmax function
  • The error (yi - oi) is backpropagated through the
    whole network
  • Note: the softmax function compares all the
    alternatives together at once (a minimal sketch
    follows below)

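A minimal sketch of the output stage of the RNN preference model: the utilities of all alternatives in a forest are compared jointly through a softmax, and the output-layer error (y_i - o_i) is what gets backpropagated. The utility values below are placeholders for U(x_i) = o(f(x_i)) computed by the recursive network.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())              # subtract max for numerical stability
    return e / e.sum()

utilities = np.array([1.3, -0.2, 0.7])   # assumed U(x_1), U(x_2), U(x_3)
targets = np.array([1.0, 0.0, 0.0])      # x_1 is the correct alternative

o = softmax(utilities)                   # probability assigned to each alternative
error = targets - o                      # (y_i - o_i), backpropagated into f and o
print(o, error)
```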
19
Learning Problems
  • First Pass Attachment
  • Modeling of a psycholinguistic phenomenon
  • Reranking Task
  • Reranking the parse trees output by a statistical
    parser

20
First Pass Attachment (FPA)
  • The grammar introduces some ambiguities
  • A set of alternatives for each word but only one
    is correct
  • The first pass attachment can be modelled as a
    preference problem

21
Heuristics for Prediction Enhancement
  • Specializing the FPA prediction for each class of
    word
  • Group the words into 10 classes (verbs, articles,
    ...)
  • Learn a different classifier for each class of
    words
  • Removing nodes from the parse tree that aren't
    important for choosing between different
    alternatives
  • Tree reduction

22
Experimental Setup
  • Wall Street Journal (WSJ) Section of Penn
    TreeBank
  • Realistic Corpus of Natural Language
  • 40,000 sentences, 1 million words
  • Average sentence length 25 words
  • Standard Benchmark in Computational Linguistics
  • Training on sections 2-21, test on section 23 and
    validation on section 24

23
Voted Perceptron (VP)
  • FPA on the WSJ: 100 million trees for training
  • Voted Perceptron instead of SVM (Freund &
    Schapire, Machine Learning 1999)
  • Online algorithm for binary classification of
    instances based on the perceptron algorithm (simple
    and efficient)
  • Prediction value: a weighted vote over all the
    weight vectors produced during training (a minimal
    sketch follows below)
  • Performance comparable to maximal-margin
    classifiers (SVM)

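A minimal sketch of the voted perceptron of Freund & Schapire (1999): training keeps every intermediate weight vector together with its survival count, and prediction is a vote weighted by those counts. Plain numpy feature vectors stand in for the (kernelized) tree representations.

```python
import numpy as np

def train_voted_perceptron(examples, epochs=1):
    """examples: list of (x, y) with y in {+1, -1}. Returns [(w, count), ...]."""
    dim = len(examples[0][0])
    w, count, history = np.zeros(dim), 0, []
    for _ in range(epochs):
        for x, y in examples:
            if y * (w @ x) <= 0:              # mistake: store old vector, update
                history.append((w.copy(), count))
                w, count = w + y * x, 1
            else:                             # correct: current vector survives
                count += 1
    history.append((w.copy(), count))
    return history

def predict(history, x):
    """Weighted vote of all intermediate perceptrons."""
    vote = sum(c * np.sign(w @ x) for w, c in history)
    return 1 if vote >= 0 else -1

# Toy usage on two linearly separable examples.
examples = [(np.array([1.0, 0.0]), +1), (np.array([0.0, 1.0]), -1)]
model = train_voted_perceptron(examples, epochs=3)
print(predict(model, np.array([0.8, 0.1])))
```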
24
Kernel VP vs. RNNs
25
Kernel VP vs. RNNs
26
Kernel VP vs. RNNs: Modularization
27
Small Datasets: No Modularization
28
Complexity Comparison
  • VP does not scale linearly with the number of
    training examples as the RNNs do
  • Computational cost
  • Small datasets
  • 5 splits of 100 sentences: a week @ 2 GHz CPU
  • CPU(VP) >> CPU(RNN)
  • Large datasets (all 40,000 sentences)
  • VP took over 2 months to complete an epoch @ 2 GHz
    CPU
  • RNN learns in 1-2 epochs: 3 days @ 2 GHz CPU
  • VP is smooth with respect to training iterations

29
Reranking Task
Statistical Parser
  • Reranking problem: rerank the parse trees
    generated by a statistical parser
  • Same problem setting as FPA (preference on
    forests)
  • 1 forest/sentence vs. 1 forest/word (lower
    computational cost)

30
Evaluation Parseval Measures
  • Standard evaluation measures
  • Labeled Precision (LP)
  • Labeled Recall (LR)
  • Crossing Brackets (CBs)
  • Compare a parse produced by the parser with a
    hand-annotated parse of the sentence (a minimal
    sketch of LP/LR follows below)

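A minimal sketch of labeled precision and recall computed over labeled constituents. A constituent is represented as (label, start, end); the spans below are invented for illustration, and crossing brackets are not computed here.

```python
def parseval(predicted, gold):
    """predicted, gold: sets of (label, start, end) constituents."""
    correct = len(predicted & gold)
    lp = correct / len(predicted) if predicted else 0.0   # labeled precision
    lr = correct / len(gold) if gold else 0.0             # labeled recall
    return lp, lr

gold = {("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5)}
pred = {("S", 0, 5), ("NP", 0, 2), ("VP", 1, 5)}
print(parseval(pred, gold))   # 2 of 3 constituents match in both directions
```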
31
Reranking Task
32
Why do RNNs outperform Kernel VP?
  • Hypothesis 1
  • Kernel Function: feature space not focused on the
    specific learning task
  • Hypothesis 2
  • Kernel Preference Model: worse than the RNN
    preference model

33
Linear VP on RNN Representation
  • Checking hypothesis 1
  • Train VP on RNN representation
  • The tree kernel is replaced by a linear kernel
  • State vector representations of parse trees
    generated by the RNN are used as input to VP
  • Linear VP is trained on RNN state vectors

34
Linear VP on RNN Representation
35
Conclusions
  • RNNs show better generalization properties
  • also on small datasets
  • at a smaller computational cost
  • The problem is
  • neither the kernel function
  • nor the VP algorithm
  • Reason: the linear VP on RNN representation
    experiment
  • The problem is
  • the preference model!
  • Reason: the kernel preference model does not take
    all the alternatives into consideration together,
    but only two at a time, as opposed to the RNN

36
Acknowledgements
  • Thanks to
  • Alessio Ceroni, Alessandro Vullo
  • Andrea Passerini, Giovanni Soda