PPT – PowerPoint presentation | free to download

About This Presentation

Title:

Description:

1 Department of Plant Systems Biology, VIB, 9052 Gent, Belgium. 2 Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium ... – PowerPoint PPT presentation

Number of Views:28

Avg rating:3.0/5.0

Slides: 2

Provided by: Gro37

Category:

Tags: gent

more less

Transcript and Presenter's Notes

Title:

1
Applying supervised learning and feature
selection techniquesto extract genetic
interactions from text
Sofie Van Landeghem1,2, Yvan Saeys1,2 , Bernard
De Baets3, and Yves Van de Peer1,2
1 Department of Plant Systems Biology, VIB, 9052
Gent, Belgium 2 Department of Molecular Genetics,
Ghent University, 9052 Gent, Belgium 3 Department
of Applied Mathematics, Biometrics and Process
Control, Ghent University, 9000 Gent,
Belgium Contact sofie.vanlandeghem_at_psb.ugent.be
To extract protein interactions from online
literature resources, we have made use of a
dependency parser to analyse natural language.
For each input sentence, the resulting dependency
graph offers meaningful features which are used
in a supervised learning framework to detect
sentences expressing an interaction between two
proteins. We extracted both lexical and syntactic
information from the dependency graphs, capturing
the semantics of the input sentences with rich
feature vectors. Using these explicit feature
vectors, we applied a wide range of classifiers
such as Bayesian networks, SVMs and decision
trees. On this poster, performance results are
reported for the LibSVM implementation by WEKA.
Furthermore, our technique was evaluated on
several large-scale cross-dataset experiments,
offering a more realistic view on model
performance. Finally, we have shown the
usefulness of applying feature selection to our
problem in order to produce more cost-effective
models.
2. Dependency graphs
1. Free text

The results show that myogenin heterodimerizes
with E12 and E47 in vivo, but it does not
homodimerize to a measurable extent.
3. Feature extraction

Bag-of-words features
Blinding of protein names
Stemming of words
Length of shortest path between two proteins
Lexical syntactic information of root node
Lexical syntactic vertex walks heterodimer
nsubj PROT VBZ nsubj PROT
heterodimer prep_with PROT VBZ prep_with PROT
Lexical syntactic edge walks nsubj
heterodimer prep_with nsubj VBZ prep_with

4. Feature Selection

Highdimensional feature sets ? dimensionality
reduction
Filter method gain ratio
Independent of classifier
Rank features according to estimated performance

6. Evaluation

5. Classifier

Linear support vector machine (SVM)
LibSVM implementation by Weka
Internal 5-fold CV loop to determine
optimal c-parameter

4 cross-dataset experiments, training on 3
corpora, testing on the fourth
4 cross-validation experiments, each on a
different corpus

Conclusions
We have developed a method to extract
protein-protein interactions from text, using
rich feature vectors.
Features were mainly derived from dependency
graphs, which offer an analysis of syntactic
structure of a sentence
Feature selection leads to more cost-effective
models, while reducing the risk of overfitting
For the first time in this field, we have
conducted cross-dataset evaluations. These
experiments offer a more realistic view on
extraction performance.