Title:
1Applying supervised learning and feature
selection techniquesto extract genetic
interactions from text
Sofie Van Landeghem1,2, Yvan Saeys1,2 , Bernard
De Baets3, and Yves Van de Peer1,2
1 Department of Plant Systems Biology, VIB, 9052
Gent, Belgium 2 Department of Molecular Genetics,
Ghent University, 9052 Gent, Belgium 3 Department
of Applied Mathematics, Biometrics and Process
Control, Ghent University, 9000 Gent,
Belgium Contact sofie.vanlandeghem_at_psb.ugent.be
To extract protein interactions from online
literature resources, we have made use of a
dependency parser to analyse natural language.
For each input sentence, the resulting dependency
graph offers meaningful features which are used
in a supervised learning framework to detect
sentences expressing an interaction between two
proteins. We extracted both lexical and syntactic
information from the dependency graphs, capturing
the semantics of the input sentences with rich
feature vectors. Using these explicit feature
vectors, we applied a wide range of classifiers
such as Bayesian networks, SVMs and decision
trees. On this poster, performance results are
reported for the LibSVM implementation by WEKA.
Furthermore, our technique was evaluated on
several large-scale cross-dataset experiments,
offering a more realistic view on model
performance. Finally, we have shown the
usefulness of applying feature selection to our
problem in order to produce more cost-effective
models.
2. Dependency graphs
1. Free text
The results show that myogenin heterodimerizes
with E12 and E47 in vivo, but it does not
homodimerize to a measurable extent.
3. Feature extraction
- Bag-of-words features
- Blinding of protein names
- Stemming of words
- Length of shortest path between two proteins
- Lexical syntactic information of root node
- Lexical syntactic vertex walks heterodimer
nsubj PROT VBZ nsubj PROT - heterodimer prep_with PROT VBZ prep_with PROT
- Lexical syntactic edge walks nsubj
heterodimer prep_with nsubj VBZ prep_with
4. Feature Selection
- Highdimensional feature sets ? dimensionality
reduction - Filter method gain ratio
- Independent of classifier
- Rank features according to estimated performance
6. Evaluation
5. Classifier
- Linear support vector machine (SVM)
- LibSVM implementation by Weka
- Internal 5-fold CV loop to determine
optimal c-parameter
4 cross-dataset experiments, training on 3
corpora, testing on the fourth
4 cross-validation experiments, each on a
different corpus
- Conclusions
- We have developed a method to extract
protein-protein interactions from text, using
rich feature vectors. - Features were mainly derived from dependency
graphs, which offer an analysis of syntactic
structure of a sentence - Feature selection leads to more cost-effective
models, while reducing the risk of overfitting - For the first time in this field, we have
conducted cross-dataset evaluations. These
experiments offer a more realistic view on
extraction performance.