Title: Slide 1
1 A new protein-protein docking scoring function
based on Voronoï tessellation
Yeast Structural Genomics, IBMMC, CNRS UMR 8619, Université Paris-Sud, Orsay
LEBS, CNRS UPR 9063, Bât 34, Gif-sur-Yvette
2 Systematic searches for protein-protein complexes in yeast
Thousands of protein-protein complexes have been identified by genetic and biochemical techniques.
A large part of the complexes obtained corresponds to false positives.
→ Need for efficient algorithms to analyse those protein-protein complexes.
3 Protein-protein docking algorithms: the scoring issue
Protein-protein docking procedure steps:
- exploration
- scoring
In this study:
- exploration is done using Dock from Wodak and Janin (1)
- scoring is done using no biological information
(1) Janin J., Wodak S., Biopolymers 1985
4 Voronoï tessellation
5 Voronoï tessellation
6 Voronoï tessellation as the dual of Delaunay triangulation
7 Voronoï tessellation
Complexity for 3D building:
- Naive algorithm: O(n^3)
- Incremental randomized algorithm (CGAL): O(n^2)
Computation time: 10,000 points in 3D in less than a second
http://www.cgal.org
8 Voronoï cells of amino acid residues
Two residues are Voronoï neighbours if their Voronoï cells share a face.
Voronoï cell of a leucine (Leu 64, complex 1BTH)
Schematic view of the Voronoï neighbourhood definition
The neighbourhood definition is not based on distance.
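The neighbour rule above can be illustrated with a minimal sketch. It assumes each residue is reduced to a single 3D point and uses the duality with the Delaunay triangulation: two points are Voronoï neighbours when they share an edge of a tetrahedron whose circumsphere contains no other point. The brute-force search below is only for illustration on a handful of points; it is not the CGAL algorithm used in the study, and the function names are hypothetical.

```python
from itertools import combinations

def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def circumsphere(p, q, r, s):
    """Centre and squared radius of the sphere through four points,
    or None when the four points are (nearly) coplanar."""
    # Equal distance to p and t gives: 2 (t - p) . x = |t|^2 - |p|^2
    rows = [tuple(2.0 * (t[k] - p[k]) for k in range(3)) for t in (q, r, s)]
    rhs = [sum(t[k] ** 2 - p[k] ** 2 for k in range(3)) for t in (q, r, s)]
    d = det3(rows)
    if abs(d) < 1e-12:
        return None
    centre = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [tuple(rhs[i] if k == col else rows[i][k] for k in range(3))
             for i in range(3)]
        centre.append(det3(m) / d)
    r2 = sum((centre[k] - p[k]) ** 2 for k in range(3))
    return centre, r2

def voronoi_neighbours(points, eps=1e-9):
    """Map each point index to the indices of its Voronoï neighbours.
    Cospherical (degenerate) configurations may yield extra neighbours."""
    neighbours = {i: set() for i in range(len(points))}
    for tet in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in tet))
        if sphere is None:
            continue
        centre, r2 = sphere
        empty = all(
            sum((points[j][k] - centre[k]) ** 2 for k in range(3)) >= r2 - eps
            for j in range(len(points)) if j not in tet
        )
        if empty:  # Delaunay tetrahedron: all its edges are neighbour pairs
            for a, b in combinations(tet, 2):
                neighbours[a].add(b)
                neighbours[b].add(a)
    return neighbours

# Toy example: a central point, six points around it, and one distant point.
pts = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0),
       (0, -1, 0), (0, 0, 1), (0, 0, -1), (0, 0, 3)]
nb = voronoi_neighbours(pts)
```

In this toy set, points 1 and 2 are only 2 units apart but are not neighbours because the central cell separates them, whereas point 1 and the distant point 7 (about 3.2 units apart) are neighbours. This is the sense in which the neighbourhood is not distance-based.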
9 Building the solvent sphere
Complex 1BTH and its "solvent sphere"
10 Voronoï protein-protein interface definition
Voronoï interface
A residue is at the Voronoï interface between
two subunits if at least one of its Voronoï
neighbours belongs to another subunit and none of
its Voronoï neighbours is of solvent type.
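The interface rule can be written as a short filter. This is a sketch under stated assumptions: the neighbour map, the subunit labels and the set of solvent points are hypothetical inputs, supplied here as plain dictionaries and sets.

```python
def voronoi_interface(neighbours, subunit, solvent):
    """Return the residues at the Voronoï interface.

    neighbours: residue id -> set of neighbouring ids (residues or solvent)
    subunit:    residue id -> subunit label
    solvent:    set of ids that are solvent points, not residues
    """
    interface = set()
    for res, nbs in neighbours.items():
        if res in solvent:
            continue
        touches_other_subunit = any(
            n not in solvent and subunit[n] != subunit[res] for n in nbs
        )
        touches_solvent = any(n in solvent for n in nbs)
        if touches_other_subunit and not touches_solvent:
            interface.add(res)
    return interface

# Toy example with two subunits (A, B) and one solvent point W1.
neighbours = {
    "A1": {"A2", "B1"},
    "A2": {"A1", "A3", "B1", "W1"},
    "A3": {"A2", "W1"},
    "B1": {"A1", "A2", "B2"},
    "B2": {"B1", "W1"},
    "W1": {"A2", "A3", "B2"},
}
subunit = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}
interface = voronoi_interface(neighbours, subunit, solvent={"W1"})
```

Here A2 touches subunit B but also a solvent point, so it is excluded; only A1 and B1 satisfy both conditions.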
11 Measurements
12 Distances
All distance distributions between cell centroids at the interface were computed.
They can be fitted with classical statistical functions.
Usually, the neighbourhood between residues is defined from a distance, so distance cannot be used as a parameter.
Here, the neighbourhood is defined from the Voronoï diagram and is not based on distance.
[Figure: distance distribution between leucine and valine at the interface (x-axis: distance, y-axis: number)]
[Figure: distance distribution between lysine and aspartic acid at the interface (x-axis: distance, y-axis: number)]
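A minimal sketch of how such distributions could be collected, assuming each interface residue is given as an amino acid type plus a cell centroid and that the Voronoï neighbour pairs are already known. The data layout and function name are hypothetical.

```python
from collections import defaultdict
from math import dist

def pair_distance_distributions(residues, neighbour_pairs):
    """Collect centroid-to-centroid distances per unordered residue-type pair.

    residues:        id -> (amino acid type, centroid as an (x, y, z) tuple)
    neighbour_pairs: iterable of (id, id) Voronoï neighbour pairs
    """
    distances = defaultdict(list)
    for a, b in neighbour_pairs:
        type_a, centroid_a = residues[a]
        type_b, centroid_b = residues[b]
        key = tuple(sorted((type_a, type_b)))  # (LEU, VAL) == (VAL, LEU)
        distances[key].append(dist(centroid_a, centroid_b))
    return distances

residues = {
    "L12": ("LEU", (0.0, 0.0, 0.0)),
    "V40": ("VAL", (3.0, 4.0, 0.0)),
    "V41": ("VAL", (0.0, 0.0, 2.0)),
}
pairs = [("L12", "V40"), ("V41", "L12")]
distributions = pair_distance_distributions(residues, pairs)
mean_leu_val = sum(distributions[("LEU", "VAL")]) / len(distributions[("LEU", "VAL")])
```

The per-pair lists can then be binned into histograms or averaged to give mean pair distances.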
13 Searching for a scoring function
Need for a learning dataset:
- exhaustive extraction of a non-redundant dataset of binary protein-protein complexes from the whole PDB release → 102 binary complexes
- generation of non-native conformations for learning
Validation:
- ten-fold cross-validation for each machine learning procedure
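A sketch of a ten-fold split over the 102 complexes. It assumes the folds are formed at the level of complexes, so that a native structure and its non-native conformations always fall in the same fold; the slides do not say how the folds were built, so this is an illustrative choice.

```python
import random

def ten_fold_splits(complex_ids, n_folds=10, seed=0):
    """Yield (train, test) lists of complex ids, one pair per fold."""
    ids = list(complex_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        yield train, test

splits = list(ten_fold_splits(range(102)))
```

With 102 complexes this gives two test folds of 11 complexes and eight of 10, and every complex is tested exactly once.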
14 Learning attributes
Input variables:
- Voronoï interface area: 1
At the interface:
- number of amino acids: 1
- amino acid frequencies: 20
- mean Voronoï volumes: 20
- amino acid pair frequencies: 210
- mean pair distances: 210
Total: 462 variables
15 Grouping of input parameters
According to physico-chemical properties: 6 groups
- 21 (6×7/2) possible distances
- 84 (1+1+20+20+21+21) input variables for learning procedures
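The variable counts can be checked with a few lines of arithmetic. The sketch assumes that pairs are unordered and include identical pairs, which gives n(n+1)/2 pair classes. It also assumes that the total of 84 keeps the 20 per-residue frequencies and volumes and groups only the pair attributes; that reading is inferred from the totals on the slides.

```python
def n_pair_classes(n):
    """Number of unordered pairs among n classes, identical pairs included."""
    return n * (n + 1) // 2

def n_input_variables(n_residue_types, n_pair_groups):
    interface_area = 1
    n_interface_residues = 1
    frequencies = n_residue_types
    mean_volumes = n_residue_types
    pair_frequencies = n_pair_classes(n_pair_groups)
    mean_pair_distances = n_pair_classes(n_pair_groups)
    return (interface_area + n_interface_residues + frequencies
            + mean_volumes + pair_frequencies + mean_pair_distances)

full = n_input_variables(20, 20)    # 1 + 1 + 20 + 20 + 210 + 210 = 462
grouped = n_input_variables(20, 6)  # 1 + 1 + 20 + 20 + 21 + 21 = 84
```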
16 Machine learning procedures
- Logistic function
- Support Vector Machines (SVM) (won't be described here)
- ROc based GEnetic LearneR (ROGER)
17 First model: logistic function
Parameters are discriminating
18 ROC curve for logistic function
19 Relative performance: logistic function and mean square deviation
20 ROGER (1): ROc based GEnetic learneR
- Genetic algorithm
- Goal: learning a ranking function by optimization of the area under the ROC curve (AUC)
- Input: set of putative complexes described by values of measures of interest
- Output: a rank for each putative complex indicating its relevance
(1) M. Roche, J. Azé, Y. Kodratoff, M. Sebag, ECAI04, 2004
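The AUC used as the learning goal can be computed directly from scores and labels. The sketch below uses the standard rank interpretation of the AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are illustrative only.

```python
def auc(scores, labels):
    """Area under the ROC curve for a list of scores and 0/1 labels."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Four putative complexes: two near-native (label 1), two decoys (label 0).
example = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])  # 3 of 4 pairs ordered correctly
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

A ranking function that places every positive above every negative reaches an AUC of 1, while a random ranking gives about 0.5.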
21 ROGER: ROc based GEnetic learneR
Genetic algorithm (flowchart):
- Initialisation: parents (20 non-linear functions)
- Mutation, crossover: 200 children
- Selection: based on fitness
- Replacement: best of 20+200
- Stop? (fitness AUC, iterations...)
- Output: scoring function
- 21 independent runs
- 10-fold cross-validation
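The loop on this slide can be sketched as an elitist genetic algorithm with 20 parents and 200 children. This is not the ROGER implementation. The form of the scoring function (a weighted sum of absolute deviations from learned centres), the uniform choice of parents, the one-point crossover and the Gaussian mutation are all assumptions made for illustration.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

def score(genome, x):
    """Hypothetical non-linear scoring function: sum_i w_i * |x_i - c_i|."""
    dim = len(x)
    weights, centres = genome[:dim], genome[dim:]
    return sum(w * abs(v - c) for w, v, c in zip(weights, x, centres))

def evolve(data, labels, n_parents=20, n_children=200, generations=30, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])

    def fitness(genome):
        return auc([score(genome, x) for x in data], labels)

    parents = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(n_parents)]
    history = []
    for _ in range(generations):
        children = []
        for _ in range(n_children):
            a, b = rng.sample(parents, 2)                    # parents drawn uniformly (assumption)
            cut = rng.randrange(1, 2 * dim)
            child = a[:cut] + b[cut:]                        # one-point crossover
            child = [g + rng.gauss(0, 0.1) for g in child]   # Gaussian mutation
            children.append(child)
        pool = parents + children                            # replacement: best of 20 + 200
        pool.sort(key=fitness, reverse=True)
        parents = pool[:n_parents]
        history.append(fitness(parents[0]))
    return parents[0], history

# Toy data: 15 positives and 15 negatives described by two measures.
rng = random.Random(1)
data = ([[rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(15)]
        + [[rng.gauss(-1.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(15)])
labels = [1] * 15 + [0] * 15
best, history = evolve(data, labels, generations=5)
```

Because the parents compete with their children for survival, the best AUC in `history` can never decrease from one generation to the next. In practice the fitness would be evaluated on training folds only, with the held-out fold used to assess the learned function.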
22 Relative performance: ROGER and logistic function
23 The CAPRI experiment
http://www.capri.ebi.ac.uk
24 Procedure used on CAPRI targets
In this study:
- exploration is done using Dock from Wodak and Janin (1) and Haddock from Dominguez, Boelens and Bonvin (2)
- scoring is done using no biological information
- the scoring function was obtained with ROGER (3)
(1) J. Janin and S. Wodak (1985) Biopolymers 24:509
(2) C. Dominguez, R. Boelens and A. Bonvin (2003) J. Am. Chem. Soc. 125:1731
(3) M. Roche, J. Azé, Y. Kodratoff, M. Sebag (2004) ECAI04
25 CAPRI Target 11: cohesin-dockerin complex, rank 3 (Dock exploration)
Percentage of interface residues predicted: 43
26 CAPRI Target 11: cohesin-dockerin complex, ranks 1-5 (Dock exploration)
27 CAPRI Target 11: cohesin-dockerin complex, ranks 1-5 (Dock exploration)
28 CAPRI Target 11: cohesin-dockerin complex, rank 4 (Haddock exploration)
Percentage of interface residues predicted: 87
29 CAPRI Target 11: cohesin-dockerin complex, ranks 1-5 (Haddock exploration)
30 Conclusion
- To compare scoring performance, 10 classes were defined depending on the fraction of native contacts.
- Dock: 10 CAPRI targets analysed
- - For 9 targets (9-15,7-19), one of the best-class solutions of the set was ranked number 1.
- - For 1 target (16), the solution ranked number 1 belonged to the 2nd best class.
- Haddock: 5 CAPRI targets analysed
- - For 3 targets (11,13,14), one of the best-class solutions of the set was ranked in the top 4. The top 50 contains very few false positives.
- - For 2 targets (12,15), one of the best-class solutions of the set was ranked in the top 2, but the top 50 contains false positives.
31 Perspectives
- Among all those targets, improve detection of false positives in the top 50.
- First results of the scoring function are very promising.
- Try other fitness functions.
- Atomic refinement of protein-protein complexes is needed.
- Address the ligand problem.
32 Acknowledgments
Anne Poupon, Herman van Tilbeurgh, Joël Janin
Jérôme Azé
Aalt-Jan van Dijk, Alexandre Bonvin
33 First model: logistic function
Study of the influence of the explanatory variables xi on the response variable Y, where Y has only two possible values, 1 or 0 (i.e. true or false).
Influence of xi on success rate
Regression: logistic regression model
- The influence of each variable is known.
- This logistic function is equivalent to a perceptron, i.e. a neural network without a hidden layer.
- Variable weights are estimated by maximum likelihood with the R software.
Probability
Cross-validation
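For reference, the standard form of a logistic regression model is recalled below. The symbols β0 and βi (the intercept and the variable weights) and the index k over training complexes are notation chosen here, not taken from the slides. The second line is the log-likelihood that maximum likelihood estimation maximizes.

```latex
P(Y = 1 \mid x_1, \dots, x_n)
  = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)\right)}

\ell(\beta) = \sum_{k} \Big[\, y_k \log p_k + (1 - y_k) \log (1 - p_k) \Big],
\qquad p_k = P(Y = 1 \mid x^{(k)})
```

The sign and magnitude of each βi give the direction and strength of the influence of xi on the success probability, which is why the influence of each variable is known.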
34 A second model
Measure of mean square deviations:
each variable xi measured on a complex is compared to its mean value on the reference set of true complexes.
That means
35 Recall and sensitivity definitions
36 Volumes
Mean Voronoï volumes for the amino acids
Correlation between core Voronoï volumes and Pontius volumes: 0.93 → linear relationship
Mean Voronoï volumes of interface cells are significantly larger than Voronoï volumes of inner residues.
[Figure: interface residues, buried residues and Pontius (1) volumes of amino acids (x-axis: amino acid, y-axis: volume in Å³; series: buried residues, interface residues, Pontius)]
(1) Pontius et al., J. Mol. Biol., 1996 Nov 22; 264(1):121-36