Title: PREDICTION OF CATALYTIC RESIDUES IN PROTEINS USING MACHINE-LEARNING TECHNIQUES
Natalia V. Petrova (Ph.D. Student, Georgetown
University, Biochemistry Department), Cathy H. Wu
(Protein Information Resource, Georgetown
University, Departments of Biochemistry and
Molecular Biology and Oncology)
[Figure panels (numbered 1-3 on the poster). Recovered titles:
PRELIMINARY ATTRIBUTE SET; BENCHMARKING DATASET. Recovered caption
text: SMO is the best-performing algorithm (among those tested) for
the prediction of catalytic residues.]
ABSTRACT
We present a method for predicting catalytic residues in
proteins using machine-learning techniques. Using a
benchmarking dataset of enzymes with known catalytic sites,
we identified the best-performing machine-learning algorithm
among those tested (a support vector classifier, SMO) and the
features of protein residues that are relevant for the
prediction of catalytic residues. The method can predict
catalytic residues and the 3D location of the active site
with an accuracy greater than 86% for proteins of unknown
function, provided that the structure of the protein is
known.
[Figure panels: WRAPPER ATTRIBUTE SELECTION ALGORITHM;
BEST-PERFORMING CLASSIFIER SMO (4)]
METHODS
To train a machine-learning algorithm, we used a
benchmarking dataset that is a subset of the Catalytic
Residue Dataset. Every protein in the benchmarking dataset
is a member of a manually curated protein family in the PIR
iProClass database. The dataset contains 254 catalytic
residues from 79 proteins, out of the 178 enzymes in the
Catalytic Residue Dataset (1). From the Catalytic Residue
Dataset we built a dataset in which each instance is
represented as a list of attribute values and a class label,
1 or -1, indicating whether the residue is catalytic (1) or
not (-1). Each attribute in this dataset is a property of a
protein residue. The list of attributes was chosen mostly on
the basis of the work of Bartlett et al. and of other authors
who pointed out the importance of particular residue
properties (2). Since for a complex dataset it is almost
impossible to know a priori which classification algorithm
will perform best, our first goal was to determine one of the
best-performing algorithms among the machine-learning
techniques built into WEKA, a Java software package (3, 4).
Different authors seem to focus on different features of the
protein in order to predict catalytic residues. We therefore
identified the features of protein residues that are relevant
for the prediction of catalytic residues, using our
benchmarking dataset of enzymes with known catalytic sites
and the Wrapper machine-learning attribute selection
algorithm (5). The selected attributes, combined with the
best-performing algorithm, were used to build a model for the
prediction of catalytic residues (6).
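The wrapper idea described above (score candidate attribute subsets by the cross-validated accuracy of the classifier itself) can be sketched as follows. This is a minimal illustration under stated assumptions, not the WEKA procedure used here: a nearest-centroid classifier stands in for SMO, the search is a greedy forward search (the poster does not state the search strategy), and the data are synthetic, with only attributes 0 and 1 carrying signal. All function names are hypothetical.

```python
import random

def predict(train, x, feats):
    """Nearest-centroid prediction using only the attribute indices in feats."""
    best_label, best_dist = None, float("inf")
    for label in (1, -1):
        rows = [r for r, y in train if y == label]
        centroid = [sum(r[f] for r in rows) / len(rows) for f in feats]
        dist = sum((x[f] - c) ** 2 for f, c in zip(feats, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cv_accuracy(data, feats, k=5):
    """k-fold cross-validated accuracy restricted to the attributes in feats."""
    correct = 0
    for fold in range(k):
        train = [d for i, d in enumerate(data) if i % k != fold]
        test = data[fold::k]
        correct += sum(predict(train, x, feats) == y for x, y in test)
    return correct / len(data)

def forward_selection(data, n_attrs):
    """Greedily add the attribute that most improves cross-validated accuracy."""
    selected, best = [], 0.0
    while len(selected) < n_attrs:
        score, attr = max(
            (cv_accuracy(data, selected + [a]), a)
            for a in range(n_attrs) if a not in selected
        )
        if score <= best:  # stop when no remaining attribute helps
            break
        selected.append(attr)
        best = score
    return selected, best

def make_toy_data(n=200, seed=0):
    """Synthetic instances: (attribute list, label in {1, -1}).

    The label depends only on attributes 0 and 1; attributes 2 and 3 are noise.
    """
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        x = [rng.uniform(-1, 1) for _ in range(4)]
        s = x[0] + x[1]
        if abs(s) < 0.2:  # keep a margin between the two classes
            continue
        data.append((x, 1 if s > 0 else -1))
    return data

data = make_toy_data()
selected, best = forward_selection(data, n_attrs=4)
print("selected attributes:", selected, "cv accuracy:", round(best, 3))
```

On this toy data neither informative attribute separates the classes on its own, so the search has to pick both before the accuracy becomes high.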
INTRODUCTION
One of the major goals of proteomics is to assign a function
to every protein. Knowledge of a protein's function is key
to determining the role it plays in the cell. The number of
proteins whose functions have been experimentally
characterized is growing linearly every year. Experimental
data provide (in most cases) reliable information about
functional residues as well as the possible mechanism of
protein function. However, the analytical methods used for
experimental characterization of protein function require
many man-hours. This effort can be reduced by improving
existing methods or, perhaps, by developing new methods in
experimental biology. But since the protein sequence and
protein structure databases are growing exponentially, the
gap between experimentally characterized and uncharacterized
proteins is also growing exponentially. As a result, two
major groups of computational methods are being progressively
developed: homology transfer of known experimental data, and
prediction of protein function using various properties of
proteins and amino acids. Prediction of functional residues
is a challenging and interesting task. The results of such
prediction could be used in many research areas, such as
drug design, experimental biology, and protein database
annotation.
CONCLUSIONS
- The performance of the support vector classifier suggests
that linear separation using one dimension, corresponding to
one feature, is not sufficient for the prediction of
catalytic residues.
- Reducing the number of attributes increases the prediction
accuracy of the SMO algorithm.
- 8 out of 24 attributes are selected as relevant for the
prediction of catalytic residues.
- The SMO algorithm trained on the dataset represented by the
selected attributes has a prediction accuracy greater than
86%, a TP rate of 0.898, and an FP rate of 0.126.
[Figure panels: FINAL ATTRIBUTE SET (5); Model (6)]
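For readers unfamiliar with the reported measures, the sketch below shows how prediction accuracy, TP rate, and FP rate are conventionally computed from true and predicted labels, using the 1 / -1 class encoding from the Methods. The label lists are invented for illustration and are not the poster's data; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for labels 1 (catalytic) and -1 (not catalytic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return tp, fp, tn, fn

def rates(y_true, y_pred):
    """Accuracy, TP rate (TP / all actual positives), FP rate (FP / all actual negatives)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
    }

# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, 1, -1]
result = rates(y_true, y_pred)
print(result)
```

With these toy labels there are 3 true positives, 1 false negative, 1 false positive, and 5 true negatives, giving an accuracy of 0.8, a TP rate of 0.75, and an FP rate of 1/6.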
EXAMPLES OF PREDICTION
[Figure panels: Acetyl-CoA acetyltransferase, 1afw;
Acylphosphatase, 2acy]
REFERENCES
- GenBank database statistics,
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
- PDB database statistics,
http://www.rcsb.org/pdb/holdings.html
- Bartlett G. J., Porter C. T., Borkakoti N., Thornton J. M.,
Analysis of Catalytic Residues in Enzyme Active Sites.
J. Mol. Biol., 324: 105-121, 2002
- Campbell S. J., Gold N. D., Jackson R. M., Westhead D. R.,
Ligand binding: functional site location, similarity and
docking. Current Opinion in Structural Biology, 13: 389-395,
2003
- Sjolander K., Karplus K., Brown M., Hughey R., Krogh A.,
Mian S., Haussler D., Dirichlet Mixtures: A Method for
Improved Detection of Weak but Significant Protein Sequence
Homology, 1996
- Smith D. K., Radivojac P., Obradovic Z., Dunker A. K.,
Zhu G., Improved amino acid flexibility parameters. Protein
Science, 12: 1060-1072, 2003
[Figure legends, EXAMPLES OF PREDICTION: True Positive (TP)
red, False Positive (FP) yellow; True Positive (TP) red,
False Positive (FP) blue]
RESULTS
SMO (the support vector classifier) was found to be the
best-performing algorithm (among those tested) for the
prediction of catalytic residues (4). 8 attributes out of 24
were selected as relevant. As anticipated, attribute
selection improved the performance of the SMO classifier (5).
We also measured the prediction accuracy of the algorithm
with each individual attribute removed in turn and found that
no attribute can be excluded from the final list without
reducing the performance of the SMO classifier (5).
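The leave-one-attribute-out check described above can be sketched as below. This is an illustration only: `evaluate` is assumed to be any function that retrains and scores the classifier on a given attribute list, and the toy scoring function with its attribute names and contribution values is invented so that the example runs; it is not the SMO evaluation used here.

```python
def leave_one_out_ablation(evaluate, attributes):
    """Return the baseline score and the score drop from removing each attribute."""
    baseline = evaluate(attributes)
    drops = {}
    for attr in attributes:
        reduced = [a for a in attributes if a != attr]
        drops[attr] = baseline - evaluate(reduced)
    return baseline, drops

# Hypothetical stand-in for "train and cross-validate the classifier on these
# attributes": each invented attribute adds a fixed amount to the score.
TOY_CONTRIBUTION = {"attr_a": 0.10, "attr_b": 0.05, "attr_c": 0.02}

def toy_evaluate(attributes):
    return 0.5 + sum(TOY_CONTRIBUTION[a] for a in attributes)

baseline, drops = leave_one_out_ablation(toy_evaluate, list(TOY_CONTRIBUTION))

# An attribute is indispensable if removing it lowers the score.
indispensable = all(drop > 0 for drop in drops.values())
print(round(baseline, 2), {a: round(d, 2) for a, d in drops.items()}, indispensable)
```

In practice `toy_evaluate` would be replaced by a real train-and-evaluate routine; a positive drop for every attribute corresponds to the finding that none of the selected attributes can be excluded.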
[Figure labels, EXAMPLES OF PREDICTION: catalytic residues
C125, H375, C403, G405 (1afw); catalytic residues R23, N41
(2acy)]
ACKNOWLEDGEMENTS
This work would not have been complete without the wise help
and guidance provided by our colleagues at PIR:
- Hongzhan Huang, Ph.D. (PIR Team Lead, Bioinformatics and
Research Assistant Professor)
- Sona Vasudevan, Ph.D. (PIR Senior Bioinformatics Scientist)
- C.R. Vinayaka, Ph.D. (PIR Senior Research Scientist)