Title: Prediction Enhancement of Protein-Water Binding Conservation through Evolutionary Computation
- Michael Peterson, Travis Doom, Michael Raymer
- Abstract
- The design of drugs to combat various diseases is an extremely expensive and time-consuming process. Potentially, computational ligand screening will reduce the time and expense associated with drug lead discovery. Correctly predicting sites of water conservation on a protein surface can significantly increase the accuracy of ligand screening efforts. Traditional classification methods make correct predictions with approximately 60% accuracy. The goal of our research is to improve prediction accuracy by applying evolutionary computing (EC) to traditional methods of data classification. We present a method that improves accuracy by applying EC feature selection and extraction techniques to k-nearest neighbor and naïve Bayes classifiers. To facilitate this research, a versatile EC engine was developed in Java. Despite Java's object-oriented nature, few general-purpose Java-based EC engines exist. Our engine, with several unique features, will therefore be useful to the EC community, and will be available via the World Wide Web.
I. Protein-Ligand Binding
- When a ligand binds to the surface of a protein, water molecules near that position will either be conserved or displaced.
- Our goal is to accurately predict sites of water conservation.
[Figure: a ligand binding at the protein surface, with labels for the protein surface, a water molecule, and the ligand]
II. Protein-Water Measurements
- Within a set of 30 proteins, 8 features are measured for each water molecule:
- Temperature factor (BVAL)
- Atomic Density (ADN)
- Atomic Hydrophilicity (AHP)
- Hydrogen bonds to protein (HBDP)
- Hydrogen bonds to water (HBDW)
- Mobility (MOB)
- ABVAL (Avg. B-val of protein atom neighbors)
- NBVAL (Net B-val of protein atom neighbors)
III. Feature-Weighted knn Classification
- Applying weights to measured features can improve the accuracy of a k-nearest neighbor classifier. Weights can be optimized by a genetic algorithm.
[Figure: (a) training points from two classes, P and W, plotted against Feature 1 and Feature 2; (b) the same points after weighting extends the scale of one feature axis, improving class separation]
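The axis-scaling idea above can be sketched in Java. This is a minimal, illustrative implementation (the class and method names are ours, not the authors' code): each feature axis is scaled by a weight inside the distance computation, which is equivalent to stretching that axis as in the figure.

```java
import java.util.Arrays;

public class WeightedKnn {
    // Weighted Euclidean distance: scaling feature i by w[i] is
    // equivalent to extending that feature axis's scale.
    static double weightedDistance(double[] a, double[] b, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += w[i] * d * d;
        }
        return Math.sqrt(sum);
    }

    // Classify a query point by majority vote among its k nearest
    // training points under the weighted distance (labels 0 or 1).
    static int classify(double[][] train, int[] labels, double[] q,
                        double[] w, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (i, j) -> Double.compare(
                weightedDistance(train[i], q, w),
                weightedDistance(train[j], q, w)));
        int votes = 0;
        for (int n = 0; n < k; n++) votes += labels[idx[n]] == 1 ? 1 : -1;
        return votes > 0 ? 1 : 0;
    }
}
```

A genetic algorithm would then search over the weight vector `w`, scoring each candidate by classification accuracy on a training set.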
IV. A Parameterized Discriminant for a Bayes Classifier
- A confidence value P is computed for each class.
- The class with the greatest value of P is selected.
- When C1 = C2 = ... = Cd = 1, the discriminant function is equivalent to the naïve Bayes classifier.
- The values of the coefficients, C1..Cd, are supplied by an evolutionary computation optimizer.
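The slide does not show the discriminant's exact form, so the sketch below is one plausible interpretation consistent with the reduction stated above: each per-feature log-likelihood is scaled by an evolved coefficient C_i, so that with all C_i = 1 the expression is exactly the log-space naïve Bayes discriminant. Names and the parameterization are assumptions, not the authors' code.

```java
public class ParamBayes {
    // g(x) = log P(class) + sum_i C_i * log P(x_i | class).
    // logLikelihoods[i] = log P(x_i | class); c[i] are the evolved
    // coefficients C_1..C_d (all 1.0 recovers naive Bayes).
    static double discriminant(double[] logLikelihoods, double logPrior,
                               double[] c) {
        double g = logPrior;
        for (int i = 0; i < logLikelihoods.length; i++) {
            g += c[i] * logLikelihoods[i];
        }
        return g;
    }

    // Select the class with the greatest discriminant value.
    static int argmax(double[] g) {
        int best = 0;
        for (int j = 1; j < g.length; j++) if (g[j] > g[best]) best = j;
        return best;
    }
}
```

An EC optimizer supplies candidate coefficient vectors `c` and scores each by the resulting classification accuracy.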
Va. A Simple Evolutionary Algorithm
- Over a number of generations, the values in a population of individuals are optimized via operations abstracted from natural selection.
[Figure: the evolutionary loop of evaluation, selection, and genetic operators, with fitness increasing over generations]
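The evaluate/select/operate loop can be sketched as a minimal generational EA in Java. This is an illustrative toy (the fitness function, tournament selection, and mutation parameters are our choices, not the engine's): it maximizes a simple one-dimensional fitness peaked at x = 3.

```java
import java.util.Random;

public class SimpleEA {
    static final Random RNG = new Random(42);

    // Toy fitness: maximized at x = 3.0.
    static double fitness(double x) { return -(x - 3.0) * (x - 3.0); }

    static double evolve(int popSize, int generations) {
        // Initial population: random values in [0, 10).
        double[] pop = new double[popSize];
        for (int i = 0; i < popSize; i++) pop[i] = RNG.nextDouble() * 10.0;
        for (int g = 0; g < generations; g++) {
            double[] next = new double[popSize];
            for (int i = 0; i < popSize; i++) {
                // Evaluation + selection: binary tournament.
                double a = pop[RNG.nextInt(popSize)];
                double b = pop[RNG.nextInt(popSize)];
                double parent = fitness(a) > fitness(b) ? a : b;
                // Genetic operator: Gaussian mutation.
                next[i] = parent + RNG.nextGaussian() * 0.1;
            }
            pop = next;
        }
        double best = pop[0];
        for (double x : pop) if (fitness(x) > fitness(best)) best = x;
        return best;
    }
}
```

Over the generations the population drifts toward the fitness peak, mirroring the "increasing fitness" arrow in the figure.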
V. GA/Classifier Hybrid Architecture
- Weight Vector: weights to use for each feature axis during classification.
- Mask Vector: masks used to hide features during classification, as a method of feature selection.
- Fitness: based on the number of correct predictions using the weight vector, plus the number of masked features.
[Figure: the genetic algorithm maintains a population of weight/mask sets (W1 W2 ... W8, M1 M2 ... M8); each set is passed to the knn classifier for fitness evaluation]
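The hybrid's fitness can be sketched as follows. This is an assumed formulation (names and the parsimony bonus are ours): masking a feature is modeled by zeroing its weight, which removes its contribution to the weighted distance, and fitness rewards both correct predictions and the number of features masked out.

```java
public class HybridFitness {
    // Fitness = (# correct predictions) + bonus * (# masked features).
    // The parsimony bonus weighting is an illustrative assumption.
    static double fitness(boolean[] correct, boolean[] mask,
                          double parsimonyBonus) {
        int right = 0;
        for (boolean c : correct) if (c) right++;
        int masked = 0;
        for (boolean m : mask) if (m) masked++;
        return right + parsimonyBonus * masked;
    }

    // Applying the mask: zeroing a feature's weight removes that
    // feature's contribution to a weighted knn distance.
    static double[] effectiveWeights(double[] w, boolean[] mask) {
        double[] out = w.clone();
        for (int i = 0; i < w.length; i++) if (mask[i]) out[i] = 0.0;
        return out;
    }
}
```

Rewarding masked features pushes the GA toward parsimonious weight sets that use only the most informative of the 8 measurements.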
VI. Original EC Engine
- In order to optimize weights and masks, we implemented a new EC engine with several useful features.
- Chromosome: consists of a feature vector and an optional mask vector. Mask bits are used to block features from being passed to the fitness function.
F1 F2 ... Fn | M1 M2 ... Mn
- Groups: the feature vector may consist of groups of real, integer, or boolean values. Each group may have its own mutation rate and reproduction method.
- Mutation: each group may have its own mutation rate and method, and the mask vector has a separate mutation rate and method. Random, range-based, and variance mutations are permitted.
- Reproduction: 1-point, 2-point, or uniform crossover is permitted, either on a per-group basis or across the entire feature chromosome. These methods also apply to the mask chromosome.
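A grouped chromosome with per-group mutation settings could be sketched as below. This is not the engine's actual API, only an illustration of the design: one real-valued weight group using variance (Gaussian) mutation, and a boolean mask group using bit-flip mutation, each with its own rate.

```java
import java.util.Random;

public class GroupedChromosome {
    final double[] weights; // real-valued feature group
    final boolean[] mask;   // boolean mask group

    GroupedChromosome(double[] weights, boolean[] mask) {
        this.weights = weights;
        this.mask = mask;
    }

    // Each group mutates with its own rate and method: variance
    // (Gaussian) mutation for the real group, bit-flip for the mask.
    GroupedChromosome mutate(Random rng, double weightRate,
                             double weightSigma, double maskRate) {
        double[] w = weights.clone();
        boolean[] m = mask.clone();
        for (int i = 0; i < w.length; i++)
            if (rng.nextDouble() < weightRate)
                w[i] += rng.nextGaussian() * weightSigma;
        for (int i = 0; i < m.length; i++)
            if (rng.nextDouble() < maskRate)
                m[i] = !m[i];
        return new GroupedChromosome(w, m);
    }
}
```

Per-group crossover would follow the same pattern, recombining each group (or the whole feature vector) with its configured method.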
VII. Original EC Engine
- Our EC engine is implemented in Java, adding versatility and portability.
- The user provides the fitness function, which can be implemented in either Java or C/C++.
- Our engine is ideally suited for feature selection and extraction problems, such as this protein-water binding example.
- We use it to optimize feature weights and remove unnecessary features for both knn and Bayes classifiers.
- The engine can be used for many problems to which genetic algorithms have been applied in the past.
VIII. Conserved vs. Non-Conserved
[Table: optimized feature weights (values such as 0.000, 0.336, 0.664, 0.683) and classification accuracies of approximately 59.8-60.7%; the column and row labels were lost in extraction]
- This data is the result of previous work on this problem by Mike Raymer, using a knn classifier with a genetic algorithm implemented by the University of California, San Diego.
VIIIb. Determinants of Solvation
- A Continuum of Favorability
- Binding
- Atomic hydrophilicity
- Atomic density
- Hydrogen bonding potential
- Conservation
- Low B-Value, High Occupancy
- Many hydrogen bonds to protein atoms
IX. Other EC Applications
- Structural Bioinformatics
- Prediction of metal binding sites
- Location of protein active sites
- Prediction of drug lead activity (QSAR)
- Gene Classification and Discovery
- Prediction of gene function from microarray data
- Ligand Screening and Docking
- SLIDE (Volker Schnecke, MSU)
- Crystallographic Solvent Fitting
- Xfit (Duncan McRee, Scripps)