Title: Computational Protein Design: A problem in combinatorial optimization
1Computational Protein Design: A problem in combinatorial optimization
- CSE 549 Guest Lecture
- September 17, 2009
- David Green
- Applied Mathematics & Statistics
2What is a protein?
- Polymers (chains) of amino acids.
- There are 20 different amino acids that can be part of the chain.
- Machines of the cell.
- It's proteins that do most of the work involved in life!
3Polymers of amino acids.
- Amino acids link to form polypeptides.
- There is a backbone of constant composition.
- There are side chains that vary.
4The twenty amino acids.
- AA side chains vary from
- Big to small.
- Non-polar (all C and H) to polar.
- Positive to negative.
- Flexible to rigid.
5The machinery of life.
- Protein sensors (receptors) are responsible for all the senses (sight, smell, taste, touch, hearing).
- Enzymes are proteins that catalyze chemical reactions, like the ones that convert food to energy.
- Specialized structural proteins make skin elastic, and make the lens of the eye work.
- Muscles are primarily composed of proteins that combine structural and enzymatic parts to make a machine.
6Why design proteins?
- New sensors based on biology.
- Proteins have been engineered to detect TNT (explosive) and sarin (nerve gas).
- Proteins are used as treatments for many diseases.
- Protein engineering has helped improve proteins that are given to cancer patients on radiation or chemotherapy.
- Work in the Green lab is ongoing to design proteins for use as anti-HIV prophylactics.
- Many nanotechnology applications that haven't even been considered yet!
7Where do proteins come from?
- The genome contains instructions for every protein in a cell.
- A few HUGE molecules of DNA.
- Each gene is the code for one protein.
- There are 30,000 genes in humans.
- Genes are expressed through an intermediate molecule, RNA.
- Many copies of each protein can be made.
8The Central Dogma of Molecular Biology.
- Then proteins do the work!
9How do proteins work?
- Proteins fold into a unique 3-dimensional structure.
- The amino acid sequence of a protein dictates its structure.
- The function of a protein is controlled by its structure.
10Many polymers are long, unstructured chains.
- Polyethylene
- Is made of long chains of the same monomers.
- Adopts a random mesh of inter-weaving strands.
- This structure gives us PLASTIC!
11DNA has the same structure for every sequence.
- The double helix is a great structure for storing and replicating information.
12Protein structures are well-defined and diverse!
- One chain or many.
- Elongated or globular.
- Many forms of symmetry (or none).
13What does a protein look like?
- Cyanovirin: a protein that inhibits the entry of HIV into human cells.
14What does a protein look like?
- The atoms of a protein form a compact,
well-packed cluster.
15What does a protein look like?
- A protein can be thought of as a nearly solid
object.
16What does a protein look like?
- Simplified cartoons make the structure easier to
see.
17What does a protein look like?
- The path of the backbone of a protein is called
its fold.
18What does a protein look like?
- Different types of amino acids are found all
along the protein chain.
19What does a protein look like?
- Each amino acid has a side chain that protrudes
from the backbone.
20What does a protein look like?
- Many proteins bind other molecules, like the
sugar molecules here.
21What does a protein look like?
- Binding interfaces are usually a close fit of two
complementary surfaces.
22What does a protein look like?
- The core of a protein is key in keeping a stable
structure.
23Many side chains fill the core.
24The core is well packed
25 with groups from all along the chain.
26Each side chain fits perfectly.
27What is a protein?
- A protein is a complicated three-dimensional structure, made up by an amazing 3-D jigsaw puzzle of interlocking amino acids.
- Amino acids pack together not just geometrically, but with complementary chemical groups as well.
- Proteins move too, but we'll ignore that for now.
28How can we design one?!?
- Choose a fold (path of the backbone).
- Pack the core with the right set of amino acids to achieve the desired fold.
- Choose other amino acids to achieve the desired function (such as binding to a target molecule, or getting the right molecular motions).
29Structure prediction is a forward problem.
- Given a protein sequence, what is the structure that it will adopt (fold to)?
- This is a VERY hard problem, and it is not yet fully solved.
- Prediction is difficult because you are stuck with what nature gives you.
30Protein design is an inverse problem.
- Given a desired 3-dimensional protein structure, what is a sequence that will fold to that structure?
- We have the freedom to add constraints that simplify the problem.
- As a result, methods for protein design have had many successes.
- Pabo. Nature 301 200 (1983).
- Drexler. PNAS 78 5275-5278 (1981).
31A designed sequence should fold according to design.
- ANY sequence which folds to the correct target structure (and carries out the desired function) can be considered a successful design.
- There is more than one right answer, unlike in prediction!
32Choosing a backbone fold.
- The structure dictates the function, and a big part of structure is the fold.
- We still don't really know how to choose the best fold.
- Instead, we just borrow from nature: redesign a natural protein to do something new.
33Zinc finger proteins bind DNA.
34A Zinc ion holds them together.
- The protein will not fold if zinc is not present.
- The protein only binds DNA when it is folded.
- A group at Caltech set out to design a zinc finger that doesn't need zinc!
35 1997: The first fully automated protein design!
- Dahiyat and Mayo. Science 278 82-87 (1997).
36Designing function.
- Making a molecule bind is like designing the core: we want to make the interface between the two pieces complementary.
- Other functions are a lot trickier and we don't have good ways to solve them yet, but we're on our way.
37 2003: A Duke group designs a set of protein sensors.
- Looger, Dwyer, Smith and Hellinga. Nature 423
185-190 (2003).
38Protein design is a BIG problem.
- The zinc finger is one of the smallest protein domains: about 30 amino acids long.
- How many different 30 amino acid polypeptides are there?
- Choose from any of 20 amino acids at each position.
- Total sequences: $20^{30} \approx 1 \times 10^{39}$
- Mass of earth: $6 \times 10^{27}$ g
- Mass of a grain of sand: $1 \times 10^{-3}$ g
- That is nearly 200 million earths' worth of sand grains.
- Enumeration of possible states is beyond impossible: we must take advantage of the need to achieve complementary interactions between amino acids.
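The size comparison above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the masses are the round figures quoted on the slide, and the variable names are made up for this example.

```python
# Illustrative arithmetic only; masses are the slide's round figures.
SEQUENCE_LENGTH = 30   # residues in a small zinc finger domain
AMINO_ACIDS = 20       # choices at each position

total_sequences = AMINO_ACIDS ** SEQUENCE_LENGTH

EARTH_MASS_G = 6e27        # grams
SAND_GRAIN_MASS_G = 1e-3   # grams
grains_per_earth = EARTH_MASS_G / SAND_GRAIN_MASS_G

# How many earths made entirely of sand would hold one grain per sequence?
earths_of_sand = total_sequences / grains_per_earth

print(f"possible sequences: {total_sequences:.2e}")
print(f"sand grains in one earth: {grains_per_earth:.2e}")
print(f"earths of sand needed: {earths_of_sand:.2e}")
```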
39Many different structures are possible.
- An arginine and a glutamate interact.
40Many different structures are possible.
- An arginine and a glutamate interact.
41Many different structures are possible.
- An arginine and a glutamate interact.
42Many different structures are possible.
- An arginine and a glutamate interact in several
different conformations.
43Really Big!!!
- Amino-acid side chains are flexible.
- But not every shape (conformation) is equal.
- Each amino acid has a set of preferred conformations (rotamers).
- 1 to 80 per amino acid.
- Instead of choosing from 20 amino acids we need to choose from 400 (at least) amino acid rotamers!
- Total structures: $400^{30} \approx 1 \times 10^{78}$
- (approx. number of atoms in the universe!!!!!)
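To see how much the rotamer choice inflates the search space, the count can be reproduced directly. A minimal sketch, assuming the slide's lower bound of 400 rotamers per position and a 30-residue chain; the names are illustrative.

```python
POSITIONS = 30                # residues being designed
ROTAMERS_PER_POSITION = 400   # the slide's lower bound on amino acid rotamers
AMINO_ACIDS = 20

total_structures = ROTAMERS_PER_POSITION ** POSITIONS
sequences_only = AMINO_ACIDS ** POSITIONS

# Factor by which adding side-chain conformations enlarges the search space.
growth_factor = total_structures // sequences_only

print(f"rotamer-level structures: {total_structures:.2e}")
print(f"growth over sequence-only count: {growth_factor:.2e}")
```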
44Packing side chains: a puzzle.
- How do you solve a jigsaw puzzle?
- Impossible to try all combinations of piece placement
- Unique ways of placing N pieces on a grid is $(4^N)(N!)$
- For $N = 100$: $(1.6 \times 10^{60})(9.3 \times 10^{157}) \approx 1.5 \times 10^{218}$
- Trying each piece one by one is better, but still infeasible
- Number of iterative tries for an N-piece puzzle: for $N = 100$, $1.37 \times 10^{6}$
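The exhaustive count for the jigsaw analogy is easy to reproduce. The sketch below only covers the brute-force figure (four orientations per piece, every ordering of pieces); the function name is hypothetical.

```python
import math

def exhaustive_placements(n_pieces: int) -> int:
    """Ways to place n square pieces on a grid: 4 orientations each, any order."""
    orientations = 4 ** n_pieces
    orderings = math.factorial(n_pieces)
    return orientations * orderings

n = 100
print(f"4^{n}  = {4 ** n:.1e}")
print(f"{n}!  = {math.factorial(n):.1e}")
print(f"total = {exhaustive_placements(n):.1e}")
```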
45Packing side chains: be smart.
- How do you solve a jigsaw puzzle?
- Group pieces by colors and patterns.
- Iterate over matching of pieces that are complementary
- Shape is important.
- The pattern must also match.
46Pattern matching in proteins?
- What does it mean for two amino acids in the core of a protein to match?
- Must fit close together (but not too close) → Steric complementarity.
- Neighboring atoms must have complementary charges (neutral likes neutral, positive likes negative) → Electrostatic complementarity.
47Steric fit: Lennard-Jones potential.
- Van der Waals attraction between atoms at moderate distances.
- Repulsion of atoms from one another at short distances.
- If atoms are not nearby, the energy between them will be very close to zero.
- The total score of the goodness of fit in a molecule is the sum of the energy for every pair of atoms.
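The three behaviours listed above (attraction at moderate distance, repulsion at short distance, near-zero energy far away) can be seen in a small sketch of the potential. It assumes the common 12-6 form with well depth `epsilon` and contact distance `sigma`; the slide does not state which form or parameters were used, so both are assumptions here.

```python
def lennard_jones(r: float, epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """12-6 Lennard-Jones energy for two atoms a distance r apart.

    epsilon (well depth) and sigma (zero-crossing distance) are
    illustrative defaults, not parameters taken from the lecture.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

r_min = 2 ** (1 / 6)  # separation at the bottom of the well when sigma = 1
print("too close :", lennard_jones(0.9))    # positive: repulsion
print("good fit  :", lennard_jones(r_min))  # most negative: attraction
print("far apart :", lennard_jones(10.0))   # close to zero
```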
48Electrostatic fit: Coulomb's Law
- Atoms in molecules can be thought of as having tiny charges on them, even if the total charge on a molecule is zero.
- Coulomb's Law describes the energy of how two charges interact.
- The overall electrostatic fit is calculated by adding up the energy of all pairs of atoms.
- Like charges give a positive value.
- Opposite charges give a negative.
- Neutral (zero charge) groups don't matter.
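The sign rules above follow directly from the product of the two charges. A minimal sketch, assuming charges in units of the electron charge, distances in angstroms, and the conventional conversion factor of about 332 kcal·Å/(mol·e²); the `dielectric` parameter and the function name are illustrative assumptions.

```python
COULOMB_CONSTANT = 332.06  # kcal * angstrom / (mol * e^2), approximate

def coulomb_energy(q1: float, q2: float, r: float, dielectric: float = 1.0) -> float:
    """Coulomb interaction energy (kcal/mol) between two point charges."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)

print("like charges    :", coulomb_energy(+0.5, +0.5, 3.0))  # positive
print("opposite charges:", coulomb_energy(+0.5, -0.5, 3.0))  # negative
print("neutral partner :", coulomb_energy(+0.5, 0.0, 3.0))   # zero
```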
49The total energy describes the fitness of a structure.
- Van der Waals + Coulomb's Law, for every pair of atoms, and all added up.
- Negative energies are favorable, positive energies unfavorable.
- Nature works to MINIMIZE energy.
50Protein Design as a Discrete Conformational Search
[Figure: search tree with one level per position (Position 1, Position 2, Position 3); the leaves are the conformational states of the system.]
51Tree Pruning with Dead End Elimination
- Molecular Mechanics energy
- Van der Waals
- Coulombic
- Bond, angle, dihedral
i, j are positions in the sequence; $R_i$ is the rotamer choice at position i.
The Dead-end Elimination Theorem
Given two rotamer choices X and Y at position i, if the best energy of X (with any choice of rotamers at other positions) is worse than the worst energy of Y, then X cannot be part of the global energy minimum. We need to make this comparison, but the min and max functions require evaluating all states.
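The equations from this slide did not survive in the transcript. A reconstruction consistent with the text, assuming the usual pairwise-decomposable energy with self terms $E(R_i)$ and pair terms $E(R_i, R_j)$ (notation chosen here, not taken from the slide), is:

```latex
E_{\text{total}} = \sum_{i} E(R_i) + \sum_{i} \sum_{j>i} E(R_i, R_j)

% Dead-end elimination theorem: X_i cannot be in the global minimum if
\min_{R} \, E_{\text{total}}(X_i, R) \;>\; \max_{R} \, E_{\text{total}}(Y_i, R)
```

Here $R$ ranges over all combinations of rotamer choices at the positions other than $i$, which is why evaluating the min and max directly is as expensive as enumerating every state.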
52Making DEE feasible.
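$R_i$ is used to replace $X_i$ in the first equation.
The last sum is invariant with respect to our choice at position i, and thus drops out of the comparison.
But note that the minimum of a sum is never smaller than the sum of the individual minima, and the maximum of a sum is never larger than the sum of the individual maxima.
This gives a sufficient condition for the DEE theorem to hold.
The min and max are evaluated over rotamers at a single position, so entirely feasible!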
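The slide's equations are not preserved here. The steps can be sketched as follows, assuming a pairwise energy with self terms $E(\cdot)$ and pair terms $E(\cdot,\cdot)$; this is the standard form of the original DEE criterion (Desmet et al.), written in notation chosen for this sketch.

```latex
% Energy with choice X_i at position i, split into terms that involve i and terms that do not
E_{\text{total}}(X_i, R) = E(X_i) + \sum_{j \ne i} E(X_i, R_j)
  + \Big[ \sum_{j \ne i} E(R_j) + \sum_{j \ne i} \sum_{k > j,\; k \ne i} E(R_j, R_k) \Big]

% Bounding the min and max of the sums position by position
\min_{R} \sum_{j \ne i} E(X_i, R_j) \;\ge\; \sum_{j \ne i} \min_{R_j} E(X_i, R_j),
\qquad
\max_{R} \sum_{j \ne i} E(Y_i, R_j) \;\le\; \sum_{j \ne i} \max_{R_j} E(Y_i, R_j)

% Sufficient condition for eliminating X_i
E(X_i) + \sum_{j \ne i} \min_{R_j} E(X_i, R_j)
  \;>\;
E(Y_i) + \sum_{j \ne i} \max_{R_j} E(Y_i, R_j)
```

The bracketed term is the one that does not depend on the choice at position $i$.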
53Improving the bounds for DEE
The original problem statement was to compare the best structure with $X_i$ at position i to the worst with $Y_i$. It is an easier-to-satisfy, but still sufficient, criterion to find the single set of choices at all other positions with the minimum difference between choices $X_i$ and $Y_i$.
Again, we use the same trick to bound the minimum in a feasible manner.
This gives an alternate sufficient condition for the DEE theorem to hold.
This is a tighter bound on the true desired comparison, since the minimum of a difference is never smaller than the minimum of the first term minus the maximum of the second.
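The missing equations for this slide can be sketched in the form usually attributed to Goldstein (1994). The notation (self terms $E(\cdot)$, pair terms $E(\cdot,\cdot)$) is an assumption of this sketch rather than the slide's own.

```latex
% Desired comparison: X_i is worse than Y_i for every choice at the other positions
\min_{R} \big[ E_{\text{total}}(X_i, R) - E_{\text{total}}(Y_i, R) \big] > 0

% Feasible bound, one position at a time
E(X_i) - E(Y_i) + \sum_{j \ne i} \min_{R_j} \big[ E(X_i, R_j) - E(Y_i, R_j) \big] > 0

% Why it is tighter than the min/max criterion
\min_{R_j} \big[ E(X_i, R_j) - E(Y_i, R_j) \big]
  \;\ge\; \min_{R_j} E(X_i, R_j) - \max_{R_j} E(Y_i, R_j)
```

Any rotamer eliminated by the min/max criterion is therefore also eliminated by this one, but not the other way around.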
54DEE as an iterative algorithm
As rotamers are flagged as incompatible with the global minimum solution, the min/max functions are evaluated over a smaller and smaller set of choices, and so additional iterations of the comparison can eliminate more possibilities. Thus, the algorithm can be outlined as:
While (any rotamers eliminated)
    For each position in the sequence (i)
        For each rotamer choice (X) at position i
            For each rotamer choice (Y ≠ X) at position i
                If DEE criterion is satisfied
                    Eliminate choice X at position i
Each cycle makes on the order of $NR^2$ comparisons, where N is the number of positions, and R is the number of choices at each position. However, as R decreases with iteration, each subsequent cycle costs less.
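The loop outlined above can be sketched in Python. This is a minimal illustration, not the lecture's MATLAB script: it assumes a pairwise energy stored as nested lists (`self_energy[i][r]` and `pair_energy[i][j][r][s]`, both hypothetical layouts), uses the original min/max criterion, and builds a small random problem only to have something to run.

```python
import itertools
import random

def dee_singles(self_energy, pair_energy):
    """Iteratively prune rotamers using the min/max singles DEE criterion.

    Returns, for each position, the list of rotamer indices that survive.
    """
    n = len(self_energy)
    alive = [list(range(len(e))) for e in self_energy]

    def bound(i, r, pick):
        # Self energy plus the best (min) or worst (max) pair energy
        # against the surviving rotamers at every other position.
        return self_energy[i][r] + sum(
            pick(pair_energy[i][j][r][s] for s in alive[j])
            for j in range(n) if j != i
        )

    eliminated = True
    while eliminated:                      # repeat until a full pass removes nothing
        eliminated = False
        for i in range(n):
            for x in list(alive[i]):       # iterate over a copy while pruning
                best_x = bound(i, x, min)
                if any(best_x > bound(i, y, max) for y in alive[i] if y != x):
                    alive[i].remove(x)
                    eliminated = True
    return alive

def random_problem(n_positions=4, n_rotamers=5, seed=0):
    """Made-up energies for demonstration; not real force-field values."""
    rng = random.Random(seed)
    self_e = [[rng.uniform(-5, 5) for _ in range(n_rotamers)]
              for _ in range(n_positions)]
    pair_e = [[None] * n_positions for _ in range(n_positions)]
    for i in range(n_positions):
        for j in range(i + 1, n_positions):
            table = [[rng.uniform(-1, 1) for _ in range(n_rotamers)]
                     for _ in range(n_rotamers)]
            pair_e[i][j] = table
            pair_e[j][i] = [list(col) for col in zip(*table)]  # transpose
    return self_e, pair_e

def total_energy(state, self_e, pair_e):
    """Sum of self energies plus each pair energy counted once."""
    n = len(state)
    return (sum(self_e[i][state[i]] for i in range(n))
            + sum(pair_e[i][j][state[i]][state[j]]
                  for i in range(n) for j in range(i + 1, n)))

self_e, pair_e = random_problem()
survivors = dee_singles(self_e, pair_e)
print("surviving rotamers per position:", [len(s) for s in survivors])
```

Because elimination only ever removes rotamers that cannot be in the global minimum, a brute-force search over the small example should always find its best state among the survivors.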
55DEE identifies branches that are incompatible
with the global minimum
[Figure: search tree with one level per position (Position 1, Position 2, Position 3); the leaves are the conformational states of the system.]
56The pruned tree can be much smaller
[Figure: search tree with one level per position (Position 1, Position 2, Position 3); the leaves are the conformational states of the system.]
57DEE can be highly effective
Example of a 5-position design, with 306 choices per position, done in a simple MATLAB script:
Iteration 0: 306 306 306 306 306
Iteration 0: Structures 2.682916e12
Iteration 1: 143 198 145 42 33
Iteration 1: Structures 5.690265e09 from 2.682916e12
Iteration 1: Elapsed time is 96.461053 seconds.
Iteration 2: 52 83 55 11 8
Iteration 2: Structures 20889440 from 5.690265e09
Iteration 2: Elapsed time is 11.604577 seconds.
Iteration 3: 4 12 36 6 5
Iteration 3: Structures 51840 from 20889440
Iteration 3: Elapsed time is 0.625266 seconds.
Iteration 4: 4 12 35 6 5
Iteration 4: Structures 50400 from 51840
Iteration 4: Elapsed time is 0.148598 seconds.
Iteration 5: 4 12 35 6 5
Iteration 5: Structures 50400 from 50400
Iteration 5: Elapsed time is 0.045631 seconds.
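The "Structures" figures in this log are simply the product of the surviving rotamer counts at the five positions. A short check, with the counts copied from the log above (the variable names are illustrative):

```python
import math

# Surviving rotamer counts per position after each DEE iteration (from the log).
counts_by_iteration = {
    0: [306, 306, 306, 306, 306],
    1: [143, 198, 145, 42, 33],
    2: [52, 83, 55, 11, 8],
    3: [4, 12, 36, 6, 5],
    4: [4, 12, 35, 6, 5],
}

# Size of the remaining search space is the product of the counts.
structures = {it: math.prod(c) for it, c in counts_by_iteration.items()}

for it, size in structures.items():
    print(f"iteration {it}: {size:.6e} structures")
```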
58DEE Notes and Caveats
Dead-end elimination is not guaranteed to eliminate any choices; in that case, computational expense is used at zero gain. However, experience suggests that in the case of protein design, the algorithm is highly efficient.
For large design problems, even a highly efficient pruning can leave a tree which is too large to be searched by enumeration (such as depth-first search); for example, consider an original space of $10^{100}$ states, reduced to $10^{20}$.
An efficient bounding heuristic can be defined using similar tricks as discussed here; this can be used in the A* algorithm to find the global minimum within the remaining space.
The DEE criterion can be extended to pairs, triples, and n-tuples of positions. Most applications use only singles and pairs.
59Optimization as a tree search.
How do we choose a path through a tree, with the goal of reaching the global minimum state?
First, define an energy to be associated with each step through the tree.
This is the energy of placing rotamer choice R at position i, given that all positions in the tree above i have been selected. Note this is similar, but not identical, to the definition used with DEE.
At the leaf node, the state energy is then the sum of all individual path energies.
Thus, we wish to find the path with the lowest total.
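The step energy and leaf energy described above are not preserved as equations in the transcript. One way to write them, assuming a pairwise energy with self terms $E(R_i)$ and pair terms $E(R_i, R_j)$ and positions numbered from the root of the tree (notation chosen for this sketch), is:

```latex
% Energy of the step that places rotamer R_i at position i,
% given the choices already made at positions 1, ..., i-1
e_i(R_i) = E(R_i) + \sum_{j < i} E(R_i, R_j)

% Energy of a complete state (a leaf of the tree)
E_{\text{state}} = \sum_{i=1}^{N} e_i(R_i)
```

Each pair term is counted exactly once, at the deeper of its two positions, so the sum over steps reproduces the full pairwise energy.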
60Efficient search using a heuristic (A*)
The challenge is that we do not know the total until we have traversed the tree to a leaf! Instead, we use a heuristic: an estimate of the energy still to be added below a given node.
When choosing a path to take at a given node, we know the path we have already taken, and thus the true cost so far. We thus combine this information with the heuristic in deciding what step to take.
At a leaf node, the heuristic is 0, and this gives the total energy.
However, if we only use the heuristic, how do we know we get the correct solution?
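In the conventional A* notation (the symbols $f$, $g$ and $h$ are the standard ones, assumed here rather than taken from the slide), the combination described above reads:

```latex
% For a node n at depth i, with step energies e_k along the path taken so far
g(n) = \sum_{k \le i} e_k(R_k)      % true cost of the path already taken
h(n) \approx \text{energy still to be added below } n   % heuristic estimate
f(n) = g(n) + h(n)                  % quantity used to choose the next step

% At a leaf, h = 0, so f equals the total energy of the state
f(\text{leaf}) = g(\text{leaf}) = E_{\text{state}}
```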
61Guaranteed optimality with A*
1. Allow backtracking.
At every step, not only choose between the
possible paths from the current node, but also
the paths from all nodes which have been visited
in the past. In this way, if the heuristic turned
out to be poor for a given path (and the true
energy became large), a new path is chosen.
2. Use a heuristic that bounds the true solution.
Consider a heuristic that is guaranteed to be
lower than the true best energy down a path (an
optimistic prediction of the best energy). When a
leaf is reached, a comparison between the true
energy of that leaf and the heuristic energy of
the un-followed choice can be made. Since the
heuristic is an optimistic guess, if the true
energy is lower than the heuristic for all other
choices, it must be the global minimum.
Thus, it is possible to have a guarantee that the
solution found is the true global minimum!
62Defining the bounding heuristic
Recall our energy definitions
The optimal heuristic is
Now, we bound the second term
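The three equations on this slide are missing from the transcript. A reconstruction in the spirit of the Leach and Lemon formulation, assuming self terms $E(R_j)$, pair terms $E(R_j, R_k)$, and a node at depth $i$ (positions $1, \dots, i$ already assigned), could look like this; the exact form shown in the lecture may differ.

```latex
% Optimal heuristic: the true best energy obtainable from the unassigned positions
h^{*}(i) = \min_{R_{i+1}, \dots, R_N} \sum_{j > i}
  \Big[ E(R_j) + \sum_{k \le i} E(R_j, R_k) + \sum_{k > j} E(R_j, R_k) \Big]

% Feasible lower bound: move the minima inside the sums
h(i) = \sum_{j > i} \min_{R_j}
  \Big[ E(R_j) + \sum_{k \le i} E(R_j, R_k) + \sum_{k > j} \min_{R_k} E(R_j, R_k) \Big]
  \;\le\; h^{*}(i)
```

Because each inner minimum is taken independently, $h$ can only underestimate the true remaining energy, which is exactly the optimistic property that the optimality guarantee requires.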
63Overview of A* for protein design
Initialize a list of traveled nodes with the root.
While (no minimum leaf in list)
    Select minimum f of all paths from nodes in list.
    Add this new node to list.
    If (new node is leaf)
        Compare leaf energy to minimum f from list.
        If (leaf energy < min(f))
            Leaf is global minimum.
With our final heuristic, the second term involves a minimum over each choice at each following position, which is order $(N-i)R$, with an inner minimum over the same, order $(N-j)R$. Thus, the cost is less than $(N-i)^2R^2$.
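The outline above can be sketched as a best-first search. This is an illustrative version, not the lecture's implementation: it keeps the open paths in a priority queue (`heapq`) instead of rescanning a list, assumes a pairwise energy stored as `self_energy[i][r]` and `pair_energy[i][j][r][s]` (hypothetical layouts), and uses a lower-bound heuristic built from per-position minima. The random problem exists only to exercise the code.

```python
import heapq
import itertools
import random

def astar_design(self_energy, pair_energy):
    """Return (best_state, best_energy) using A* with an optimistic heuristic."""
    n = len(self_energy)
    rot = [range(len(e)) for e in self_energy]

    def step_energy(i, r, path):
        # Energy of placing rotamer r at position i given the choices above it.
        return self_energy[i][r] + sum(
            pair_energy[i][k][r][path[k]] for k in range(i))

    def heuristic(path):
        # Lower bound on the energy still to be added below this node.
        depth = len(path)
        bound = 0.0
        for j in range(depth, n):
            bound += min(
                self_energy[j][r]
                + sum(pair_energy[j][k][r][path[k]] for k in range(depth))
                + sum(min(pair_energy[j][k][r][s] for s in rot[k])
                      for k in range(j + 1, n))
                for r in rot[j])
        return bound

    frontier = [(heuristic(()), 0.0, ())]          # entries are (f, g, path)
    while frontier:
        f, g, path = heapq.heappop(frontier)       # open path with the lowest f
        if len(path) == n:                         # a leaf with minimal f is optimal
            return path, g
        i = len(path)
        for r in rot[i]:
            new_path = path + (r,)
            new_g = g + step_energy(i, r, path)
            heapq.heappush(
                frontier, (new_g + heuristic(new_path), new_g, new_path))
    return None

def random_problem(n_positions=4, n_rotamers=4, seed=1):
    """Made-up energies for demonstration; not real force-field values."""
    rng = random.Random(seed)
    self_e = [[rng.uniform(-5, 5) for _ in range(n_rotamers)]
              for _ in range(n_positions)]
    pair_e = [[None] * n_positions for _ in range(n_positions)]
    for i in range(n_positions):
        for j in range(i + 1, n_positions):
            table = [[rng.uniform(-1, 1) for _ in range(n_rotamers)]
                     for _ in range(n_rotamers)]
            pair_e[i][j] = table
            pair_e[j][i] = [list(col) for col in zip(*table)]  # transpose
    return self_e, pair_e

def total_energy(state, self_e, pair_e):
    n = len(state)
    return (sum(self_e[i][state[i]] for i in range(n))
            + sum(pair_e[i][j][state[i]][state[j]]
                  for i in range(n) for j in range(i + 1, n)))

self_e, pair_e = random_problem()
best_state, best_energy = astar_design(self_e, pair_e)
print("A* state:", best_state, "energy:", round(best_energy, 4))
```

Popping the lowest-f entry from the queue plays the role of "select minimum f of all paths from nodes in list", and keeping every unexpanded path in the queue is what allows the backtracking that the optimality argument depends on.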
64A* Notes and Caveats
Performance of A* is not guaranteed; in the worst case, the entire tree must be enumerated to find the solution. Again, however, experience suggests that the algorithm is highly efficient for protein design.
In many cases, we wish not only to have the global minimum, but all solutions within a cutoff of the minimum. A* can be adapted to solve this problem as well, but care must be taken in the first step of pruning with DEE: the elimination criterion must be modified to prevent elimination of low, but not minimum, energy states.
65References
- Protein design as an inverse problem
- Pabo. Nature 301 200 (1983).
- Drexler. PNAS 78 5275-5278 (1981).
- Examples of successful protein design
- Dahiyat and Mayo. Science 278 82-87 (1997).
- Looger, Dwyer, Smith and Hellinga. Nature 423 185-190 (2003).
- Development of the Dead-End Elimination algorithm
- Desmet, DeMaeyer, Hazes and Lasters. Nature 356 539-542 (1992).
- Goldstein. Biophys. J. 66 1335-1340 (1994).
- Gordon and Mayo. J. Comput. Chem. 19 1505-1514 (1998).
- Formulation of A* for protein design
- Leach and Lemon. Proteins 33 227-239 (1998).