Title: Lecture 10 protein structure prediction
1Lecture 10 protein structure prediction
2A protein sequence
3A protein sequence
>gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana]
MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSSASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL
DSARSSFSVALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTNKSSVFPSPGTPTYLHSMQKGW
SSERVPLRSNGGRSPPNAGFLPLYSGRTVPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYSLY
SPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPSMARSVSIHGCSETLASSSQDDIHESMKDAATDA
QAVSRRDMATQMSPEGSIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWSKKHRGLYHGNGSKM
RDHVHGKATNHEDLTCATEEARIISWENLQKAKAEAAIRKLEKYFPQMKLEKKRSSSMEKIMRKVKSAEKRAEEMRRSVL
DNRVSTASHGKASSFKRSGKKKIPSLSGCFTCHVF
4Protein Structure
Heparin docking. Red: heparin; blue: central domain; yellow: C-terminal domain
5A Protein Structure
Figure labels: beta-sheet, alpha-helix, loop, core
6Domain and Folds
- Domain: a discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
- Most proteins have multiple domains.
- The core 3D structure of a domain is called a fold. There are only a few thousand possible folds.
7Protein Similarity Level
- Family: the proteins in the same family are homologous at the sequence level.
- Superfamily: all members of the superfamily should have the same overall domain architecture, i.e., the same domains in the same order.
- Fold: the folds of two domains are similar.
8Protein Folding Problem
- A protein folds into a unique 3D structure under physiological conditions.
- Lysozyme sequence
- KVFGRCELAA AMKRHGLDNY
- RGYSLGNWVC AAKFESNFNT
- QATNRNTDGS TDYGILQINS
- RWWCNDGRTP GSRNLCNIPC
- SALLSSDITA SVNCAKKIVS
- DGNGMNAWVA WRNRCKGTDV
- QAWIRGCRL
9Relevance of Protein Structure in the Post-Genome Era
Diagram labels: structure, medicine, sequence, function
10Structure-Function Relationship
- A certain level of function can be found without structure, but a structure is key to understanding the detailed mechanism.
- A predicted structure is a powerful tool for function inference.
Trp repressor as a function switch
11Structure-Based Drug Design
- Structure-based rational drug design is still
a major method for drug discovery.
HIV protease inhibitor
12Protein Structure Prediction
- Structure
- Traditional experimental methods
- X-Ray or NMR to solve structures
- generate only a few structures per day worldwide
- cannot keep pace with new protein sequences
- Strong demand for structure prediction
- more than 30,000 human genes
- 10,000 genomes will be sequenced in the next 10 years.
- Unsolved problem after two decades of effort.
13Ab initio Structure Prediction
- An energy function to describe the protein
- bond energy
- bond angle energy
- dihedral angle energy
- van der Waals energy
- electrostatic energy
- Minimize the function and obtain the structure.
- Not practical in general
- Computationally too expensive
- Accuracy is poor
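To make "write down an energy function, then minimize it" concrete, here is a minimal Python sketch on a toy chain of four beads. It is not a real force field: it keeps only a harmonic bond term and a Lennard-Jones (van der Waals) term, and it omits the angle, dihedral, and electrostatic terms listed above. All constants (`k`, `r0`, `eps`, `sigma`), the starting coordinates, and the function names are illustrative assumptions, not values from the lecture.

```python
import math

def bond_energy(coords, k=100.0, r0=1.0):
    """Harmonic bond term over consecutive beads (toy constants)."""
    return sum(k * (math.dist(coords[i], coords[i + 1]) - r0) ** 2
               for i in range(len(coords) - 1))

def vdw_energy(coords, eps=1.0, sigma=1.0):
    """Lennard-Jones term over beads that are not directly bonded."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):
            r = math.dist(coords[i], coords[j])
            total += 4.0 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
    return total

def total_energy(coords):
    return bond_energy(coords) + vdw_energy(coords)

def numerical_gradient(coords, h=1e-6):
    """Central finite differences; fine for a toy, far too slow for proteins."""
    work = [list(p) for p in coords]
    grad = []
    for p in work:
        g = []
        for d in range(3):
            old = p[d]
            p[d] = old + h
            e_plus = total_energy(work)
            p[d] = old - h
            e_minus = total_energy(work)
            p[d] = old
            g.append((e_plus - e_minus) / (2.0 * h))
        grad.append(g)
    return grad

def minimize(coords, steps=200, step_size=0.002):
    """Steepest descent; a step is kept only if it lowers the energy."""
    coords = [list(p) for p in coords]
    energy = total_energy(coords)
    for _ in range(steps):
        grad = numerical_gradient(coords)
        trial = [[x - step_size * g for x, g in zip(p, gp)]
                 for p, gp in zip(coords, grad)]
        trial_energy = total_energy(trial)
        if trial_energy < energy:
            coords, energy = trial, trial_energy
        else:
            step_size *= 0.5  # overshoot: retry with a smaller step
    return coords, energy

# Hypothetical starting geometry with stretched bonds.
start = [(0.0, 0.0, 0.0), (1.3, 0.0, 0.0), (2.6, 0.2, 0.0), (3.9, 0.0, 0.0)]
start_energy = total_energy(start)
final_coords, final_energy = minimize(start)
print(f"energy before: {start_energy:.3f}, after: {final_energy:.3f}")
```

Even this four-bead toy needs dozens of energy evaluations per step and only reaches a nearby local minimum, which hints at why the approach is listed as computationally too expensive for full proteins.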
14Template-Based Prediction
- Structure is better conserved than sequence
- A structure can accommodate a wide range of mutations.
- Physical forces favor certain structures.
- The number of folds is limited.
- Currently: 700
- Total: 1,000 to 10,000
TIM barrel
15Scope of the Problem
- 90% of new globular proteins share similar folds with known structures, implying the general applicability of comparative modeling methods for structure prediction
- General applicability of template-based modeling methods for structure prediction (currently 60-70% of new proteins, and this number is growing as more structures are solved)
- The NIH Structural Genomics Initiative plans to experimentally solve 10,000 unique structures and predict the rest using computational methods
16Homology Modeling
- The query sequence is aligned with the sequence of a known structure, usually sharing sequence identity of 30% or more.
- Superimpose the sequence onto the template, replacing equivalent sidechain atoms where necessary.
- Refine the model by minimizing an energy function.
- Applicable to 20% of all proteins.
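The sequence identity criterion in the first step can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the two strings are already aligned and of equal length, `-` marks a gap, and identity is counted over gap-free columns only (other conventions for the denominator exist). The function names, the example alignment, and the default threshold of 30 are illustrative.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percent of gap-free alignment columns with identical residues."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    columns = [(q, t) for q, t in zip(aligned_query, aligned_template)
               if q != gap and t != gap]
    if not columns:
        return 0.0
    matches = sum(q == t for q, t in columns)
    return 100.0 * matches / len(columns)

def is_homology_candidate(aligned_query, aligned_template, threshold=30.0):
    """True when identity reaches the (assumed) modeling threshold."""
    return percent_identity(aligned_query, aligned_template) >= threshold

# Hypothetical alignment: 8 identical residues in 9 gap-free columns.
identity = percent_identity("MTYKLILNGK", "MTYRLI-NGK")
print(round(identity, 1), is_homology_candidate("MTYKLILNGK", "MTYRLI-NGK"))
```

A template passing this check is only a candidate; the quality of the final model still depends on the alignment and the refinement step.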
17Concept of Threading
- Thread (align or place) a query protein sequence onto a template structure in an optimal way
- A good alignment gives an approximate backbone structure
Query sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Template set
Prediction accuracy: fold recognition / alignment
18Four Components of Threading
- Template library
- Scoring function
- Alignment
- Confidence assessment
19Core of a Template
Core secondary structures: alpha-helices and beta-strands
20Definition of Template
- Residue type / profile
- Secondary structure type
- Solvent accessibility
- Coordinates for C-alpha / C-beta
- RES 1 G 156 S 23 10.528 -13.223 9.932 11.977 -12.741 10.115
- RES 5 P 157 H 110 12.622 -17.353 10.577 12.981 -16.146 11.485
- RES 5 G 158 H 61 17.186 -15.086 9.205 16.601 -15.457 10.578
- RES 5 Y 159 H 91 16.174 -10.939 12.208 16.612 -12.343 12.727
- RES 5 C 160 H 8 12.670 -12.752 15.349 14.163 -13.137 15.545
- RES 1 G 161 S 14 15.263 -17.741 14.529 15.022 -16.815 15.733
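A template record of this kind can be read into a small data structure. The following Python sketch assumes a field layout that the slide does not spell out: a leading integer whose meaning is not given, the one-letter residue type, the residue number, a secondary structure code, a value taken to be solvent accessibility, and two coordinate triples. The class and field names are hypothetical. The slide says the coordinates are for C-alpha and C-beta but not which triple is which, so the sketch keeps them as `first_xyz` and `second_xyz`.

```python
import math
from dataclasses import dataclass

@dataclass
class TemplateResidue:
    flag: int           # leading integer; meaning not stated in the source
    residue: str        # one-letter residue type
    number: int         # residue number in the template
    sec_struct: str     # secondary structure code as written (e.g. H, S)
    accessibility: int  # assumed to be solvent accessibility
    first_xyz: tuple    # first coordinate triple
    second_xyz: tuple   # second coordinate triple

def parse_res(line):
    """Parse one 12-field RES record under the assumed layout."""
    tokens = line.split()
    if len(tokens) != 12 or tokens[0] != "RES":
        raise ValueError(f"not a 12-field RES record: {line!r}")
    xyz = [float(t) for t in tokens[6:]]
    return TemplateResidue(int(tokens[1]), tokens[2], int(tokens[3]),
                           tokens[4], int(tokens[5]),
                           tuple(xyz[:3]), tuple(xyz[3:]))

records = [
    "RES 1 G 156 S 23 10.528 -13.223 9.932 11.977 -12.741 10.115",
    "RES 5 P 157 H 110 12.622 -17.353 10.577 12.981 -16.146 11.485",
    "RES 5 G 158 H 61 17.186 -15.086 9.205 16.601 -15.457 10.578",
]
residues = [parse_res(r) for r in records]

# Distance between the second triples of consecutive residues.
spacing = [math.dist(a.second_xyz, b.second_xyz)
           for a, b in zip(residues, residues[1:])]
print([round(d, 2) for d in spacing])
```

For these three records the spacing between consecutive second triples comes out near 3.8 angstroms, the typical distance between neighboring C-alpha atoms, which suggests (but does not prove) that the second triple is the C-alpha position.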
21Energy (Score) Function
YKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEW
Pairwise energy (E_p): how preferable it is to put two particular residues nearby
Singleton energy (E_s): how well a residue fits a template position (sequence and structural environment)
Alignment gap penalty: E_g
Total energy: E_p + E_s + E_g
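As a rough illustration of how the three terms combine, the Python sketch below scores one fixed query-to-template alignment. It reads the total as the plain sum E_p + E_s + E_g, which is an assumption (a real scoring function may weight the terms). The score tables, the contact list, the gap cost, and the function name are all made-up toy values, and the gap term simply charges a constant per unaligned template position.

```python
def threading_energy(query, alignment, singleton, pairwise, contacts,
                     gap_cost=1.0):
    """Return (E_p, E_s, E_g, total) for one alignment.

    alignment[t] is the query index placed at template position t,
    or None when that template position is left unaligned.
    """
    # Singleton term: fit of each placed residue to its template position.
    e_s = sum(singleton[t][query[q]]
              for t, q in enumerate(alignment) if q is not None)
    # Pairwise term: residue pairs that land on contacting positions.
    e_p = 0.0
    for i, j in contacts:
        qi, qj = alignment[i], alignment[j]
        if qi is not None and qj is not None:
            pair = tuple(sorted((query[qi], query[qj])))
            e_p += pairwise.get(pair, 0.0)
    # Gap term: constant cost per unaligned template position (toy model).
    e_g = gap_cost * sum(q is None for q in alignment)
    return e_p, e_s, e_g, e_p + e_s + e_g

# Toy template with four positions: buried, exposed, exposed, buried.
buried = {"L": -1.0, "K": 0.5, "V": -0.8}
exposed = {"L": 0.4, "K": -0.6, "V": 0.3}
singleton = [buried, exposed, exposed, buried]
pairwise = {("L", "V"): -0.5, ("K", "L"): 0.2}
contacts = [(0, 3)]  # template positions 0 and 3 are spatially close

e_p, e_s, e_g, total = threading_energy(
    "LKV", [0, None, 1, 2], singleton, pairwise, contacts)
print(e_p, e_s, e_g, total)  # total is about -1.9 for these toy numbers
```

Lower totals mean a better fit in this convention; the alignment step of threading searches for the placement that minimizes this quantity.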
22Threading problem
- Threading: given a sequence and a fold (template), compute the optimal alignment score between the sequence and the fold.
- If we can solve the above problem, then
- Given a sequence, we can try each known fold and find the best fold that fits this sequence.
- Because there are only a few thousand folds, we can find the correct fold for the given sequence.
- Threading is NP-hard.
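The "try each known fold and keep the best" loop can be sketched in Python if the scoring is simplified. The version below drops the pairwise term and keeps only singleton scores plus a linear gap cost, so the optimal alignment can be found by ordinary dynamic programming; it is the pairwise term that makes the full problem hard. The two-template library, its score values, and the function names are hypothetical.

```python
def best_singleton_alignment(query, profile, gap_cost=1.0):
    """Minimum energy of a global alignment using singleton scores only.

    profile[j] maps a residue letter to its score at template position j.
    """
    n, m = len(query), len(profile)
    # dp[i][j]: best energy for the first i residues vs first j positions.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + profile[j - 1].get(query[i - 1], 0.0)
            dp[i][j] = min(match,
                           dp[i - 1][j] + gap_cost,   # skip a query residue
                           dp[i][j - 1] + gap_cost)   # skip a template position
    return dp[n][m]

def recognize_fold(query, library, gap_cost=1.0):
    """Score the query against every template and return the lowest energy."""
    scores = {name: best_singleton_alignment(query, profile, gap_cost)
              for name, profile in library.items()}
    return min(scores, key=scores.get), scores

# Hypothetical two-template library with four positions each.
helix_pos = {"A": -1.0, "L": -1.0, "E": -1.0, "V": 0.5, "I": 0.5, "T": 0.5}
strand_pos = {"V": -1.0, "I": -1.0, "T": -1.0, "A": 0.5, "L": 0.5, "E": 0.5}
library = {"helix_like": [helix_pos] * 4, "strand_like": [strand_pos] * 4}

best, scores = recognize_fold("VIVT", library)
print(best, scores)
```

With a library of a few thousand templates the outer loop is cheap; the cost of real threading sits in the alignment step once pairwise contacts are scored.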
23Computational Methods
- Branch and Bound.
- Integer Program.
- Use linear programming plus branch and bound.
25Blue Gene
- On December 6, 1999, IBM announced a $100 million research initiative to build the world's fastest supercomputer, "Blue Gene", to tackle fundamental problems in computational biology.
- More than one petaflop/s (1,000,000,000,000,000 floating point operations per second)