Title: Basics of Homology Modeling
1Basics of Homology Modeling MPACK Tutorial
Ovidiu Ivanciuc Computational Biology Sealy
Center for Structural Biology, HBCG
2Protein Sequences vs. Protein Structures
GenBank - 28,507,990,166 bases in 22,318,883 sequences (January 2003) http://www.ncbi.nlm.nih.gov/
Swiss-Prot - 143,790 protein sequences (release 42.9 of February 2, 2004)
TrEMBL - 1,075,779 protein sequences (release 25.9 of February 2, 2004) http://us.expasy.org/sprot/
PIR - 1,483,754 non-redundant protein sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, and PDB (release 1.39, January 19, 2004) http://pir.georgetown.edu/
UniProt (EBI, ExPASy, PIR) http://www.expasy.uniprot.org/index.shtml
PDB - 24,168 structures (February 3, 2004) http://www.rcsb.org/pdb/
3Protein Tertiary Structure Prediction
Ab initio modeling uses a force field (molecular mechanics) method and molecular dynamics (or other global minimization algorithms) to find the native state of a protein.
Molecular mechanics computes the molecular energy based on classical (Newtonian) mechanics and considers molecules as atoms bonded with elastic bonds.
Molecular Energy = Bond Energy + Angle Energy + Torsion Energy + Electrostatic Energy + Hydrogen Bonds Energy + Solvation Energy + SS Bridge Energy
See Folding@Home for large-scale, distributed-computing ab initio protein structure calculations: http://www.stanford.edu/group/pandegroup/folding/
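The "elastic bond" picture and the additive energy can be illustrated with a minimal Python sketch. The harmonic form 0.5 * k * (r - r0)^2 is an assumption made here for illustration (force fields differ in their exact functional forms and constants), and all numbers are hypothetical.

```python
def bond_energy(k, r, r0):
    """Harmonic (elastic spring) energy of one bond of length r.

    k is the force constant and r0 the equilibrium length. The 0.5 factor
    is a convention assumed here; some force fields fold it into k.
    """
    return 0.5 * k * (r - r0) ** 2


def molecular_energy(terms):
    """Total molecular energy as the plain sum of its named contributions."""
    return sum(terms.values())


# Hypothetical values in arbitrary energy units, for illustration only.
terms = {
    "bond": bond_energy(k=600.0, r=1.55, r0=1.53),
    "angle": 0.8,
    "torsion": 1.2,
    "electrostatic": -3.5,
    "hydrogen_bonds": -1.1,
    "solvation": -0.4,
    "ss_bridge": 0.0,
}
print(round(molecular_energy(terms), 3))
```

Each term is computed independently and then added, which is what makes the decomposition in the slide convenient for inspecting where strain in a model comes from.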
4Homology Modeling
A comparison of the shape of NMR and X-ray
structures from PDB shows that proteins can be
classified into families based on a limited set
of folding patterns. Also, proteins with the same
function generally have similar structures (although
exceptions are known). Usually, if two proteins
have an alignment with a sequence identity > 30%,
they have a similar folding pattern. These
empirical observations became the foundation of
homology modeling. Homology modeling: for a
target sequence, find a homologous PDB template
structure, make an optimum alignment between the
target and template sequences, and generate the
tertiary structure of the target using the
template geometry.
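A quick way to check the 30% rule of thumb on a candidate alignment is sketched below in Python. The definition used (identical residues divided by the number of columns where neither sequence has a gap) is one common convention and is an assumption here; servers may normalize differently. The function name is hypothetical.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of gap-free aligned columns that hold the same residue.

    Comparison is case-insensitive, so a lower-case c (SS-bonded cysteine
    in MPACK alignments) still matches an upper-case C.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


# Ten aligned columns taken from the Mal d 2 / 1AUN example in this tutorial.
print(percent_identity("NNCPNTVWPG", "NNCPYTVWAA"))
```

A short, well-conserved stretch like this one scores far above the value for the whole alignment, so the threshold should be applied to the full-length alignment and not to a local window.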
5Explore the Target Sequence
What is the secondary structure? Is it homologous
to other protein sequences? Is it homologous to
other protein structures? What is the best
sequence alignment between your target protein
and homologous PDB structures? Examine the
regions of insertions and deletions. Are they
located in the loop regions? On the surface? Is
the region hydrophobic or hydrophilic? The PDB
template might have functional sites and
established motifs. Does your target sequence have
the same features? If disulphide bridges are
present in the PDB template, are the cysteine
residues aligned?
6Homology Modeling with MPACK http://curie.utmb.edu/mpack/
7Homology Modeling with MPACK http://curie.utmb.edu/mpack/
Target sequence
BLAST / PSIBLAST
Multiple alignment - ClustalW
BLAST in PDB, secondary structure servers, fold recognition servers
MASIA / PCPMer
Identify the best PDB template and target-template sequence alignment
EXDIS: extract geometric constraints from the PDB template
DIAMOD: generate the target structure based on distance constraints
FANTOM: geometry optimization
GETAREA
MOLMOL
PROCHECK
CE - RMSD
DALI - RMSD
8Ribosomal Protein L30
SCOP Family: L30e/L7ae ribosomal proteins, d.79.3.1
1CK2 Saccharomyces cerevisiae
1H7M Thermococcus celer
Sequence identity 32% (32/100)
9Structural Classification of Proteins
SCOP - Structural Classification of Proteins http://scop.mrc-lmb.cam.ac.uk/scop/index.html
CATH - Class, Architecture, Topology and Homologous superfamily http://www.biochem.ucl.ac.uk/bsm/cath/
CE - Combinatorial Extension of the optimal path http://cl.sdsc.edu/ce.html
FSSP - Fold classification based on Structure-Structure alignment of Proteins http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
VAST - Vector Alignment Search Tool http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html
10Secondary structure prediction
PHD - http://maple.bioc.columbia.edu/pp/
Psipred - http://bioinf.cs.ucl.ac.uk/psiform.html
PROF - http://www.aber.ac.uk/phiwww/prof/index.html
Jpred - http://jura.ebi.ac.uk:8888/
11Fold Recognition Servers
BLAST - http://www.ncbi.nlm.nih.gov/BLAST/
3D-PSSM - http://www.sbg.bio.ic.ac.uk/3dpssm/html/ffrecog_simple.html
Fugue - http://www-cryst.bioc.cam.ac.uk/fugue/prfsearch.html
mGenTHREADER - http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
SAM-T02 - http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
FFAS03 - http://ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl
ESyPred3D - http://www.fundp.ac.be/urbm/bioinfo/esypred/
3D-JIGSAW - http://www.bmm.icnet.uk/servers/3djigsaw/
LOOPP - http://ser-loopp.tc.cornell.edu/cbsu/loopp.htm
Metaservers
@TOME - http://bioserv.cbs.cnrs.fr/HTML_BIO/frame_meta.html
GeneSilico - http://genesilico.pl/meta/
3D Jury - http://bioinfo.pl/Meta/
123D-PSSM Alignment between Mal d 2 and 1AUN
             1                                                   50
Mald2___PSS  CCCCCHHHHH HHHHHHHHHC CCCEEEEEEE CCCCCEECCC CCCCCCCCCC
Mald2___Seq  MMKSQVASLL GLTLAILFFS GAHAAKITFT NNCPNTVWPG TLTGDQKPQL
                                              NNCP TVW
d1aun___Seq  .......... .......... ...SGVFEVH NNCPYTVWAA ATPVGG....
d1aun___SS   .......... .......... ...CCEEEEE ECCCCCEEEE EECCCE....

             50                                                 100
Mald2___PSS  CCCCCCCCCC CEEEEECCCC C.CEEEEECC CCCCCCCCCE EEECCCCCCC
Mald2___Seq  SLTGFELASK ASRSVDAPSP W.SGRFWGRT RCSTDAAGKF TCETADCGSG
                G  L     S    AP       R WGRT  C  D AG    C T DCG G
d1aun___Seq  ...GRRLERG QSWWFWAPPG TKMARIWGRT NCNFDGAGRG WCQTGDCG.G
d1aun___SS   ...EEEECCC EEEEEECCCC EEEEEEEEEE EEEECCCCCE EEEECCCC.C
13mGenTHREADER Results for Mal d 2
Conf   Score  E-val   Epair  Esolv  AlnSc  Alen  DLen  Tlen  PDB_ID
CERT   0.824  2e-04  -195.3  -13.8  850.0   204   207   246  1thv00
LOW    0.558  0.194     3.1    6.1   60.0   240   319   246  1qubA0
LOW    0.542  0.291    -3.8    8.1   52.0   242   309   246  1dnrA0
LOW    0.518  0.528   -18.5    4.9   76.0    92   117   246  1d4vA0
LOW    0.509  0.660    60.5    7.9   63.0   152   312   246  1dnqA0
LOW    0.501  0.825    14.5    7.5   49.0   159   171   246  9wgaA0
LOW    0.499  0.862   -44.0   -0.5   60.0    89   120   246  1hfh00
LOW    0.494  0.970    64.4   13.8   39.0   238   729   246  2tmdA0
GUESS  0.487  1.159    19.1    6.3   45.0   181   374   246  1dykA0
GUESS  0.485  1.228    33.8    1.3   54.0   104   160   246  1extA0
14mGenTHREADER Alignment between Mal d 2 and 1THV
        ------------------------CEEEEEECCCCCEEEEEECCC--CEEEEEEEEECCC
1thv00  ------------------------ATFEIVNRCSYTVWAAASKGD--AALDAGGRQLNSG
Query   MMKSQXXXXXXXXXXXXXXXXXXXXKITFTNNCPNTVWPGTLTGDQKPQLSLTGFELASK
        CCCHHHHHHHHHHHHHHHHCCCCEEEEEEEECCCCCCCCCCCCCCCCCCCCCCCCCCCCC
                10        20        30        40        50        60

            40        50        60        70         80         90
        CEEEEECCCCCCCEEEEEEEEEEECCCCCEEEEECCCC-CCCCCCCCCC-CCCCEEEEEE
1thv00  ESWTINVEPGTNGGKIWARTDCYFDDSGSGICKTGDCG-GLLRCKRFGR-PPTTLAEFSL
Query   ASRSVDAPS-PWSGRFWGRTRCSTDAAGKFTCETADCGSGQVACNGAGAVPPATLVEITI
        CEEEEECCC-CCEEEEEECCCCCCCCCCCEEEECCCCCCCEEECCCCCCCCCCCEEEEEE
                 70        80        90       100       110
15MPACK Alignment Rules (1)
- Some fold recognition servers truncate the template or target sequences. Always check both sequences.
- For the PDB template, compare the sequence with the residues that have coordinates. They must be identical. If some template residues do not have PDB coordinates, put gaps (-) in the corresponding places in the alignment.
- Sometimes the alignment (especially from BLAST) will have X marking low-complexity regions in the target. You must edit it back to the original residues.
- Make sure that only letters and numbers appear in the target and template names.
- Gaps should be indicated only by - and not by .
- No spaces are allowed in the alignment block.
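These rules are easy to check mechanically before running MPACK. The following Python sketch is not part of MPACK; the function name and messages are hypothetical, and it only covers the rules on names, gap symbols, spaces, and X masking listed above.

```python
import re

# Names may contain only letters and numbers.
NAME_OK = re.compile(r"^[A-Za-z0-9]+$")


def check_alignment_rules(name, aligned_seq):
    """Return a list of rule violations for one named, aligned sequence."""
    problems = []
    if not NAME_OK.match(name):
        problems.append("name must contain only letters and numbers")
    if "." in aligned_seq:
        problems.append("gaps must be written as - and not as .")
    if any(ch.isspace() for ch in aligned_seq):
        problems.append("no spaces are allowed in the alignment block")
    if "X" in aligned_seq.upper():
        problems.append("X found: restore the original residues")
    return problems


print(check_alignment_rules("Mald2", "AAKITFTNNcPNTVWPG"))
print(check_alignment_rules("Mal_d_2", "MMKSQXXXX.KIT"))
```

The check that the template sequence matches the residues that actually have coordinates still has to be done against the PDB file itself.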
16MPACK Alignment Rules (2)
- If you want to model a cysteine SS bond, the target sequence must have c (instead of C) for the 2 cysteine residues involved in the SS bond.
- Only the 20 upper-case amino acid letters and c are accepted. Symbols such as X and Z must be excluded.
- Sometimes selenomethionine is indicated by X. Change it to M and edit the template PDB file (delete the SE line).
- Small fragments of less than 4 residues are not allowed. Translate them to the left or right, depending on the secondary structure.
- Use the program alignrep to generate an MPACK alignment from a hand alignment.
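A minimal Python sketch of checks for these rules follows. It is an illustration, not MPACK code: the function names are hypothetical, and a "fragment" is assumed to mean a maximal run of alignment columns with no gap in either sequence.

```python
# The 20 standard amino acids, plus c for SS-bonded cysteine and - for gaps.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY") | {"c", "-"}


def invalid_symbols(aligned_seq):
    """Return the sorted symbols that are not accepted in the alignment."""
    return sorted(set(aligned_seq) - ALLOWED)


def ss_cysteine_count_ok(target_seq):
    """SS bonds pair up cysteines, so the number of c must be even."""
    return target_seq.count("c") % 2 == 0


def short_fragments(target_seq, template_seq, min_len=4):
    """Return 1-based (start, end) column ranges of ungapped runs shorter than min_len."""
    if len(target_seq) != len(template_seq):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    start = None
    for col, (a, b) in enumerate(zip(target_seq, template_seq), start=1):
        gapped = a == "-" or b == "-"
        if not gapped and start is None:
            start = col                      # a new ungapped run begins
        elif gapped and start is not None:
            runs.append((start, col - 1))    # the run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, len(target_seq)))
    return [(s, e) for s, e in runs if e - s + 1 < min_len]


print(short_fragments("ACD-EFGHI--KL", "ACDEEFGHIKLKL"))
```

Ranges reported by `short_fragments` are the places where a gap should be shifted left or right by hand.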
17MPACK Input Alignment for Mal d 2 1AUN
CLUSTAL W (1.7) multiple sequence alignment

Mald2   AAKITFTNNcPNTVWPGTLTGDQKPQLSLTGFELASKASRSVDAPSPW-SGRFWGRTRcS
1aun    SGVFEVHNNCPYTVWAAATPVGG-------GRRLERGQSWWFWAPPGTKMARIWGRTNCN

Mald2   TDAAGKFTcETADcGSGQVAcNGAGAVPPATLVEITIAANGGQDYYDVSLVDGFNLPMSV
1aun    FDGAGRGWCQTGDCG-GVLECKGWG-KPPNTLAEYALNQFSNLDFWDISVIDGFNIPMSF

Mald2   APQ-GGTGEcKPSScPANVNKVcPAPLQVKAADGSVIScKSAcLAFGDSKYccTPPNNTP
1aun    GPTKPGPGKCHGIQCTANINGECPGSLRVPG------GCNNPCTTFGGQQYCCTQ-----

Mald2   ETcPPTEYSEIFEKQcPQAYSYAYDDKNSTFTcSG-GPDYVITFcP
1aun    GPCGPTELSRWFKQRCPDAYSYPQDDPTSTFTCTSWTTDYKVMFCP
18PDB Template File 1AUN: SS Bond and cis-Pro Information
SSBOND   1 CYS     10    CYS    205
SSBOND   2 CYS     52    CYS     62
SSBOND   3 CYS     67    CYS     73
SSBOND   4 CYS    121    CYS    193
SSBOND   5 CYS    126    CYS    176
SSBOND   6 CYS    134    CYS    144
SSBOND   7 CYS    148    CYS    157
SSBOND   8 CYS    158    CYS    163
CISPEP   1 THR     19    PRO     20          0        -0.10
CISPEP   2 PRO     79    PRO     80          0         0.00
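To decide which template SS bonds can be carried over to the target, the template residue numbers from the SSBOND records have to be translated through the alignment. The Python sketch below shows one way to do this. It assumes that the template residues are numbered consecutively from 1 with no breaks (as in the 1AUN input file shown in this tutorial); the function names and the toy alignment are hypothetical.

```python
def template_to_target_map(target_aln, template_aln):
    """Map template residue number -> (target residue number, target letter).

    Both sequences are numbered from 1, counting only non-gap positions.
    Columns with a gap in either sequence produce no entry.
    """
    mapping = {}
    target_num = 0
    template_num = 0
    for t_res, p_res in zip(target_aln, template_aln):
        if t_res != "-":
            target_num += 1
        if p_res != "-":
            template_num += 1
        if t_res != "-" and p_res != "-":
            mapping[template_num] = (target_num, t_res)
    return mapping


def transfer_ss_bonds(target_aln, template_aln, template_pairs):
    """Return target residue pairs for template SS bonds with both Cys aligned to Cys."""
    mapping = template_to_target_map(target_aln, template_aln)
    kept = []
    for first, second in template_pairs:
        hit_a = mapping.get(first)
        hit_b = mapping.get(second)
        if hit_a is None or hit_b is None:
            continue  # one partner is aligned to a gap
        if hit_a[1].upper() == "C" and hit_b[1].upper() == "C":
            kept.append((hit_a[0], hit_b[0]))
    return kept


# Toy alignment, for illustration only: template Cys 2 and Cys 5 form an SS bond.
print(transfer_ss_bonds("MAC-GTCA", "-ACDG-CA", [(2, 5)]))
```

The returned numbers are target residue labels, which is the numbering MPACK asks for when the SS bond residues are entered.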
19Bad Local Alignment from Fold Recognition Servers
3D-PSSM
Mald2___PSS  CCCCCCCCCC CHHCCCHHHC CCCCCCCCCC CCCCCCCCCC CCCEECCCCC
Mald2___Seq  ECKPSSCPAN VNKVCPAPLQ VKAADGSVIS CKSACLAFGD SKYCCTPPNN
                   C    VN  CPA L      DG     C  AC  F     YC
d1du5a__Seq  CSRGPRCAVD VNARCPAELR ...QDG...V CNNACPVFKK DEYC..CVGS
d1du5a__SS   CCCCCEECCC CCCCCCHHHE ...ECC...E ECCHHHHHCC HHHH..CCHH

mGenTHREADER
        EECCCCEECCCCCCCCCCEECCCCCCCCCCEEEE--CCCCCCCCEEECCCCEEEEEEECC
1extA0  RECESGSFTASENHLRHCLSCSKCRKEMGQVEIS--SCTVDRDTVCGCRKNQYRHYWSEN
Query   --CP-----APLQVKAADGSVISCKSACLAFGDSKYCCTPPNNTPETCPPTEYSEIFEKQ
        --CC-----HHHEECCCCCCCCCCCCHHHCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHH
                   170       180       190       200       210
20PDB Template Rules (1)
- Download the PDB template from the PDB, http://www.rcsb.org
- Identify the chain which contains the template atoms, and delete all other lines, including the header.
- The template residues in the PDB file and in the alignment file must match exactly.
- In case of selenomethionine, rename MSE to MET and delete the SE line.
- The PDB file should end with a line containing TER followed by a line containing END.
- The file name should end with the extension .pdb
- If the PDB file consists only of a C-alpha trace (only CA atoms), use a program like MAXSPROUT (http://jura.ebi.ac.uk:8181/holm/dali_align.cgi?mode=maxsprout) to generate the geometry of the main chain atoms.
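Most of this clean-up can be scripted. The Python sketch below is an illustration and not an MPACK tool: the function name is hypothetical, the records in `example` are made up, and dropping every HETATM record other than selenomethionine (waters, ligands, ions) is an assumption that goes slightly beyond the rules above. Column positions follow the fixed-column PDB layout (residue name in columns 18-20, chain in column 22).

```python
def clean_template(pdb_lines, chain):
    """Keep one chain's atoms, convert MSE to MET, and close with TER and END."""
    kept = []
    for line in pdb_lines:
        record = line[:6]
        if record not in ("ATOM  ", "HETATM"):
            continue                          # header and other records are dropped
        if len(line) < 22 or line[21] != chain:
            continue                          # atom belongs to another chain
        res_name = line[17:20]
        if record == "HETATM" and res_name != "MSE":
            continue                          # assumption: discard waters and ligands
        if res_name == "MSE":
            if line[12:16].strip() == "SE":
                continue                      # delete the selenium line
            line = "ATOM  " + line[6:17] + "MET" + line[20:]
        kept.append(line.rstrip("\n"))
    return kept + ["TER", "END"]


# Hypothetical records laid out in PDB fixed columns, for illustration only.
example = [
    "HEADER    HYPOTHETICAL EXAMPLE",
    "ATOM      1  N   SER A   1      37.180   6.414   1.698  1.00 24.44           N",
    "HETATM    2  CA  MSE A   2      36.584  10.340   4.342  1.00 13.21           C",
    "HETATM    3 SE   MSE A   2      36.029  11.234   3.235  1.00 20.13          SE",
    "ATOM      4  N   GLY B   1      36.547   8.903   4.125  1.00 18.51           N",
    "HETATM    5  O   HOH A 301      14.735  13.373  -3.717  1.00 28.48           O",
]
for line in clean_template(example, chain="A"):
    print(line)
```

After cleaning, the residue sequence of the file should be compared once more with the template row of the alignment, since the two must match exactly.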
21PDB Template Rules (2)
- Read the header of the PDB file to find out whether it contains SSBOND records. Cysteine residues are highly conserved, so if the template has SS bonds and the same cysteine residues are aligned in the target-template alignment, then most probably the target has an SS bond between the corresponding residues.
- In some cases, atoms in a residue have 2 alternative positions. In 1EJG (crambin), residue 7 in chain A, ILE, has 2 positions:
ATOM    101  CA AILE A   7       8.829   2.039  13.300  0.55  2.58           C
ATOM    102  CA BILE A   7       9.104   2.209  13.197  0.45  2.14           C
ATOM    103  C  AILE A   7       7.559   2.141  12.460  0.55  2.53           C
ATOM    104  C  BILE A   7       7.839   2.105  12.369  0.45  2.15           C
ATOM    105  O  AILE A   7       7.573   2.124  11.205  0.55  2.78           O
ATOM    106  O  BILE A   7       7.990   2.102  11.154  0.45  2.53           O
- Delete the atoms of one conformer (AILE or BILE) and rename the remaining residue to ILE. See also 1QKG for a non-standard residue labeling.
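Removing alternative positions by hand is error-prone for large files, so a small filter can help. The Python sketch below is an illustration with a hypothetical function name. It relies on the alternate-location label sitting in column 17 of ATOM/HETATM records (index 16 in Python), keeps conformer A by default, and blanks the label so that AILE becomes ILE.

```python
def keep_one_conformer(pdb_lines, keep="A"):
    """Drop atoms of other alternate locations and blank the label of the kept one."""
    cleaned = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) > 16:
            alt_loc = line[16]
            if alt_loc not in (" ", keep):
                continue                          # atom of a discarded conformer
            line = line[:16] + " " + line[17:]    # AILE -> ILE
        cleaned.append(line)
    return cleaned


# The 1EJG records quoted above.
records = [
    "ATOM    101  CA AILE A   7       8.829   2.039  13.300  0.55  2.58           C",
    "ATOM    102  CA BILE A   7       9.104   2.209  13.197  0.45  2.14           C",
    "ATOM    103  C  AILE A   7       7.559   2.141  12.460  0.55  2.53           C",
    "ATOM    104  C  BILE A   7       7.839   2.105  12.369  0.45  2.15           C",
    "ATOM    105  O  AILE A   7       7.573   2.124  11.205  0.55  2.78           O",
    "ATOM    106  O  BILE A   7       7.990   2.102  11.154  0.45  2.53           O",
]
for line in keep_one_conformer(records):
    print(line)
```

Keeping A is a simple default; choosing the conformer with the higher occupancy (here also A, 0.55 versus 0.45) is an equally reasonable policy.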
22MPACK Input PDB File 1AUN
ATOM      1  N   SER     1      37.180   6.414   1.698  1.00 24.44           N
ATOM      2  CA  SER     1      37.217   6.785   3.131  1.00 23.97           C
ATOM      3  C   SER     1      37.348   8.302   3.247  1.00 22.37           C
ATOM      4  O   SER     1      38.140   8.912   2.525  1.00 25.76           O
ATOM      5  CB  SER     1      38.422   6.113   3.792  1.00 26.22           C
ATOM      6  OG  SER     1      38.277   6.114   5.196  1.00 44.20           O
ATOM      7  N   GLY     2      36.547   8.903   4.125  1.00 18.51           N
ATOM      8  CA  GLY     2      36.584  10.340   4.342  1.00 13.21           C
ATOM      9  C   GLY     2      36.029  11.234   3.235  1.00 20.13           C
ATOM     10  O   GLY     2      36.090  12.454   3.356  1.00 18.26           O
. . . . . . . . . . . .
ATOM   1576  O   PRO   206      14.735  13.373  -3.717  1.00 28.48           O
ATOM   1577  CB  PRO   206      15.756  13.247  -0.497  1.00 25.18           C
ATOM   1578  CG  PRO   206      16.485  14.010   0.541  1.00 29.06           C
ATOM   1579  CD  PRO   206      15.985  15.405   0.356  1.00 21.30           C
TER
END
23Executing MPACK (1)
Telnet to curie.utmb.edu, log in to your account (model1, ..., model5), and put the main alignment and the template PDB file in a folder. Type mpack to start MPACK, and answer the questions:
Type core name of the output file ?
  The name will be used to create the output files.
Do you have a topology file (y/n) ?
  Usually, n.
Do you have a template to extract constraints from ?
  y for homology modeling.
Name of the seq. align. file ?
  Enter the alignment file name, with the extension .aln. It must be in ClustalW format.
Enter the name of the pdb file ?
  Enter the PDB file name, with the extension .pdb.
24Executing MPACK (2)
Enter interval (deg) for PSI/PHI (default 10)
  Decrease the value if you want tighter constraints (the model will be more similar to the PDB template; do this when there is a high identity in the alignment).
Enter interval (deg) for OMEGA (default 5)
  Upper/lower intervals for the OMEGA dihedral angle. Note: currently, if the residues do not match in the alignment and the template contains CISPEP records, you must manually edit the OMEGA angle (ex.aco) appropriately and rerun the diarun.sh script.
Number of distance constraints per atom (default 15)
  Specify the number of constraints you would like to extract per atom. Use 30 to 50 only for high identity in the alignment.
Enter threshold between upper and lower limits (default 0.5)
  Specify the threshold between the upper and lower distance limits. Increase the threshold if you would like more flexibility (when the identity between the template and target is low) and reduce it if the proteins are highly similar.
25Executing MPACK (3)
Do you like to specify fragments ?
  Usually answer n. Fragments are automatically extracted from the alignment by MPACK from the regions that do not contain gaps. The first and last residues of each fragment are deleted to allow flexibility.
Do you have another template to extract constraints from (y/n) ?
  Answer y if you have multiple templates.
Enter the starting residue number of the target sequence in the alignment file ?
  This question appears only for multiple templates.
Do you have disulphide bonds in the target ?
  If y, the following two questions will appear.
No. of SS bonds ?
  Enter the number of SS bonds (it must match the number of c pairs in the alignment).
26Executing MPACK (4)
Enter res. In 1 SSBND ? (RES1, RES2)
  Enter the residues of the first SSBOND. The numbers correspond to the residue labels of the target sequence, not of the template. The program will continue by asking for the residues of the remaining SSBONDs.
Do you want to run DIAMOD ?
  Answer y to model the target protein.
How many structures do you want to generate ?
  Answer 1 if you want just one model, or give the number of models you want to generate. If you want more than one model, MPACK will ask for a random seed number.
Enter how many cycles of SECODG do you want to run (1 if no SECODG) ?
  Usually answer with 1.
27MPACK Results (1)
If the core file name given is XXXX, then the following output files will be produced:
XXXXMPACK.LOG - log file containing execution details, MOLMOL scripts
XXXX_Yex.aln - reformatted alignment file, i.e. the master alignment has Y = 1
alignYre.pdb - renumbered PDB file to be used for MOLMOL
XXXXexdisrun.sh - script to run EXDIS, only for the master alignment
XXXXex.aco - angle constraints
XXXXex.upl - upper limit for interatomic distance
XXXXex.lol - lower limit for interatomic distance
XXXXex.ang - approximate starting structure for DIAMOD
XXXXdia.min - minimization file for DIAMOD
XXXXdiarun.sh - script to execute DIAMOD
XXXXdiaY.clol - corrected lower limit file, valid if you use SECODG > 1
28MPACK Results (2)
XXXXdiaY.cupl - corrected upper limit file, valid if you use SECODG > 1
XXXXdiaY.lol - modified lower limit file in round Y
XXXXdiaY.upl - modified upper limit file in round Y
XXXXdiaY.ovw - overview file containing the target function for round Y
XXXXdiaY.res - result, minimization details of DIAMOD cycle run Y
XXXXdiaYZZZ.cor - DG coordinate file for model number ZZZ generated in cycle Y
XXXXdiaYZZZ.ang - ANG file for model number ZZZ generated in cycle Y
Mscr_targ.seq - FASTA format of the target sequence
29DIAMOD .ovw File Overview of Ten Models
struct     target      upper limits           lower limits          van der Waals       torsion angles
          function     #     sum     max      #     sum     max      #    sum   max      #     sum     max
 1  10     430.55   3250  8634.9    9.04   3470  9378.6    9.19      0   51.0  0.48     15   840.4   175.0
 2   1     430.57   2590  6799.7    5.84   2716  7392.3    6.50      9   60.4  0.92     23   743.1   175.0
 3   6     446.35   4659            6.90   3128  8633.0    9.95      6   64.6  0.77     16   814.3   175.0
 4   4     452.54   4098            7.68   4191           10.17      7   66.8  1.46     16   695.2   175.0
 5   8     473.31   5060            8.74   4935            9.78     12   77.9  1.05     27   859.5   175.0
 6   9     478.30   4839           10.31   4967           13.10      9   79.0  0.92     23   846.3   175.0
 7   5     486.37   4697           11.23   5157           11.05     11   76.7  0.76     23  1015.8   175.0
 8   3     538.75   6054           12.42   5996           21.98     14   98.4  0.79     33  1047.9   175.0
 9   2     546.91   6664           11.82   7124           20.92     10   84.6  1.34     35  1082.6   175.0
10   7     549.48   6877           14.66   6791           24.49      5   78.8  0.64     35  1016.1   175.0
30FANTOM Input Files
DIAMOD .ang file - the .ang file for the best model
EXDIS .aco file - the angle constraints file
EXDIS .lol file - the lower limit for interatomic distances
EXDIS .upl file - the upper limit for interatomic distances
fantom.lst - contains the name of the DIAMOD .ang file
fantom_scr - script that controls the FANTOM execution
31FANTOM Script File fantom_scr
Change only the lines that appear below:
set ang_name T192m1_1qsm_Di.ang
set aco_name T192m1_1qsm_Ex.aco
set lol_name T192m1_1qsm_Ex.lol
set upl_name T192m1_1qsm_Ex.upl
set output_base T192m1_1qsm
set working_dir /home/people/model1/Prot_Model/Fantom
disulfide 10 222 58 68 73 80 128 211 133 194
. . .
loweight 4 0.5
upweight 4 0.5
dhweight 1.0 5.0
minimize 300 (number of iterations)
32FANTOM .su2 File
  Conf'l   Elctrc  H-Bond   Lennard  Solvat  Tors'n  Dislfd   Upper   Lower  Dihd'l  Grad't
4085.211   557.07  -18.13   1223.79    0.00  170.75  2151.7   7.E06   2.E06  26567.   1.E09
2284.354   535.07  -169.1   1006.37    0.00  407.64  504.40  1991.5  1372.5  1036.2  71907.
-114.554   372.49  -223.0   -823.13    0.00  530.99  28.106  556.05  300.09  816.70  6411.2
-480.890   353.04  -226.5   -1115.9    0.00  487.64  20.840  122.71  48.261  682.46  4404.7
-694.661   267.28  -246.0   -1154.8    0.00  418.40  20.504  138.25  90.551  521.05  2421.2
-783.787   259.86  -249.4   -1242.1    0.00  427.27  20.583  149.03  101.48  168.82  1385.3
-890.366   221.49  -257.2   -1313.9    0.00  442.76  16.472  138.06  75.461  221.38  458.26
-923.969   222.80  -256.7   -1317.8    0.00  411.37  16.306  142.26  74.709  226.21  0.0094
33Pathogenesis-Related PR-5 Proteins 1AUN and Mal d
2 (Model)
1AUN
Mal d 2
MOLMOL RMSD 0.496 Å
34Evaluate the Model
Compute the RMSD (root mean square deviation) with MOLMOL, CE and DALI
DALI http://www.ebi.ac.uk/dali/
CE - Combinatorial Extension of the optimal path http://cl.sdsc.edu/ce.html
Use PROCHECK to generate Ramachandran plots and perform other standard checks
PROCHECK http://www.biochem.ucl.ac.uk/roman/procheck/procheck.html
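For reference, the quantity these programs report can be written in a few lines of Python. The sketch below computes the RMSD between two coordinate sets that are already superimposed and listed in matching atom order; it does not perform the fitting step that MOLMOL, CE, and DALI carry out first. The function name and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points, without fitting."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared distance between the two equivalent atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical backbone positions, for illustration only.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
print(rmsd(model, template))
```

Because the value depends on which atoms are paired, RMSD figures from different programs are only comparable when the same residue ranges and atom types (for example, backbone atoms only) are used.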
35MOLMOL Script to Compute RMSD between the Model
and Template
ReadPdb Md2m4f_1.pdb
ReadPdb 1aun.pdb
DefPropAtom 'ab1' '12-22,32-47,50-73,77-83,87-121,124-148,157-172,180-212,215-223 bb'
DefPropAtom 'ab2' '22-22,25-40,44-67,70-76,79-113,117-141,144-159,162-194,198-206 bb'
SelectMol 'num 1'
ColorBond 1.000 0.000 0.000
SelectMol 'num 2'
ColorBond 0.000 0.000 1.000
SelectMol ''
Fit to_first 'ab1 ab2'
36Homework
Submit your protein sequence to fold recognition servers
Collect results
Compare and rank the alignments
Identify the best alignments and PDB template(s)
Prepare the MPACK input, i.e., alignment and PDB files