Title: Query sequence
1 Structure-sequence alignment
Structure is better preserved than sequence.
Query sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
[Figure: four labels, each reading "Me!"]
2 How can we match a sequence and a structure?
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Sequence: similar sequences take this structure (but remember that sequence is less well preserved than structure).
Pair interaction: how well do the amino acids (AAs) get along? (Positive hates positive? Maybe not?)
- More:
  - Secondary-structure prediction.
  - Secondary-structure constraints (β-strands forming β-sheets).
  - Etc.
Solvation: which AAs are buried?
3 GenTHREADER
"An Efficient and Reliable Protein Fold Recognition Method for Genomic Sequences", David T. Jones (1999)
"What a good presentation!", B. Raveh (2003)
4 GenTHREADER overview
Query sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Templates
- For each template (in the Brookhaven PDB):
  - Construct a sequence profile.
  - Align it with the query sequence.
  - Calculate structural parameters (to be continued).
  - Send the parameters to a well-trained NEURAL NETWORK (like PSIPRED).
- OUTPUT: match confidence and alignment.
5 STAGE 1: Building a profile for each template
- Start with the sequence of the template peptide: MTPAVTTYKLVINGKTLKGETTTKAVDAETAEKAFKQYANDNGVDGVWTYDDATKTFTVTC
- Run BLASTP on the OWL non-redundant protein sequence data bank, with the sequence as input.
- Take all sequences with E-value < 0.01.
- Align them using the MULTAL multiple sequence alignment method.
- Construct a sequence profile based on the BLOSUM 50 matrix.
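The filtering and profile steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the hits are assumed to be already aligned (the MULTAL step is not reproduced), `toy_substitution` is a hypothetical stand-in for a real BLOSUM 50 lookup, and the function names and example hits are invented for the demonstration.

```python
from collections import Counter

E_VALUE_CUTOFF = 0.01  # threshold quoted on the slide


def filter_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep the aligned sequences whose E-value is below the cutoff."""
    return [seq for seq, e_value in hits if e_value < cutoff]


def toy_substitution(a, b):
    """Hypothetical stand-in for BLOSUM 50: +5 for a match, -1 otherwise."""
    return 5 if a == b else -1


def build_profile(aligned, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """For each alignment column, score every residue type against the
    residues observed in that column (frequency-weighted average)."""
    profile = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != "-")  # ignore gaps
        total = sum(counts.values())
        scores = {}
        for aa in alphabet:
            if total == 0:
                scores[aa] = 0.0
            else:
                weighted = sum(n * toy_substitution(aa, obs) for obs, n in counts.items())
                scores[aa] = weighted / total
        profile.append(scores)
    return profile


# Invented, pre-aligned BLASTP-style hits: (aligned sequence, E-value).
hits = [("MTYKL", 1e-30), ("MTYRL", 1e-5), ("MSY-L", 0.005), ("AAAAA", 3.0)]
aligned = filter_hits(hits)
profile = build_profile(aligned)
```

Each profile position ends up as a table of 20 scores, so a conserved column rewards its own residue strongly while a variable column gives flatter scores.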
6 STAGE 2: Align sequence with a profile
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
SCORE ?
Length of query sequence ?
Length of alignment itself ?
Length of template profile ?
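A minimal sketch of how the four quantities on this slide could be produced. It assumes a plain global dynamic-programming alignment with a linear gap penalty, which is a simplification and not necessarily the alignment scheme GenTHREADER itself uses; the gap value, the default mismatch score and the toy profile are all hypothetical.

```python
def align_to_profile(query, profile, gap=-4.0, default=-1.0):
    """Global DP alignment of a sequence against a profile.

    profile is a list of dicts mapping residue -> score for that position.
    Returns (alignment score, number of aligned residue pairs).
    """
    n, m = len(query), len(profile)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], move[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        score[0][j], move[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = profile[j - 1].get(query[i - 1], default)
            options = [
                (score[i - 1][j - 1] + match, "diag"),  # align residue to position
                (score[i - 1][j] + gap, "up"),          # gap in the profile
                (score[i][j - 1] + gap, "left"),        # gap in the query
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Trace back and count residue-to-position matches.
    i, j, aligned_pairs = n, m, 0
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            aligned_pairs += 1
            i -= 1
            j -= 1
        elif step == "up":
            i -= 1
        else:
            j -= 1
    return score[n][m], aligned_pairs


# Toy profile built from a six-residue template: 5.0 for the template residue.
toy_profile = [{aa: 5.0} for aa in "MTYKLV"]
query = "MTYLV"
alignment_score, alignment_length = align_to_profile(query, toy_profile)

# The four quantities listed on the slide.
features = [alignment_score, len(query), alignment_length, len(toy_profile)]
```

Here the query lacks the template's K, so the best alignment has five matches and one gap.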
7 STAGE 3: Calculate (some) structural parameters
In stage 2, the sequence was aligned to a profile
of the structure.
The aligned sequence is now imposed on the 3D
structure of the template, and used for ENERGY
POTENTIALS calculation.
8 STAGE 3: Structural parameters (cont.)
E-Pair (pair interaction potential)
- An energy potential for the probability of the interactions observed in this structure.
- Distance and sequence separation between certain atoms of two different amino acids are measured (Cβ-Cβ, Cβ-N, Cβ-O, etc.).
- Statistics of known structures were gathered and weighted.
- The observed interactions are compared to the statistics.
- An energy potential is calculated.
- In essence: the smaller E-Pair, the better.
[Figure labels: aa 39, aa 157]
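The idea of turning observed-versus-expected statistics into an energy can be sketched with a plain inverse-Boltzmann form. This is a simplified assumption, not the exact weighted formula used by GenTHREADER; the frequencies, pair names, bin index and the value of RT are invented for illustration.

```python
import math

RT = 0.582  # kcal/mol at roughly 293 K; assumed value for illustration


def pair_energy(f_pair, f_reference):
    """Energy of one distance bin: negative when this residue pair is seen
    at that distance more often than pairs in general."""
    return -RT * math.log(f_pair / f_reference)


def e_pair(observed, pair_freqs, reference_freqs):
    """Sum the bin energies over all residue pairs observed in a model.

    observed: list of (pair_type, distance_bin) tuples.
    """
    return sum(
        pair_energy(pair_freqs[(pair, b)], reference_freqs[b])
        for pair, b in observed
    )


# Invented statistics for a single short-distance bin (bin 0).
reference_freqs = {0: 0.15}
pair_freqs = {("LEU-VAL", 0): 0.30, ("LYS-LYS", 0): 0.05}

favourable = e_pair([("LEU-VAL", 0)], pair_freqs, reference_freqs)
unfavourable = e_pair([("LYS-LYS", 0)], pair_freqs, reference_freqs)
```

With these toy numbers the hydrophobic contact lowers the energy and the close positive-positive contact raises it, which matches "the smaller E-Pair, the better".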
9 STAGE 3: Structural parameters (cont.)
E-Solv (solvation potential)
- Degree of burial (DOB) for an amino acid: the number of other Cβ atoms located within 10 Å of the residue's Cβ atom.
- In general, hydrophobic amino acids like to be buried, safely away from water.
- Hydrophilic amino acids might like the outside world better.
- Each amino acid's DOB is calculated.
- It is compared to the statistical occurrence.
- $\Delta E_{solv}(AA, r) = -RT \ln\left( f(AA, r) / f(r) \right)$
[Figure: Cβ atoms within 10 Å of a residue's Cβ atom]
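A small sketch of the two steps above: counting the degree of burial from Cβ coordinates, then applying the formula. The coordinates, frequencies and the value of RT are invented for illustration, and whether the 10 Å boundary itself counts as "within" is an assumption.

```python
import math

RT = 0.582  # kcal/mol at roughly 293 K; assumed value for illustration


def degree_of_burial(cb_coords, index, radius=10.0):
    """Number of other Cβ atoms within `radius` Å of residue `index`'s Cβ."""
    center = cb_coords[index]
    return sum(
        1
        for k, other in enumerate(cb_coords)
        if k != index and math.dist(center, other) <= radius
    )


def e_solv(f_aa_r, f_r):
    """Solvation energy for one amino acid type at burial level r:
    -RT ln( f(AA, r) / f(r) )."""
    return -RT * math.log(f_aa_r / f_r)


# Invented Cβ coordinates in Å.
cb_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 9.0, 0.0), (20.0, 0.0, 0.0)]
dob_first = degree_of_burial(cb_coords, 0)
dob_last = degree_of_burial(cb_coords, 3)

# Invented frequencies: this amino acid is twice as common at this burial
# level as amino acids in general, so the energy is favourable (negative).
energy = e_solv(0.2, 0.1)
```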
10 STAGE 4: Send it all to the (trained) neural network
The output is a score between 0 and 1, translated to a confidence level (Low, Medium, High, Certain).
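To make the last stage concrete, here is a toy forward pass through a small feed-forward network followed by a mapping from score to confidence label. Everything numeric here is hypothetical: the weights, the number of hidden units, the input scaling and the cutoffs are invented, since the slides do not give the trained values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden layer with sigmoid units; the output is a score in (0, 1)."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)


def confidence_label(score, cutoffs=((0.9, "CERTAIN"), (0.7, "HIGH"), (0.5, "MEDIUM"))):
    """Map a score to a label; the cutoff values are hypothetical."""
    for cutoff, label in cutoffs:
        if score >= cutoff:
            return label
    return "LOW"


# Invented, pre-scaled inputs: alignment score, three lengths, E-Pair, E-Solv.
features = [0.8, 0.5, 0.6, 0.7, -0.3, -0.2]
hidden_weights = [[1.0, 0.2, 0.1, 0.1, -1.5, -1.0], [0.5, -0.3, 0.4, 0.0, -0.8, -0.6]]
hidden_biases = [0.0, 0.1]
net_score = forward(features, hidden_weights, hidden_biases, [2.0, 1.5], -1.0)
label = confidence_label(net_score)
```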
11 See this page on the web
12 Who trains the neural network?
- CAT numbers were used for comparing pairs.
- 9169 chain pairs.
- 383 pairs shared a common domain fold (and should give a positive answer).
- The network was trained with these pairs.
13 Neural network: black box?
14 Confidence assignment
[Figure labels: LOW, MEDIUM, HIGH, CERTAIN]
15 GenTHREADER: what to do with it?
- Results on a classic test set of 68 proteins.
- High true-positive rate: 73.5% correctly recognized, 48.5% with CERTAIN.
- Extremely reliable: every CERTAIN prediction was correct.
- Fast automatic method.
- For 22 of 68 proteins, the alignment is over 50% accurate.
- Let's go analyze the Mycoplasma genitalium genome with it!
16 Whole genome analysis with GenTHREADER
Mycoplasma genitalium genome analysis: ONE DAY ONLY!
17 ORF MG276 of Mycoplasma genitalium: spotting a remote homologue
- MG276 is an adenine phosphoribosyltransferase (but this information is not given to GenTHREADER).
- 1HGX is a template of another phosphoribosyltransferase.
- It has only 10% sequence identity with our MG276!
- It was found by GenTHREADER as a CERTAIN match.
- E-Pair saved the situation!
- But how do we know it's true?
18 Ligand binding site of the 1HGX template
19 ORF MG276 of Mycoplasma genitalium: supporting evidence for 1HGX as a template
20 ORF MG353 of Mycoplasma genitalium: an ORF with no known function
- MG353: no homologues found in databases.
- 1HUE is a template of a histone-like protein.
- Very low sequence similarity with our MG353.
- It was found by GenTHREADER as a CERTAIN match.
- Striking similarity in the DNA-binding region, despite overall low sequence similarity.
21 GenTHREADER improvements (McGuffin, Jones, May 2003)
- PSI-BLAST, PSIPRED (secondary structures), and some more.
- Some results.
22 AB-INITIO FOLDING - ROSETTA (Simons et al. 1997, 1999; Bystroff & Baker 1998; Bonneau et al. 2001)
Prediction of a protein fold from scratch?
Method I: physically simulate protein folding.
Problem: CPU time. Practical only for short peptides.
APKFFRGGNWKMNGKRSLGELIHTLGDAKLSADTEVVCGIAPSITEKVVFQETKAIADNKDWSKVEVHESRIYGGSVTNCKELASQHDVDGFLVGGASLKPVDGFLHALAEGLGVDINAKH
Method II: check the probability of all possible conformations.
Problem: infinite search space.
Solution: use Mother Nature to decrease the search space.
23 Decreasing the search space using elements from short peptides
- Take fragments of short peptides (3 residues to 9 residues long).
- Join them together.
- Keep the secondary structures constant.
- Play with the angles of loop residues.
- RESULT: 200,000 decoy structures.
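The fragment-joining idea can be illustrated with a toy Monte Carlo loop that repeatedly pastes a fragment's backbone angles into a chain and keeps the change when a score improves (or occasionally when it does not). This is a rough sketch and not ROSETTA's actual procedure: the fragment library, the `toy_score` function, the temperature and the step count are all hypothetical, and the sketch does not model fixed secondary structure or separate loop moves.

```python
import math
import random


def toy_score(torsions):
    """Hypothetical energy: squared deviation of each (phi, psi) pair from
    ideal helix angles. A real method would use a physical or statistical score."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in torsions)


def assemble(length, fragments, steps=500, temperature=1000.0, seed=0):
    """Build one decoy by random fragment insertion with Metropolis acceptance."""
    rng = random.Random(seed)
    torsions = [(-120.0, 120.0)] * length  # start from an extended chain
    current = toy_score(torsions)
    best, best_score = list(torsions), current
    for _ in range(steps):
        frag = rng.choice(fragments)
        start = rng.randrange(length - len(frag) + 1)
        trial = torsions[:start] + frag + torsions[start + len(frag):]
        trial_score = toy_score(trial)
        delta = trial_score - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            torsions, current = trial, trial_score
            if current < best_score:
                best, best_score = list(torsions), current
    return best, best_score


# Invented fragment library: helix-like and strand-like backbone angles.
fragments = [
    [(-60.0, -45.0)] * 3,
    [(-120.0, 120.0)] * 3,
    [(-60.0, -45.0)] * 9,
]
decoy, decoy_score = assemble(12, fragments)
start_score = toy_score([(-120.0, 120.0)] * 12)
```

Running `assemble` with many different seeds would give a set of decoys, in the spirit of the 200,000 decoy structures mentioned above.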
24 In addition: I-Sites prediction
- 13 local-structure 3D motifs with sequence profiles.
- Strong independence of motifs (fold-initiation sites?).
- Complements secondary structure.
25 Find the correct fold for a given sequence (back to threading)
- P(sequence | structure):
  - Solvation.
  - Secondary structure vs. amino acid (proline in helix, etc.).
  - Pair interaction.
  - I-Sites prediction for this sequence (3D motifs): did not contribute to performance.
  - Etc.
- P(structure), sequence independent:
  - Secondary-structure packing.
  - Strand hydrogen bonding.
  - Strand assembly in sheets.
  - Structure compactness.
  - Frequency of I-Sites 3D motifs.
  - Etc.
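The two terms on this slide combine through Bayes' rule. Since P(sequence) is the same for every candidate structure of a given query, it can be dropped when ranking decoys:

```latex
P(\text{structure} \mid \text{sequence})
  = \frac{P(\text{sequence} \mid \text{structure}) \, P(\text{structure})}{P(\text{sequence})}
  \;\propto\; P(\text{sequence} \mid \text{structure}) \, P(\text{structure})
```

The first factor rewards structures that suit this particular sequence; the second rewards structures that look protein-like regardless of sequence.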
26 RESULTS in CASP4: Baker's a winner
27 We're done!