Title: Structure prediction
1Structure prediction
- Why do we need structure prediction ?
- Intellectual
- Practical
- Levinthal's paradox
- Anfinsen's experiments
- How it will be solved
- Physics
- Computer science
- The protein design problem
24 Basic Levels of Protein Structure and all
information is in the sequence
3What is homology
4Structures change linearly with ED
5Homology
- Similar sequence - Similar structure ?
6Examples of similar structure
7Zones
8Homology can be detected by Sequence Alignment
- Key aspect of sequence comparison is sequence alignment
- A sequence alignment maximizes the number of positions that are in agreement in two sequences
9Alignments
- Local alignment
- Global alignments
Global Alignments
LGPSTKDFGKISESREFDN
LNQLERSFGKINM-RLEDA
Local Alignments
----------FGKI----------
----------FGKI----------
10Dotplots
11Methods to align
- Optimal alignment
- Maximise similarities
- Minimize gaps
- Score of an alignment
- Score for substitution
- Gap-opening and gap-extension costs
- Dynamic programming
- Finds optimal solution
- BLAST
- Heuristic, fast algorithm using indexes (hashes)
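The dynamic programming idea listed above can be sketched in a few lines. This is a minimal global alignment (Needleman-Wunsch style) under simplifying assumptions: a flat match/mismatch score instead of a substitution matrix, and one linear gap cost instead of separate gap-opening and gap-extension costs. The function name and default scores are illustrative choices, not taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment of a and b with a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two lengths. That cost is the reason heuristic tools such as BLAST are used for database searches.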
12When are two sequences homologous
13Identities do not provide the best similarity
14Statistics of Sequence scores
- Local alignments
- Follows extreme value distribution
- Scores depend on log(length)
- E (or P-value)
- Global alignments
- Heuristics
- Randomize sequences
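For local alignments, the extreme value statistics above are usually written in the Karlin-Altschul form E = K m n exp(-lambda S), where m and n are the query and database lengths. The sketch below evaluates that formula. The default values of `lam` and `k` are placeholders chosen for illustration and should be treated as assumptions; real values depend on the scoring matrix and gap costs.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """E-value: expected number of chance local alignments scoring >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def score_for_evalue(e_value, query_len, db_len, lam=0.267, k=0.041):
    """Raw score needed to reach a given E-value; grows like log(m * n)."""
    return math.log(k * query_len * db_len / e_value) / lam


# Doubling the database size doubles E at a fixed score, while the score
# needed for E = 1 only increases by ln(2) / lambda.
print(expected_hits(60, 300, 1e8))
print(score_for_evalue(1.0, 300, 1e8))
print(score_for_evalue(1.0, 300, 2e8))
```

The last two lines illustrate the log(length) dependence: the score threshold moves slowly even when the search space grows a lot.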
15Statistical comparison of alignment scores
16How to improve alignments
- Use more evolutionary information
- Multiple alignments
- Profiles
- HMMs
- Profile-profile alignments
- Using additional information
- Structure
- Structural alignments
17Multiple sequence alignments
- Computationally intensive
- Heuristic methods
18Profiles can be used to detect distant homologs
- Extra information
- How to best use
- Different methods
- Patterns
- Evolutionary method
- Profile methods
- HMMs
- ANNs
19PSI-BLAST in a nutshell
- With a protein sequence as query, use BLAST to search a protein sequence database.
- Collapse significant local alignments (those with E-value less than or equal to a set threshold h) into a multiple alignment, using the residues of the query sequence as alignment-column placeholders.
- Abstract a position-specific score matrix from the multiple alignment.
- Search the database with the score matrix as query.
- Iterate a fixed number of times, or until convergence.
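The step "abstract a position-specific score matrix from the multiple alignment" can be illustrated with a small log-odds calculation. This is a simplified sketch, not the PSI-BLAST implementation: it assumes a uniform background frequency of 1/20, a fixed pseudocount, and no sequence weighting. The function names and the toy alignment are made up for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(alignment[0])):
        # Count residues in this column, ignoring gaps and unknown symbols
        counts = Counter(s[col] for s in alignment if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


toy_alignment = ["FGKI", "FGKI", "FGRI", "YGKI"]  # made-up example rows
profile = build_pssm(toy_alignment)
print(score_segment(profile, "FGKI"))
print(score_segment(profile, "AAAA"))
```

A conserved column (G in the second position here) gets a high score for the conserved residue and negative scores for everything else, which is what lets the profile find more distant homologs than a single query sequence.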
20Protein structure prediction (and other uses for
molecules in life in a computer)
- Secondary structure predictions
- Homology detection
- What is homology
- Why is it related to protein structure
- How does it work
- Simulations of folding
- What is physics ?
- Realistic simulations (folding_at_home)
- Smart simulations (rosetta_at_home)
21It's not that simple...
- Amino acid sequence contains all the information for 3D structure (experiments of Anfinsen, 1970s)
- But, there are thousands of atoms, rotatable bonds, solvent and other molecules to deal with...
22All the 3D information is in the sequence
23Levinthal Paradox
- Cyrus Levinthal, Columbia University, 1968
- Levinthal's paradox
- If we have 3 rotamers per residue, a 100-residue protein has 3^100 possible conformations. To search all of these takes longer than the age of the universe, but proteins fold in less than a second.
- Resolution: proteins have to fold through some directed process
- Goal is to understand the dynamics of this process
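The arithmetic behind the paradox is easy to check. The sketch below assumes a sampling rate of 10^13 conformations per second and an age of the universe of about 1.38 x 10^10 years; both numbers are assumptions added for illustration and do not come from the slides.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumed round figure


def levinthal_search(n_residues=100, states_per_residue=3, rate_per_second=1e13):
    """Return (number of conformations, years needed to visit them all)."""
    conformations = states_per_residue ** n_residues
    years = conformations / rate_per_second / SECONDS_PER_YEAR
    return conformations, years


conformations, years = levinthal_search()
print(f"{conformations:.2e} conformations")
print(f"{years:.2e} years for an exhaustive search")
print(f"{years / AGE_OF_UNIVERSE_YEARS:.2e} times the age of the universe")
```

Even with a generous sampling rate the exhaustive search is out of reach by many orders of magnitude, so folding cannot be a random search.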
24Old vs. New Views of folding
- Old
- Hierarchical view of protein folding
- Secondary structures form, then interact to form tertiary structures
- General order of events
- New
- Statistical ensembles of states
- Potential energy landscape
- Folding Funnel
25Two alternatives for structure prediction
- Simulation of protein folding
- Folding_at_home (Erik next week)
- Identification of lowest energy structure
- More successful (today)
- Several layers
- Secondary structure
- 3D-structure
26Secondary structure prediction
- AA preferences for different SS
- Pro
- Has no backbone NH group
- No H-bonds
- Prefers Coils
- Also in N-terminal part of helices and Beta-turns
- Gly (compared with Ala)
- No sidechain on Gly (more flexible)
- Polar groups in loops
- Additional H-bonds to backbone
27Amino acid preferences in coil
28Amino acid preferences in β-Strand
29Secondary structure preferences
- Cβ-branched AAs prefer sheets
- Entropic cost in helices of sidechain rotations
- Hydrophobic groups prefer SS-elements
- Negatively charged residues at N-terminal end of
helices due to dipole effect.
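Preferences of this kind are usually quantified as propensities: the fraction of an amino acid found in a given state, divided by the fraction of all residues in that state. A value above 1 means the residue is over-represented in that state. The sketch below uses a tiny made-up data set (H = helix, E = strand, C = coil); the sequences and assignments are invented for illustration only.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Propensity of each amino acid for one secondary structure state."""
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1
    n_state = sum(aa_in_state.values())
    if n_state == 0:
        raise ValueError(f"no residues in state {state!r}")
    overall = n_state / sum(aa_total.values())
    # (fraction of this amino acid in the state) / (fraction of all residues)
    return {aa: (aa_in_state[aa] / aa_total[aa]) / overall for aa in aa_total}


toy_sequences = ["AAEELLPG", "VIVIGPAA"]   # made-up sequences
toy_structures = ["HHHHHHCC", "EEEECCHH"]  # made-up assignments

print(propensities(toy_sequences, toy_structures, "C")["P"])
print(propensities(toy_sequences, toy_structures, "E")["V"])
print(propensities(toy_sequences, toy_structures, "H")["A"])
```

On a real data set the same calculation would show the trends listed above, for example Pro and Gly enriched in coil and the Cβ-branched residues enriched in strands.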
30Amino acid preferences in β-Strand
31Amino acid preferences in α-Helix
32Secondary structure preferences
- Cβ-branched AAs prefer sheets
- Entropic cost in helices of sidechain rotations
- Hydrophobic groups in SS-elements
- Polar in loops
- Negatively charged residues at N-terminal end of
helices due to dipole effect.
33Templates for helix, loops and sheet
34More elaborate templates
- Key residues
- Gly in turns
35Incorporating globular effects
36Example of SS predictions
37PhD (Rost & Sander, 1994)
38PhD-Input
39PhD-architecture
40PhD-predictions
41PhD summary
- First method with >70% Q3
- Correct length distributions
- Much better beta strand predictions
- Good correlation between score and accuracy
- Better predictions for larger multiple sequence
alignments
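Q3 is the percentage of residues whose three-state assignment (helix, strand, coil) is predicted correctly. A minimal sketch of the measure, assuming both strings use the same one-letter state alphabet (the letters H, E and C here are an assumption):

```python
def q3(predicted, observed):
    """Percentage of residues with the correct three-state assignment."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def per_state_accuracy(predicted, observed, state):
    """Percentage of observed residues in `state` that were predicted as `state`."""
    positions = [i for i, o in enumerate(observed) if o == state]
    if not positions:
        return None  # state does not occur in the observed structure
    hits = sum(predicted[i] == state for i in positions)
    return 100.0 * hits / len(positions)


print(q3("HHHHCCCC", "HHEECCCC"))
print(per_state_accuracy("HHHHCCCC", "HHEECCCC", "E"))
```

The per-state number is worth reporting next to Q3, because a method can reach a decent overall Q3 while still missing most strands.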
42Threading
- A priori prediction of the Interferon fold in 1985
- Good prediction of helices
43Prediction of interferon fold
44FR methodologies
453D profiles
46Threading or Fold recognition
VIFVLWGNAARQKCN LLFQTKHQHAVLACPH
47PROSA/THREADER
48How good is FR?
- LiveBench and CASP measure performance.
- E-values work reasonably well.
- In the real world, you might get a few percent more "hits" with FR compared to PSI-BLAST.
- Individual researcher vs. genome-wide analysis
- Structure information not necessary?
49Success of FR
50Alignments are not always perfect
51Does threading really work ?
- Evolutionary methods work better
- Secondary Structure Predictions might help
52AB-INITIO methods
- Simulate the process of folding
- Folding_at_home
- MD simulations of small peptides
- Find the lowest energy structure
- Does not simulate the process.
- Consequences of small energy gap
- Unrealistic to model exactly
- Easier to distinguish between Correct/Incorrect
than between Folded/non-folded.
53How Rosetta Works
- Minimize energy in the folded state
- Uses a combination of energy formulas based on the likelihood of particular structures, and the fitness of the sequence
- Side-chains simplified to a centroid located at center of mass of the side-chain
- Average of observed side-chain centroids in known structures
- Local sequence does not decide the local structure, it only biases the decision
- Non-local favorable conditions
- Buried hydrophobic fragments
- Paired β strands
- Specific side-chain interactions
54Rosetta clustering the models
- Compare models to each other with RMSD
- Models can come from different family members
- Cutoff varied to give 80-100 members in largest cluster
- The largest clusters are assumed to contain the best structures (attractors in folding space...?)
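The clustering step can be sketched as a greedy procedure: compute all pairwise RMSDs, take the model with the most neighbours within the cutoff as a cluster centre, remove that cluster, and repeat. This is an illustrative sketch, not Rosetta's code. It assumes the models are already superimposed and have the same atoms in the same order, so no optimal superposition is done before the RMSD.

```python
import math


def rmsd(model_a, model_b):
    """RMSD between two pre-superimposed coordinate lists of (x, y, z)."""
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(model_a, model_b)
    )
    return math.sqrt(total / len(model_a))


def cluster_models(models, cutoff):
    """Greedy clustering; returns [(centre index, member indices)], largest first."""
    n = len(models)
    dist = [[rmsd(models[i], models[j]) for j in range(n)] for i in range(n)]
    remaining = set(range(n))
    clusters = []
    while remaining:
        # The centre is the remaining model with the most neighbours
        centre = max(
            sorted(remaining),
            key=lambda i: sum(dist[i][j] <= cutoff for j in remaining),
        )
        members = sorted(j for j in remaining if dist[centre][j] <= cutoff)
        clusters.append((centre, members))
        remaining -= set(members)
    return clusters


toy_models = [  # made-up two-atom "models"
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)],
    [(0.0, 0.2, 0.0), (1.0, 0.2, 0.0)],
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
]
print(cluster_models(toy_models, cutoff=0.15))
```

To mimic the cutoff tuning described above, one could call `cluster_models` with increasing cutoffs until the first cluster reaches the desired size.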
55Recent improvements to Rosetta
- Refinement in HR Rosetta
- 1. Make small dihedral changes
- 2. Rebuild sidechains
- 3. Minimize (in dihedral space)
- 4. Evaluate energy
- 5. Go to 1
- 5 out of 16 small proteins < 1.5 Å
56Physics of Rosetta
- Is Rosetta physical ?
- What are the most important terms in globular and local free energies ?
- How do proteins really fold ?
- What do you think ?
57Designs
- Molten globule designs
- Regan 50
58Design of four-helical bundle (De Grado, 1991)
Molten Globule
59What characterizes a molten globule
- Compact
- Good secondary structure
- Not solid
- Sidechains not packed
- No cooperative folding
60Mayo method
- Automatic design
- 1. Take fold (backbone)
- 2. Take sequence (random)
- 3. Mutate sequence
- 4. Build sidechains
- 5. Calculate energy
- 6. Accept/reject
- 7. Go to 3
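The loop above (mutate, calculate energy, accept or reject, repeat) can be sketched as a small Monte Carlo search. Everything specific in this sketch is an assumption made for illustration: the backbone is reduced to a buried/exposed flag per position, the energy is a toy term that rewards hydrophobic residues at buried positions and polar residues at exposed ones, the side-chain building step is skipped, and a Metropolis rule decides acceptance. It is not the energy function or search procedure of the published method.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")  # assumed grouping for the toy energy


def toy_energy(sequence, burial):
    """-1 per position where residue type matches the burial, +1 otherwise."""
    return sum(
        -1 if ((aa in HYDROPHOBIC) == buried) else 1
        for aa, buried in zip(sequence, burial)
    )


def design(burial, steps=2000, temperature=0.5, seed=0):
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in burial]  # take random sequence
    energy = toy_energy(seq, burial)
    for _ in range(steps):
        pos = rng.randrange(len(seq))                # mutate sequence
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)
        new_energy = toy_energy(seq, burial)         # calculate energy
        delta = new_energy - energy
        # Accept downhill moves always, uphill moves with Boltzmann probability
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
        else:
            seq[pos] = old                           # reject: undo the mutation
    return "".join(seq), energy


# Hypothetical burial pattern: True = buried, False = exposed
pattern = [True, False, True, True, False, False, True, False, True, False]
designed, final_energy = design(pattern)
print(designed, final_energy)
```

Because the energy only looks at single positions, this toy has no packing term. That is the kind of simplification that gives molten globule designs, and it is why the real methods model side-chain packing explicitly.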
61Designing a non-zinc finger
62Design of a non-zinc finger (Dahiyat and Mayo)
63Alfabetin
64Non-MG alfabetin shows cooperative folding
65TOP7 (Rosetta Design)
- Novel fold
- Iterate between design and refinement
- Non molten globular behavior.
66Iteration between Seq and Str
Sequence Structure
67The project
- Three goals
- Learn how to develop a (binary) predictor
- Read the background literature
- Write a scientific report about your work
- Write a program that can do all of this.
- Additional goal (for top grades)
- Make a web-server
- Combine your predictor into a full system
68Tools
- Python (or other language)
- Write scripts to do the work
- svmlight
- Used in the bioinformatics course
- Preparsed datafiles
- Annotations
- PSIBLAST needs to be parsed
- Evaluation programs should be developed.
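A typical script for this workflow turns each residue into one training line for svmlight, which expects `<target> <index>:<value> ...` with feature indices starting at 1 and increasing along the line. The sketch below encodes a sequence window around each residue as sparse one-hot features. The window size, the feature layout and the function names are assumptions made for illustration, not part of the course material.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window_features(sequence, centre, window=3):
    """Sparse one-hot feature indices for a window centred on `centre`."""
    half = window // 2
    features = []
    for offset in range(-half, half + 1):
        pos = centre + offset
        # Positions outside the sequence or unknown residues add no feature
        if 0 <= pos < len(sequence) and sequence[pos] in AMINO_ACIDS:
            slot = offset + half
            features.append(slot * 20 + AMINO_ACIDS.index(sequence[pos]) + 1)
    return features  # increasing, because slots are visited left to right


def svmlight_line(label, features):
    """Format one example; label is +1 or -1."""
    return f"{label:+d} " + " ".join(f"{index}:1" for index in features)


sequence = "ACD"
labels = [1, -1, 1]  # made-up binary labels, e.g. helix vs. other
for i, label in enumerate(labels):
    print(svmlight_line(label, window_features(sequence, i)))
```

The same encoding works with PSI-BLAST profile values instead of one-hot values: replace the `1` after the colon with the profile score for that residue type.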
69The projects
- Binary classifier
- Alpha-helix/other, etc.
- Surface area
- Membrane/non-membrane
- Globular or membrane datasets
70The program
- Input.
- Sequence in fasta file
- Output
- A prediction for each residue in the sequence
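The input/output contract above can be sketched as a small skeleton: read FASTA records, then emit one prediction symbol per residue. The classifier here is a placeholder rule invented for illustration and stands in for the trained predictor; the function names are also illustrative.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def placeholder_classifier(sequence, i):
    """Stand-in for a trained model: calls A, E, L and M helix, the rest other."""
    return "H" if sequence[i] in "AELM" else "-"


def predict_residues(sequence, classify):
    """Return one prediction character per residue."""
    return "".join(classify(sequence, i) for i in range(len(sequence)))


example = [">seq1 toy example", "MAEL", "GPKT"]  # made-up input
for name, seq in read_fasta(example):
    print(">" + name)
    print(seq)
    print(predict_residues(seq, placeholder_classifier))
```

To read from a file, pass the open file object to `read_fasta`, since it only needs an iterable of lines.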
71Web-server
- For top grades (A, B) a web-server should be developed, using the following steps.
- Learn how to use PHP
- Ask for an account on a web-server.
- Use the template index.php available from the web-page.
72The report
- The following sections
- Abstract
- Introduction
- Methods
- Results and Discussion
- Conclusions
- References
- More info May 8