Title: Introduction to Bioinformatics 10. Machine Learning for Protein Structure Prediction
1Introduction to Bioinformatics 10. Machine Learning for Protein Structure Prediction 2
- Course 341
- Department of Computing
- Imperial College, London
2Coursework
- Link from web page now works
- Use a gap penalty of -1
- An alignment must have
- Both sequences the same length after adding gaps
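A minimal sketch of the two coursework constraints above, assuming gaps are written as "-" and that the penalty is charged once per gap character. The function names are hypothetical and the match/mismatch scores are not given on the slide, so only the gap part of the score is shown.

```python
GAP = "-"          # assumed gap symbol
GAP_PENALTY = -1   # value stated on the coursework slide


def is_valid_alignment(aligned_a, aligned_b, seq_a, seq_b):
    """Check that both rows have the same length after adding gaps
    and that removing the gaps gives back the original sequences."""
    same_length = len(aligned_a) == len(aligned_b)
    keeps_a = aligned_a.replace(GAP, "") == seq_a
    keeps_b = aligned_b.replace(GAP, "") == seq_b
    return same_length and keeps_a and keeps_b


def total_gap_penalty(aligned_a, aligned_b):
    """Sum the gap penalty over every gap character in both rows."""
    gaps = aligned_a.count(GAP) + aligned_b.count(GAP)
    return GAP_PENALTY * gaps
```

For example, aligning "ACGT" against "ACGGT" as "AC-GT" over "ACGGT" passes the check and contributes a gap penalty of -1.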
3Proteins
- Macromolecules called heteropolymers
- Complex conformation of amino acids (residues)
- Below around 40 residues they're called peptides
- Conformation = Fold = 3D Structure
- Dictates the function of the protein
- Which dictates how cells behave, etc.
- 40-50 residues tends to be the lower limit
- For proteins having a biological function
- See LESK textbook
4Why Study Proteins?
- "Major biological macromolecule responsible for the machinery and scaffold of cellular life" (Prof. Michael Sternberg)
- 3D Structure provides insight into
- Function
- Evolution
- Experimental design
- Systematic design of drugs
5Amino Acids
- Amino acids form the basic unit of all proteins
- A single amino acid always has
- An amino group NH2
- A carboxyl group COOH
- A hydrogen H (sometimes left off diagrams)
- A chemical group or side chain, "R"
- These are all joined to a central carbon atom
- The alpha carbon
6Amino Acid Picture
7The Primary Structure of Proteins
- Fixed main chain backbone
- Variable side chain
- One of 20 different amino acid side chains
- Chirality (left and right handedness)
8Hydrophobic Interactions
- Atomic charges dictate how folds occur
- Groups of C-H atoms have little charge
- Called hydrophobic or non-polar
- Hydrophobic groups pack together
- To avoid contact with solvent (aqueous solution)
- To minimise energy
- Hydrophobic and hydrophilic regions are the main driving force behind the folding process
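One way to make the hydrophobic/hydrophilic distinction concrete is a sliding-window hydropathy profile. The sketch below assumes the Kyte-Doolittle scale, which is a swapped-in choice and is not named on the slides; the function names and the window size are also illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale, not from the slides).
# Positive values are hydrophobic, negative values are hydrophilic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(segment):
    """Average hydropathy of a stretch of residues (one-letter codes)."""
    return sum(KYTE_DOOLITTLE[r] for r in segment) / len(segment)


def hydropathy_profile(sequence, window=5):
    """Mean hydropathy of every full window along the sequence."""
    last_start = len(sequence) - window
    return [mean_hydropathy(sequence[i:i + window])
            for i in range(last_start + 1)]
```

Stretches with a high window average are candidates for the buried, tightly packed regions described above; stretches with a low average are more likely to stay in contact with the solvent.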
9Protein Folding in an Aqueous Environment
[Figure: an unfolded chain surrounded by H2O molecules, and a folded chain with many atoms shielded from water contact]
10Secondary Structures
- Energetics of the side chains
- Rule out certain fold conformations
- Protein structures are not random
- Some larger (secondary) structures are observable
- Two common structures
- Alpha-helix
- Beta-sheet
- Connected by turns and loops
- E.g., Beta-turns
11Alpha Helices
12Beta Sheets (front and side)
13Protein Folds
- Tertiary structure
- 3D structure of an entire amino acid chain
- Sequential arrangements of chain sections
- Particularly alpha-helices and beta-sheets/strands
- Quaternary structure
- Proteins with more than one chain
- Arrangement of these chains
14Example of Quaternary Structure
15Structural Classes of Folds
- Different classes, including
- Alpha/Alpha
- Mainly packing of alpha helices
- Beta/Beta
- Mainly one or more beta sheets
- Alpha/Beta
- Roughly alternate alpha helices and beta sheets
- Alpha+Beta
- Mixed alpha helices and beta sheets
- Coil
- Mainly small proteins (fewer than 50 residues)
16Examples of Common Classes
17Example Proteins (OspC protein)
18Triose Phosphate Isomerase
19Hemerythrin
20Homologous Genes
- Similar sequences produce very similar structures
- Secondary structures are well preserved
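"Similar sequences" is usually quantified on an alignment. The sketch below computes percent identity under one common convention (matches divided by columns where neither sequence has a gap); other conventions exist, and the function name and gap symbol are assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both rows hold a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)
```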
21Back to Machine Learning
- Remember the task
- Automatically learn methods for predicting the
structure of proteins
[Diagram labels: Machine readable format; Learning process for the chosen representation; Structure Predictor]
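As an illustration of what a machine readable format could look like, the sketch below one-hot encodes a sliding window of residues around each position. This is a common representation for per-residue predictors, but it is an assumption here, not the encoding used in the course; the names, the padding symbol and the window size are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
PAD = "-"                             # assumed symbol for positions off the chain ends
ALPHABET = AMINO_ACIDS + PAD


def one_hot(residue):
    """21-element 0/1 vector with a single 1 at the residue's index."""
    return [1 if residue == symbol else 0 for symbol in ALPHABET]


def window_features(sequence, window=5):
    """One feature vector per residue, built from a centred window.

    Assumes an odd window size; the chain ends are padded with PAD.
    """
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    features = []
    for i in range(len(sequence)):
        vector = []
        for residue in padded[i:i + window]:
            vector.extend(one_hot(residue))
        features.append(vector)
    return features
```

Each residue then becomes a fixed-length vector (window size times 21 entries) that a learning algorithm can consume directly.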
22Predicting Secondary Structure for Each Residue
[Diagram: a Structure Predictor labels each residue, e.g. α α α α coil coil β β β β; Further Processing groups the α run into an Alpha Helix and the β run into a Beta Sheet]
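The further-processing step can be sketched as grouping consecutive per-residue labels into segments. The one-letter labels (H for alpha helix, E for beta strand, C for coil) and the function name are assumptions made for this example.

```python
from itertools import groupby


def label_segments(labels):
    """Collapse a per-residue label string into (label, start, end) runs.

    Positions are 0-based and end is exclusive.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments
```

For the ten-residue example on the slide, written as "HHHHCCEEEE", this gives one helix segment, one coil segment and one strand segment.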
23The CASP Competition
- Determining actual structure of a protein is hard
- It's a good idea to predict the structure
- Many, varied approaches to structure prediction
- Which one is the best?
- Critical Assessment of protein Structure Prediction (CASP)
- Blind trial to evaluate the different approaches
- Chemists do not publish the structures they've found until the predictions are in
- CASP: manual intervention is allowed
- CAFASP2: completely automatic prediction only
- Around 60 targets are used for the evaluation
24Continuous Server Evaluation
- Every time a new structure is determined
- Every method on a server is assessed
- Regular updates on status are provided
- Two well known services
- LiveBench (Poland)
- EVA (Columbia, USA)
- Roughly 100 structures per annum are tested
25Current State of the Art
- Secondary structure prediction
- Usually in the form of 3-state prediction (Q3)
- Alpha, Beta, Coil
- Majority class predictor gets about 40%
- Prediction accuracy has gone from about 60%
- Using rule-based methods
- To about 80% using machine learning methods
- Neural networks, SVMs are particularly good
- Inductive Logic Programming also very useful
- Learns explainable rules which add to the science
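The Q3 score and the majority class baseline mentioned above can be sketched as follows, assuming per-residue labels H, E and C and treating Q3 as the percentage of residues whose predicted state matches the observed one. The function names are hypothetical.

```python
from collections import Counter


def q3(predicted, observed):
    """Percentage of residues whose 3-state label is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def majority_baseline(observed):
    """Predict the most frequent observed state for every residue."""
    label, _count = Counter(observed).most_common(1)[0]
    return label * len(observed)
```

On a toy chain "HHHHCCCCCE" the baseline predicts coil everywhere and scores 50%, which shows why a real predictor has to beat the majority class figure before its Q3 means anything.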