Title: Protein Structure Prediction using Decision Lists
1Protein Structure Prediction using Decision Lists
- Volkan KURT
- Department of Computational Sciences
2Introduction
- Protein Structure
- What is Secondary Structure?
- What is Tertiary Structure?
- Secondary structure Prediction
- What are decision lists?
- GDL in Action
- Tertiary Structure Prediction
3Protein Structure
- Primary Structure
- Sequences
- Secondary Structure
- Frequent Motifs
- Tertiary Structure
- Functional Form
- Quaternary Structure
- Protein complexes
4Primary Structure
- Sequence information
- Contains only amino acid sequences
- 24 amino acid codes present
- 20 standard residues
- Glutamine or Glutamic Acid -> GLX (GLU)
- Asparagine or Aspartic Acid -> ASX (ASN)
- Others (Non-natural/Unknown) -> X
- Selenocysteine, Pyrrolysine
5Secondary Structure
- Rigid structure motifs
- Do not give information about coordinates of residues
- Can be seen as a one-dimensional reduction of the tertiary structure
- If accurately predicted, can be used to
- Predict the final (tertiary) structure
- Predict the fold type (all-alpha/all-beta etc.)
6Common Secondary Structure Motifs
Alpha-helix
Parallel beta-sheet
Antiparallel beta-sheet
7Tertiary/Quaternary Structure
- Tertiary Structure
- The functional form
- Coordinates of residues in the space
- Quaternary Structure
- Protein-protein complexes
- Assembly of two or more protein chains
8Tertiary / Quaternary Structure
9Structure Prediction
- Easier to determine sequence than structure
- Predictions may help close the gap
10Secondary Structure Prediction
- Assessment of Prediction Accuracy
- Common Strategy
- Methods in Literature
- Decision Lists
- Prediction using GDL
- A Performance Bound
11Secondary Structure Prediction
- Predictions based on
- Sequence Information
- Multiple Sequence Alignments
- Various algorithms present based on
- Information Theory
- Machine Learning
- Neural Networks etc.
12Assessment of Accuracy
- Determination method
- DSSP
- Performance Metric
- Q3 accuracy
- Three state accuracy (helix/strand/loop)
- Data set selection
- Non-redundancy
- Homology Information
- Multiple Sequence Alignments
- Cross-Validation
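The Q3 metric listed above is the percentage of residues whose three-state label is predicted correctly. A minimal sketch, assuming both the prediction and the DSSP-derived reference are given as equal-length strings over the letters H, E and L (the letter choice and the function name `q3_accuracy` are illustrative, not taken from the slides):

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions where the three-state labels agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    # Count per-residue agreements between prediction and reference.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For example, a prediction that matches the reference at 7 of 8 positions scores 87.5.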
13Two Levels of Prediction
- First Level
- Sequence to Structure
- Input
- Sequence Information
- Multiple Sequence Alignments
- Method
- Machine Learning
- Neural Networks
- Output
- Secondary Structure
- Diagram: Sequence and MSA -> Sequence to Structure -> Secondary Structure
14Two Levels of Prediction
- Second Level
- Structure to Structure
- Input
- Structure Information
- Method
- Machine Learning
- Neural Networks
- Filter
- Simple Filters
- Jury Decisions
- Output
- Secondary Structure
- Diagram: Secondary Structure -> Structure to Structure -> Filter -> Secondary Structure
15GORV
- Diagram components: Sequence, PSI-BLAST, Information Function / Bayesian Statistics, Majority Vote, Filter, Secondary Structure
- Values shown on the diagram: 6.5, 66.9, 73.4
- Garnier et al., 2002
16PHD
- Diagram components: Neural Network, Jury Filter, Secondary Structure
- Values shown on the diagram: 4.3, 62.6 / 67.4, 3.4, 70.8, 61.7 / 65.9
- Rost & Sander, 1993
17JNet
- Diagram components: Profile (PSIBLAST, HMMER2, CLUSTALW), Neural Network, Neural Network, Jury / Jury Network, Secondary Structure
- Value shown on the diagram: 76.9
- Cuff & Barton, 2000
18PSIPRED
- Diagram components: Profiles (PSI-BLAST), Neural Network, Neural Network, Secondary Structure
- Value shown on the diagram: 76.3
- Jones, 1999
19Decision Lists
- Machine Learning method
- Simply, a list of rules
- Each rule asserts a guess
- Generalization by simple rule pruning
- Output is human readable/understandable
20Sample Decision List
- If LeftHelix and RightHelix -> Helix
- Else if LeftStrand and RightStrand -> Strand
- Else if LeftHelix and RightLoop -> Helix
- Else -> Loop
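The sample list above can be read top to bottom, with the first matching rule deciding the label. A minimal sketch of that evaluation, assuming each data point is a dictionary of boolean features (the dictionary encoding and the names `SAMPLE_LIST` and `classify` are illustrative assumptions):

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, bool]
Rule = Tuple[Callable[[Features], bool], str]  # (condition, asserted label)

# The four rules of the sample list, in order; the last one is the default.
SAMPLE_LIST: List[Rule] = [
    (lambda f: f.get("LeftHelix", False) and f.get("RightHelix", False), "Helix"),
    (lambda f: f.get("LeftStrand", False) and f.get("RightStrand", False), "Strand"),
    (lambda f: f.get("LeftHelix", False) and f.get("RightLoop", False), "Helix"),
    (lambda f: True, "Loop"),
]

def classify(features: Features, rules: List[Rule] = SAMPLE_LIST) -> str:
    """Return the label asserted by the first rule whose condition holds."""
    for condition, label in rules:
        if condition(features):
            return label
    raise ValueError("decision list has no default rule")
```

A residue with a helix on its left and a loop on its right is labelled Helix by the third rule; a residue with no features set falls through to the default and is labelled Loop.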
21GDL
- Greedy Decision List
- Start with a global (base) rule
- At every step
- Find the maximum gain rule
- Append to previous list
- Stop when gain change is 0
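The greedy procedure above can be sketched as follows. This is only one possible reading: the candidate rules are restricted to a single (position, residue) test, gain is measured as the net number of training examples corrected, and a later rule overrides earlier ones. None of these choices is confirmed by the slides, and the names `learn_gdl`, `predict` and `max_rules` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Example = Tuple[Sequence[str], str]           # (frame, true label)
Rule = Tuple[Optional[Tuple[int, str]], str]  # ((position, residue) or None, label)

def matches(rule: Rule, frame: Sequence[str]) -> bool:
    condition = rule[0]
    return condition is None or frame[condition[0]] == condition[1]

def predict(rules: List[Rule], frame: Sequence[str]) -> str:
    """Assumption: the most recently appended matching rule decides."""
    label = rules[0][1]
    for rule in rules:
        if matches(rule, frame):
            label = rule[1]
    return label

def learn_gdl(data: List[Example], max_rules: int = 20) -> List[Rule]:
    # Base rule: predict the most frequent label everywhere.
    base_label = Counter(label for _, label in data).most_common(1)[0][0]
    rules: List[Rule] = [(None, base_label)]
    labels = sorted({label for _, label in data})
    conditions = sorted({(i, r) for frame, _ in data for i, r in enumerate(frame)})
    while len(rules) < max_rules:
        current = [predict(rules, frame) for frame, _ in data]
        best_gain, best_rule = 0, None
        for condition in conditions:
            for label in labels:
                # Net change in correct predictions if this rule were appended.
                gain = sum((label == truth) - (cur == truth)
                           for (frame, truth), cur in zip(data, current)
                           if frame[condition[0]] == condition[1])
                if gain > best_gain:
                    best_gain, best_rule = gain, (condition, label)
        if best_rule is None:  # no rule improves the training set: stop
            break
        rules.append(best_rule)
    return rules
```

The exhaustive search over conditions is quadratic in the worst case and is meant only to make the gain criterion concrete, not to be efficient.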
22Rule Search
- Initially everything is predicted to be the most frequently seen structure (i.e. loop)
- Diagram: the training set partitioned with respect to the base rule into correct and false assignments
23Rule Search
- At each step add the maximum gain rule
- Diagram: the training set partitioned with respect to the base rule, then with respect to the second rule
24Data Representation
- Frames of length W
- Context of an amino acid is represented by W residues
- (W-1)/2 to the left, (W-1)/2 to the right
- If the frame extends past the termini, the missing positions are represented as NAN
- GLX -> GLN, ASX -> ASN
- Newly found/non-natural amino acids -> X
25Sample Data
- evealekkvaaLesvqalekkvealehg
- Frame Size 5
- Represents the features used in the prediction of
secondary structure for L (leucine)
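The frame for the sample sequence can be reproduced with a short sketch. It assumes the frame is centred on the residue of interest and that positions beyond either terminus are filled with the string "NAN"; the function name `frame` is illustrative.

```python
from typing import List

PAD = "NAN"  # placeholder for positions beyond the termini

def frame(sequence: str, index: int, width: int) -> List[str]:
    """Return the width-residue context centred on sequence[index]."""
    if width % 2 == 0:
        raise ValueError("frame width must be odd")
    half = (width - 1) // 2
    return [sequence[i] if 0 <= i < len(sequence) else PAD
            for i in range(index - half, index + half + 1)]

sample = "evealekkvaaLesvqalekkvealehg"
leucine_frame = frame(sample, sample.index("L"), 5)
```

With frame size 5, the context of the capitalised L is the list a, a, L, e, s; at the first residue the two left positions are padded.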
26GDL
- DSSP assignments
- Reduction
- E (extended strand), B (beta bridge) -> Strand
- H (alpha helix), G (3-10 helix) -> Helix
- Others -> Loop
- Data set
- CB513 set
- 7-fold cross-validation
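The reduction of DSSP assignments to three states can be written as a small lookup. In this sketch the output letters E, H and L for strand, helix and loop are an assumed encoding, and `reduce_dssp` is an illustrative name.

```python
# DSSP codes that map to Strand (E) or Helix (H); everything else is Loop (L).
DSSP_TO_3STATE = {"E": "E", "B": "E", "H": "H", "G": "H"}

def reduce_dssp(dssp: str) -> str:
    """Collapse a string of DSSP codes into three-state labels."""
    return "".join(DSSP_TO_3STATE.get(code, "L") for code in dssp)
```

For instance, the DSSP string HGEBTS- reduces to HHEELLL.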
27Two-level Algorithm
- Sequence to Structure List
- Find the first rule that matches the data point
- Assign the output of that rule
- A frame of 9 residues is input
- Output Secondary Structure
- Structure to Structure List
- After all predictions are made, check for possible improvements
- A frame of 19 secondary structures is input
- Output Secondary Structure
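The two passes above can be chained as in the following sketch. The two classifiers are passed in as plain functions; `toy_first_level` and `toy_second_level` are hypothetical stand-ins for the learned decision lists, and the "NAN" padding at the termini is an assumption carried over from the data representation.

```python
from typing import Callable, List, Sequence

PAD = "NAN"  # assumed placeholder for positions beyond either terminus

def windows(items: Sequence[str], width: int) -> List[List[str]]:
    """One centred window of the given odd width per position."""
    half = (width - 1) // 2
    n = len(items)
    return [[items[j] if 0 <= j < n else PAD
             for j in range(i - half, i + half + 1)]
            for i in range(n)]

def two_level_predict(sequence: Sequence[str],
                      first_level: Callable[[List[str]], str],
                      second_level: Callable[[List[str]], str],
                      w1: int = 9, w2: int = 19) -> List[str]:
    # Level 1: sequence to structure, from frames of w1 residues.
    first = [first_level(w) for w in windows(sequence, w1)]
    # Level 2: structure to structure, from frames of w2 predicted labels.
    return [second_level(w) for w in windows(first, w2)]

# Hypothetical stand-ins for the two learned decision lists.
def toy_first_level(frame: List[str]) -> str:
    return "H" if frame[len(frame) // 2] == "a" else "L"

def toy_second_level(frame: List[str]) -> str:
    mid = len(frame) // 2
    if frame[mid - 1] == "H" and frame[mid + 1] == "H":
        return "H"  # fill a one-residue gap between two helix labels
    return frame[mid]
```

With these toy classifiers, the sequence aaLaa is first labelled HHLHH and then smoothed to HHHHH by the second pass.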
28GDL
- Diagram components: Sequence, PSI-Blast, GDL (first level), GDL (second level), Secondary Structure
- Values shown on the diagram: 6.67, 60.48, 62.54 / 69.21
29GDL Performance
- Performance of first level decision list
- Without homologs 60.48 (29 to 66 rules)
- With homologs 66.36 (46 to 68 rules)
- Performance of second level decision list
- Without homologs 62.54 (18 to 116 rules)
- With homologs 69.21 (16 to 40 rules)
30GDL Performance
- Performance at 20 rules at both steps
- Without homologs 62.15
- With homologs 69.08
- Possible to make a back-of-the-envelope structure
prediction using our model
31GDL/PHD/GORV
32Discussion - Why GDL?
- Amazingly simple models
- With as few as 20 rules in the first level and as few as 20 rules in the second
- Rules (Models) are human-readable
- Biological rules may be inferred
- Second level decision list may be used as a
filter for other algorithms
33GDL Rules
- The first three rules of the sequence-to-structure level
- 58.86 performance (of 66.36)
- First Rule
- Everything -> Loop
- Default rule
34GDL Rule 2
35GDL Rule 3
36A Performance Bound Claim
- Using only sequence information, the highest achievable performance has an upper bound
- The lower bound
- 43, with everything assigned as loop
- 49, with every residue assigned the most probable structure
- The upper bound
- 75, with non-homologous data
37A Performance Bound Claim
- Bound is calculated by
- Taking only the exact sequence matches in the training and testing sets
- Assigning the most frequently seen value of that frame in the training set as the guess
- Comparing with the actual value
- A bound for non-homologous training and testing sets
- A bound for carefully selected frame size
- Not too short (assignments would be almost random)
- Not too long (only unique frames will be available)
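The bound calculation described above can be sketched as a frame lookup. This is one reading of the procedure: only test frames that also occur verbatim in the training set are scored, and each is guessed as its most frequent training label. The function name `exact_match_bound` and the choice to return None when nothing matches are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

def exact_match_bound(train: Iterable[Tuple[str, str]],
                      test: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Accuracy (percent) over test frames that also occur in training."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for frame, label in train:
        counts[frame][label] += 1
    matched = correct = 0
    for frame, label in test:
        if frame in counts:
            matched += 1
            # Guess the label seen most often for this exact frame.
            guess = counts[frame].most_common(1)[0][0]
            correct += guess == label
    if matched == 0:
        return None  # no exact matches, so no bound can be estimated
    return 100.0 * correct / matched
```

The frame-size trade-off shows up directly here: short frames match often but carry conflicting labels, while long frames rarely match at all.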
38Upper Bound with Homologs
39Upper Bound without Homologs
40Tertiary Structure Prediction
- Predictions based on backbone dihedral angles
- Phi and Psi angles fully define the tertiary structure
- Goal
- Discover the right level of granularity
41Data Set Selection
- PDB-Select
- A set of non-homologous proteins of high resolution [Hobohm & Sander, 1994]
- Data representation
- Frames of 9 residues
- Residue names plus residue properties
- Hydrophobicity, polarity, volume, charge etc.
- Train/Validation/Test
42Data Discretization
- Phi/Psi angles are continuous
- We need a discrete representation to predict them in a decision list
- Split the (-180, 180) region into bins
- Split the Ramachandran plot into bins
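One simple way to discretize the angles is with equal-width bins. The slides do not say how the bins are laid out, so the equal-width grid, the wrap-around of 180 onto -180, and the names `angle_bin` and `ramachandran_bin` are all assumptions of this sketch.

```python
def angle_bin(angle_deg: float, n_bins: int) -> int:
    """Index of the equal-width bin on [-180, 180) containing the angle."""
    width = 360.0 / n_bins
    wrapped = (angle_deg + 180.0) % 360.0  # shift into [0, 360)
    # min() guards against floating-point edge cases at the upper boundary.
    return min(int(wrapped // width), n_bins - 1)

def ramachandran_bin(phi: float, psi: float, n_bins: int) -> int:
    """Single cell index on an n_bins x n_bins grid over the (phi, psi) plane."""
    return angle_bin(phi, n_bins) * n_bins + angle_bin(psi, n_bins)
```

Choosing `n_bins` is where the "right level of granularity" question appears: fewer bins are easier to predict but describe the backbone more coarsely.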
43Ramachandran Plot (1)
44Ramachandran Plot (2)
Karplus, 1996
45How to Predict?
- Predictions using sequence information
- No homology information
- Predicted angles may be incorporated
- Upper bounds will be given
- Accuracy
- Percent of correct estimates
- RMSD of phi and psi angles
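An RMSD over angles needs some care at the ±180 boundary, since 179 and -179 are only 2 degrees apart. The sketch below takes the shortest angular difference before squaring; whether the original evaluation handled wrap-around this way is not stated, so treat it as an assumption. The names `angular_diff` and `angle_rmsd` are illustrative.

```python
import math
from typing import Sequence

def angular_diff(a: float, b: float) -> float:
    """Signed shortest difference a - b in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def angle_rmsd(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-square deviation between two lists of angles in degrees."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(angular_diff(p, t) ** 2 for p, t in zip(predicted, actual))
    return math.sqrt(total / len(actual))
```

The same function can be applied separately to the phi and the psi angles of a chain.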
46Using Predicted Angles
47Performance Accuracy
48Performance RMSD
49Performance Backbone RMSD
50Performance Input Features
51Performance Real Prediction
52Future Work
- For tertiary structure predictions
- The two-leveled approach may be applied to tertiary structure predictions
- Homology information may be incorporated
- For secondary structure predictions
- Should find better homologues and better representations
- Incorporating sequence and homology information in the structure-to-structure part may be an option
- For both predictions
- A reliability index for predicted structure