Protein Structure Prediction using Decision Lists

1
Protein Structure Prediction using Decision Lists

Volkan KURT
Department of Computational Sciences

2
Introduction

Protein Structure
What is Secondary Structure?
What is Tertiary Structure?
Secondary structure Prediction
What are decision lists?
GDL in Action
Tertiary Structure Prediction

3
Protein Structure

Primary Structure
Sequences
Secondary Structure
Frequent Motifs
Tertiary Structure
Functional Form
Quaternary Structure
Protein complexes

4
Primary Structure

Sequence information
Contains only amino acid sequences
24 amino acid codes present
20 standard residues
Glutamine or Glutamic Acid -> GLX (GLU)
Asparagine or Aspartic Acid -> ASX (ASN)
Others (Non-natural/Unknown) -> X
Selenocysteine, Pyrrolysine

5
Secondary Structure

Rigid structure motifs
Do not give information about coordinates of
residues
Can be seen as a one-dimensional reduction of the
tertiary structure
If accurately predicted, can be used to
Predict the final (tertiary) structure
Predict the fold type (all-alpha/all-beta etc.)

6
Common Secondary Structure Motifs

Alpha-helix
Parallel beta-sheet
Antiparallel beta-sheet

7
Tertiary/Quaternary Structure

Tertiary Structure
The functional form
Coordinates of residues in the space
Quaternary Structure
Protein-protein complexes
Assembly of one or more proteins

8
Tertiary / Quaternary Structure
9
Structure Prediction

Easier to determine sequence than structure
Predictions may help close the gap

10
Secondary Structure Prediction

Assessment of Prediction Accuracy
Common Strategy
Methods in Literature
Decision Lists
Prediction using GDL
A Performance Bound

11
Secondary Structure Prediction

Predictions based on
Sequence Information
Multiple Sequence Alignments
Various algorithms present based on
Information Theory
Machine Learning
Neural Networks etc.

12
Assessment of Accuracy

Determination method
DSSP
Performance Metric
Q3 accuracy
Three state accuracy (helix/strand/loop)
Data set selection
Non-redundancy
Homology Information
Multiple Sequence Alignments
Cross-Validation
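
The Q3 metric named above is the percentage of residues whose three-state label (helix/strand/loop) is predicted correctly. A minimal sketch, assuming both assignments are given as equal-length strings over the letters H, E and L (this encoding is an assumption for illustration, not something the slides specify):

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions where the three-state labels agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty assignment")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example: 6 of 8 positions agree.
print(q3_accuracy("HHHLLEEL", "HHLLLEEE"))
```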

13
Two Levels of Prediction

First Level
Sequence to Structure
Input
Sequence Information
Multiple Sequence Alignments
Method
Machine Learning
Neural Networks
Output
Secondary Structure

[Diagram: Sequence, MSA -> Sequence to Structure -> Secondary Structure]

14
Two Levels of Prediction

Second Level
Structure to Structure
Input
Structure Information
Method
Machine Learning
Neural Networks
Filter
Simple Filters
Jury Decisions
Output
Secondary Structure

[Diagram: Secondary Structure -> Structure to Structure -> Filter -> Secondary Structure]

15
GORV

[Diagram labels: Sequence, PSI-BLAST, Information Function / Bayesian Statistics, Majority Vote, Filter, Secondary Structure. Values shown: 6.5, 66.9, 73.4]
Garnier et al, 2002

16
PHD

[Diagram labels: Secondary Structure, Neural Network, Jury Filter, Secondary Structure. Values shown: 4.3, 62.6 / 67.4, 3.4, 70.8, 61.7 / 65.9]
Rost & Sander, 1993

17
JNet

[Diagram labels: Profile, PSIBLAST, HMMER2, CLUSTALW, Neural Network, Neural Network, Jury / Jury Network, Secondary Structure. Value shown: 76.9]
Cuff & Barton, 2000

18
PSIPRED

[Diagram labels: Profiles, PSI-BLAST, Neural Network, Neural Network, Secondary Structure. Value shown: 76.3]
Jones, 1999

19
Decision Lists

Machine Learning method
Simply, a list of rules
Each rule asserts a guess
Generalization by simple rule pruning
Output is human readable/understandable

20
Sample Decision List

LeftHelix and RightHelix -> Helix
LeftStrand and RightStrand -> Strand
LeftHelix and RightLoop -> Helix
Else -> Loop
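
The sample list above can be read as ordered rules where the first matching rule decides the output. A minimal sketch of that evaluation; the dictionary keys "left" and "right" and the function name classify are hypothetical names chosen for this illustration:

```python
from typing import Callable, List, Tuple

# Each rule pairs a condition on the context with the label it asserts.
# The context keys "left" and "right" are illustrative assumptions.
Rule = Tuple[Callable[[dict], bool], str]

SAMPLE_LIST: List[Rule] = [
    (lambda c: c["left"] == "Helix" and c["right"] == "Helix", "Helix"),
    (lambda c: c["left"] == "Strand" and c["right"] == "Strand", "Strand"),
    (lambda c: c["left"] == "Helix" and c["right"] == "Loop", "Helix"),
    (lambda c: True, "Loop"),  # the "Else" rule
]


def classify(context: dict, rules: List[Rule] = SAMPLE_LIST) -> str:
    """Return the label of the first rule whose condition holds."""
    for condition, label in rules:
        if condition(context):
            return label
    raise ValueError("decision list has no default rule")


print(classify({"left": "Helix", "right": "Loop"}))
print(classify({"left": "Loop", "right": "Strand"}))
```

Because the rules are checked in order, only the rule that fires needs to be read to explain a prediction, which is what makes the output human readable.
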
21
GDL

Greedy Decision List
Start with a global (base) rule
At every step
Find the maximum gain rule
Append to previous list
Stop when gain change is 0
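
The greedy loop above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: rules are restricted to single tests of the form "position p holds symbol s -> label", gain is the net number of training examples fixed by a rule, and a rule added later overrides earlier ones (so the base rule acts as the fallback). The names train_gdl and predict are hypothetical.

```python
from collections import Counter
from itertools import product


def predict(rules, default, frame):
    # Rules are appended during training; the most recently added match wins.
    for pos, symbol, label in reversed(rules):
        if frame[pos] == symbol:
            return label
    return default  # base rule


def train_gdl(data, labels=("H", "E", "L"), max_rules=50):
    """data: list of (frame, label) pairs; all frames have the same length."""
    default = Counter(y for _, y in data).most_common(1)[0][0]  # base rule
    rules = []
    width = len(data[0][0])
    alphabet = sorted({ch for frame, _ in data for ch in frame})
    while len(rules) < max_rules:
        current = [predict(rules, default, x) for x, _ in data]
        best_gain, best_rule = 0, None
        for pos, symbol, label in product(range(width), alphabet, labels):
            gain = 0
            for (x, y), y_hat in zip(data, current):
                if x[pos] == symbol:
                    # +1 if the rule fixes an error, -1 if it breaks a hit.
                    gain += (label == y) - (y_hat == y)
            if gain > best_gain:
                best_gain, best_rule = gain, (pos, symbol, label)
        if best_rule is None:  # no rule has positive gain: stop
            break
        rules.append(best_rule)
    return default, rules


toy = [("AAA", "H"), ("AAB", "H"), ("BAB", "L"), ("BBB", "L"), ("BBA", "L")]
print(train_gdl(toy))
```

The exhaustive search over candidate rules is quadratic in practice and only suitable for toy data; it is kept here because it makes the gain computation explicit.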

22
Rule Search

Initially everything is predicted to be the
most frequently seen structure (i.e. loop)

[Diagram: Training Set, partitioned with respect to the Base Rule into Correct Assignments and False Assignments]

23
Rule Search

At each step add the maximum gain rule

[Diagram: Partition with respect to the Base Rule, followed by Partition with respect to the Second Rule]

24
Data Representation

Frames of length W
Context of an amino acid is represented by W
residues
(W-1)/2 to the left, (W-1)/2 to the right
If the frame exceeds the termini, the missing positions are
represented as NAN
GLX -> GLN, ASX -> ASN
Newly found/non-natural amino acids -> X

25
Sample Data

evealekkvaaLesvqalekkvealehg
Frame Size 5
Represents the features used in the prediction of
secondary structure for L (leucine)
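
A minimal sketch of how such a frame could be extracted, assuming the sequence is held as a list of one-letter codes and positions beyond either terminus are filled with the NAN placeholder. The function name frame and its parameters are illustrative, not taken from the slides.

```python
PAD = "NAN"  # placeholder for positions beyond either terminus


def frame(residues, center, width):
    """Return the width residues centred on index center, padded at the ends."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the frame has a centre")
    half = (width - 1) // 2
    return [
        residues[i] if 0 <= i < len(residues) else PAD
        for i in range(center - half, center + half + 1)
    ]


seq = list("evealekkvaaLesvqalekkvealehg")
print(frame(seq, seq.index("L"), 5))  # frame of size 5 around the leucine
print(frame(seq, 0, 5))               # frame at the N-terminus needs padding
```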

26
GDL

DSSP assignments
Reduction
E (extended strand), B (beta bridge) -> Strand
H (alpha helix), G (3-10 helix) -> Helix
Others -> Loop
Data set
CB513 set
7-fold cross-validation
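
The reduction listed above can be written as a small lookup. A minimal sketch, assuming the DSSP assignment is available as a string of one-letter DSSP codes and using E, H and L as output letters for strand, helix and loop (the output letters are a choice made for this example):

```python
# Reduction stated on the slide: E, B -> strand; H, G -> helix; others -> loop.
DSSP_TO_THREE_STATE = {"E": "E", "B": "E", "H": "H", "G": "H"}


def reduce_dssp(dssp: str) -> str:
    """Map a DSSP string to three states; unlisted codes become loop."""
    return "".join(DSSP_TO_THREE_STATE.get(code, "L") for code in dssp)


print(reduce_dssp("HHGGTTEEB SS"))
```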

27
2-level Algorithm

Sequence to Structure List
Find the first rule that matches the data point
Assign the output of that rule
A frame of 9 residues is input
Output Secondary Structure
Structure to Structure List
After all predictions are made, check for
possible improvements
A frame of 19 secondary structures is input
Output Secondary Structure
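
The two levels can be chained as sketched below. This is an illustration under assumptions: rules are simplified to (index in frame, symbol, output) triples, the first matching rule wins, and when no second-level rule fires the first-level prediction is kept. The rule lists in the example are toy values, not rules learned by GDL.

```python
def frames(symbols, width, pad="NAN"):
    """One frame of the given odd width per position, padded at both ends."""
    half = (width - 1) // 2
    padded = [pad] * half + list(symbols) + [pad] * half
    return [padded[i:i + width] for i in range(len(symbols))]


def apply_list(rules, default, frame):
    # First matching rule wins; each rule is (index in frame, symbol, output).
    for idx, symbol, output in rules:
        if frame[idx] == symbol:
            return output
    return default


def predict_two_level(sequence, seq_rules, struct_rules, default="L"):
    # Level 1: sequence to structure, frames of 9 residues.
    first = [apply_list(seq_rules, default, f) for f in frames(sequence, 9)]
    # Level 2: structure to structure, frames of 19 predicted states.
    second = [
        apply_list(struct_rules, first[i], f)
        for i, f in enumerate(frames(first, 19))
    ]
    return "".join(first), "".join(second)


# Toy rules: centre residue A -> H; left neighbour predicted H -> H.
level1, level2 = predict_two_level("AAGAA", [(4, "A", "H")], [(8, "H", "H")])
print(level1, level2)
```

In the toy run the second level fills the one-residue gap left by the first level, which is the kind of clean-up a structure-to-structure pass is meant to perform.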

28
GDL

[Diagram labels: Sequence, PSI-Blast, GDL, GDL, Secondary Structure. Values shown: 6.67, 60.48, 62.54 / 69.21]

29
GDL Performance

Performance of first level decision list
Without homologs: 60.48% (29 to 66 rules)
With homologs: 66.36% (46 to 68 rules)
Performance of second level decision list
Without homologs: 62.54% (18 to 116 rules)
With homologs: 69.21% (16 to 40 rules)

30
GDL Performance

Performance at 20 rules at both steps
Without homologs: 62.15%
With homologs: 69.08%
Possible to make a back-of-the-envelope structure
prediction using our model

31
GDL/PHD/GORV
32
Discussion - Why GDL?

Amazingly simple models
With as few as 20 rules in the first level and as
few as 20 rules in the second
Rules (Models) are human-readable
Biological rules may be inferred
Second level decision list may be used as a
filter for other algorithms

33
GDL Rules

The first three rules of the sequence-to-structure
level
58.86% performance (of 66.36%)
First Rule
Everything -> Loop
Default rule

34
GDL Rule 2
35
GDL Rule 3
36
A Performance Bound Claim

Using only sequence information, the highest
achievable performance has an upper bound
The lower bound
43%, with everything assigned as loop
49%, with every residue assigned the most
probable structure
The upper bound
75%, with non-homologous data

37
A Performance Bound Claim

Bound is calculated by
Taking only the exact sequence matches in the
training and testing sets
Assign the most frequently seen value of that frame in the
training set as the guess
Compare with the actual value
A bound for non-homologous training and testing
sets
A bound for carefully selected frame size
Not too short (assignments would be almost
random)
Not too long (only unique frames will be
available)
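
The bound calculation described above can be sketched as a lookup-table experiment. This is a minimal illustration, assuming frames are hashable (for example strings) and that only test frames with an exact match in the training set are scored; the function name exact_match_bound and the coverage figure it also returns are additions for this example.

```python
from collections import Counter, defaultdict


def exact_match_bound(train, test):
    """train, test: lists of (frame, label).

    Returns (accuracy %, coverage %) over test frames seen in training,
    or None when no test frame has an exact match.
    """
    votes = defaultdict(Counter)
    for frame, label in train:
        votes[frame][label] += 1

    matched = correct = 0
    for frame, label in test:
        if frame in votes:
            matched += 1
            guess = votes[frame].most_common(1)[0][0]  # majority label
            correct += guess == label
    if matched == 0:
        return None
    return 100.0 * correct / matched, 100.0 * matched / len(test)


train = [("abc", "H"), ("abc", "H"), ("abc", "L"), ("xyz", "E")]
test = [("abc", "H"), ("abc", "L"), ("xyz", "E"), ("qqq", "L")]
print(exact_match_bound(train, test))
```

The coverage value makes the frame-size trade-off visible: short frames match almost always but vote nearly at random, while long frames rarely match at all.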

38
Upper Bound with Homologs
39
Upper Bound without Homologs
40
Tertiary Structure Prediction

Predictions based on backbone dihedral angles
Phi and Psi angles fully define the tertiary
structure
Goal
Discover the right level of granularity

41
Data Set Selection

PDB-Select
A set of non-homologous proteins of high
resolution (Hobohm & Sander, 1994)
Data representation
Frames of 9 residues
Residue names plus residue properties
Hydrophobicity, polarity, volume, charge etc.
Train/Validation/Test

42
Data Discretization

Phi/Psi angles are continuous
We need a discrete representation to predict them
in a decision list
Split the (-180, 180) region into bins
Split the Ramachandran plot into bins
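
One way to discretize the angles is an equal-width grid, sketched below. The equal-width choice, the bin count and the function names are assumptions for illustration; the slides do not state which binning was used, and a binning of the Ramachandran plot may instead follow populated regions rather than a regular grid.

```python
def angle_bin(angle_deg: float, n_bins: int) -> int:
    """Index of the equal-width bin over [-180, 180) containing the angle."""
    wrapped = (angle_deg + 180.0) % 360.0  # shift to [0, 360); 180 wraps to -180
    return min(int(wrapped // (360.0 / n_bins)), n_bins - 1)


def ramachandran_cell(phi: float, psi: float, n_bins: int) -> int:
    """Single class index for a (phi, psi) pair on an n_bins x n_bins grid."""
    return angle_bin(phi, n_bins) * n_bins + angle_bin(psi, n_bins)


print(angle_bin(-60.0, 4))
print(ramachandran_cell(-60.0, -45.0, 4))
```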

43
Ramachandran Plot (1)
44
Ramachandran Plot (2)
Karplus, 1996
45
How to Predict?

Predictions using sequence information
No homology information
Predicted angles may be incorporated
Upper bounds will be given
Accuracy
Percent of correct estimates
RMSD of phi and psi angles
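
A minimal sketch of an RMSD over angles. It assumes angles are in degrees and that the difference is taken along the shortest arc, so that 170 and -170 are 20 degrees apart; whether the presentation handled wrap-around this way is not stated.

```python
import math


def angular_rmsd(predicted, observed):
    """RMSD in degrees between two equal-length angle lists (shortest arc)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, o in zip(predicted, observed):
        diff = (p - o + 180.0) % 360.0 - 180.0  # wrapped to [-180, 180)
        total += diff * diff
    return math.sqrt(total / len(predicted))


print(angular_rmsd([170.0, -170.0], [-170.0, 170.0]))
```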

46
Using Predicted Angles
47
Performance: Accuracy
48
Performance: RMSD
49
Performance: Backbone RMSD
50
Performance: Input Features
51
Performance: Real Prediction
52
Future Work

For tertiary structure predictions
The two-level approach may be applied to
tertiary structure predictions
Homology information may be incorporated
For secondary structure predictions
Should find better homologs and better
representations
Incorporating sequence and homology information
in the structure to structure part may be an
option
For both predictions
A reliability index for predicted structure