Protein Structure Prediction using Decision Lists

1
Protein Structure Prediction using Decision Lists
  • Volkan KURT
  • Department of Computational Sciences

2
Introduction
  • Protein Structure
  • What is Secondary Structure?
  • What is Tertiary Structure?
  • Secondary structure Prediction
  • What are decision lists?
  • GDL in Action
  • Tertiary Structure Prediction

3
Protein Structure
  • Primary Structure
  • Sequences
  • Secondary Structure
  • Frequent Motifs
  • Tertiary Structure
  • Functional Form
  • Quaternary Structure
  • Protein complexes

4
Primary Structure
  • Sequence information
  • Contains only amino acid sequences
  • 24 amino acid codes present
  • 20 standard residues
  • Glutamine or Glutamic Acid -> GLX (GLU)
  • Asparagine or Aspartic Acid -> ASX (ASN)
  • Others (Non-natural/Unknown) -> X
  • Selenocysteine, Pyrrolysine

5
Secondary Structure
  • Rigid structure motifs
  • Do not give information about coordinates of
    residues
  • Can be seen as a one-dimensional reduction of the
    tertiary structure
  • If accurately predicted, can be used to
  • Predict the final (tertiary) structure
  • Predict the fold type (all-alpha/all-beta etc.)

6
Common Secondary Structure Motifs
Alpha-helix
Parallel beta-sheet
Antiparallel beta-sheet
7
Tertiary/Quaternary Structure
  • Tertiary Structure
  • The functional form
  • Coordinates of residues in space
  • Quaternary Structure
  • Protein-protein complexes
  • Assembly of one or more proteins

8
Tertiary / Quaternary Structure
9
Structure Prediction
  • Easier to determine sequence than structure
  • Predictions may help close the gap

10
Secondary Structure Prediction
  • Assessment of Prediction Accuracy
  • Common Strategy
  • Methods in Literature
  • Decision Lists
  • Prediction using GDL
  • A Performance Bound

11
Secondary Structure Prediction
  • Predictions based on
  • Sequence Information
  • Multiple Sequence Alignments
  • Various algorithms exist, based on
  • Information Theory
  • Machine Learning
  • Neural Networks etc.

12
Assessment of Accuracy
  • Determination method
  • DSSP
  • Performance Metric
  • Q3 accuracy
  • Three state accuracy (helix/strand/loop)
  • Data set selection
  • Non-redundancy
  • Homology Information
  • Multiple Sequence Alignments
  • Cross-Validation
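The Q3 metric named above is the percentage of residues whose three-state label is predicted correctly. A minimal sketch, assuming one-letter labels H (helix), E (strand) and L (loop); the letters and the function name are illustrative choices, not taken from the slides:

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose three-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    # Count positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Six of the eight positions agree.
score = q3_accuracy("HHHLLEEL", "HHLLLEEE")
```

In this toy case `score` is 75.0.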

13
Two Levels of Prediction
Sequence
  • First Level
  • Sequence to Structure
  • Input
  • Sequence Information
  • Multiple Sequence Alignments
  • Method
  • Machine Learning
  • Neural Networks
  • Output
  • Secondary Structure

[Diagram: MSA -> Sequence to Structure -> Secondary Structure]
14
Two Levels of Prediction
Secondary Structure
  • Second Level
  • Structure to Structure
  • Input
  • Structure Information
  • Method
  • Machine Learning
  • Neural Networks
  • Filter
  • Simple Filters
  • Jury Decisions
  • Output
  • Secondary Structure

[Diagram: Structure to Structure -> Filter -> Secondary Structure]
15
GORV
[Diagram: GORV pipeline (Garnier et al., 2002). Labels: Sequence, PSI-BLAST, Information Function / Bayesian Statistics, Majority Vote, Filter, Secondary Structure. Values shown: 6.5, 66.9, 73.4]
16
PHD
[Diagram: PHD pipeline (Rost & Sander, 1993). Labels: Neural Network, Jury Filter, Secondary Structure. Values shown: 4.3, 62.6 / 67.4, 3.4, 70.8, 61.7 / 65.9]
17
JNet
[Diagram: JNet pipeline (Cuff & Barton, 2000). Labels: Profile, PSIBLAST, HMMER2, CLUSTALW, Neural Network, Neural Network, Jury, Jury Network, Secondary Structure. Value shown: 76.9]
18
PSIPRED
[Diagram: PSIPRED pipeline (Jones, 1999). Labels: Profiles, PSI-BLAST, Neural Network, Neural Network, Secondary Structure. Value shown: 76.3]
19
Decision Lists
  • Machine Learning method
  • Simply, a list of rules
  • Each rule asserts a guess
  • Generalization by simple rule pruning
  • Output is human readable/understandable

20
Sample Decision List
  • LeftHelix and RightHelix -> Helix
  • LeftStrand and RightStrand -> Strand
  • LeftHelix and RightLoop -> Helix
  • Else -> Loop
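The sample list can be read as an ordered chain of tests where the first matching rule decides. A minimal sketch, assuming LeftHelix and RightHelix mean that the left and right neighbours of a residue are currently labelled helix (the slide does not define these features, and the function name is hypothetical):

```python
def sample_decision_list(left, right):
    """Apply the sample rules in order; the first rule that matches decides."""
    if left == "Helix" and right == "Helix":
        return "Helix"
    if left == "Strand" and right == "Strand":
        return "Strand"
    if left == "Helix" and right == "Loop":
        return "Helix"
    return "Loop"  # the "Else" rule catches everything not matched above


prediction = sample_decision_list("Helix", "Loop")
```

Because the rules are tried in order, swapping two rules can change the prediction, which is what distinguishes a decision list from an unordered rule set.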
21
GDL
  • Greedy Decision List
  • Start with a global (base) rule
  • At every step
  • Find the maximum gain rule
  • Append to previous list
  • Stop when gain change is 0
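The greedy loop above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: a rule tests a single (position, residue) pair in a frame, a newer rule overrides older ones, and gain is the net number of training points the rule would newly get right. The names `predict` and `learn_gdl` are hypothetical.

```python
from collections import Counter


def predict(rules, default, frame):
    """Return the label of the newest matching rule, else the base-rule label."""
    for (pos, residue), label in reversed(rules):
        if frame[pos] == residue:
            return label
    return default


def learn_gdl(frames, labels, max_rules=20):
    """Greedily add the rule with the largest gain until no rule helps."""
    default = Counter(labels).most_common(1)[0][0]  # base rule: most frequent label
    rules = []
    for _ in range(max_rules):
        correct_now = Counter()  # per condition: matched points already correct
        by_label = Counter()     # per (condition, label): matched points with that label
        for frame, truth in zip(frames, labels):
            guess = predict(rules, default, frame)
            for cond in enumerate(frame):  # cond is a (position, residue) pair
                by_label[cond, truth] += 1
                if guess == truth:
                    correct_now[cond] += 1
        if not by_label:
            break
        # Gain = points the rule gets right minus points it would take over
        # that are already right.
        (cond, label), hits = max(
            by_label.items(), key=lambda kv: kv[1] - correct_now[kv[0][0]]
        )
        if hits - correct_now[cond] <= 0:
            break  # no candidate rule improves the training accuracy
        rules.append((cond, label))
    return default, rules


frames = ["AL", "AV", "GL", "GV", "GL"]
labels = ["H", "H", "L", "L", "L"]
default, rules = learn_gdl(frames, labels)
```

On this toy data the base rule predicts L, and one rule (residue A at position 0 gives H) is enough to classify every training frame correctly, so the search stops.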

22
Rule Search
  • Initially everything is predicted to be the
    most frequent structure (i.e. loop)

[Figure: the training set partitioned with respect to the base rule into correct and false assignments]
23
Rule Search
  • At each step add the maximum gain rule


[Figure: the training set partitioned with respect to the base rule, then with respect to the second rule]
24
Data Representation
  • Frames of length W
  • Context of an amino acid is represented by W
    residues
  • (W-1)/2 to the left, (W-1)/2 to the right
  • If the frame extends past the termini, the missing
    positions are represented as NAN
  • GLX -> GLN, ASX -> ASN
  • Newly found/non-natural amino acids -> X

25
Sample Data
  • evealekkvaaLesvqalekkvealehg
  • Frame Size 5
  • Represents the features used in the prediction of
    secondary structure for L (leucine)
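Frame extraction for this sample can be sketched as below, using the sequence and frame size from the slide. The helper name `frame_at` and the literal padding token "NAN" are illustrative assumptions:

```python
PAD = "NAN"  # placeholder for positions that fall outside the sequence


def frame_at(sequence, index, width):
    """Return the frame of `width` residues centred on `index`, padded at the termini."""
    if width % 2 == 0:
        raise ValueError("frame width must be odd so the residue sits in the centre")
    half = (width - 1) // 2
    return [
        sequence[i] if 0 <= i < len(sequence) else PAD
        for i in range(index - half, index + half + 1)
    ]


sequence = "evealekkvaaLesvqalekkvealehg"
center = sequence.index("L")              # the capitalised leucine
frame = frame_at(sequence, center, 5)     # two residues on each side
first_frame = frame_at(sequence, 0, 5)    # runs past the N-terminus
```

Here `frame` holds the residues a, a, L, e, s, and `first_frame` starts with two padding tokens because the window runs past the start of the sequence.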

26
GDL
  • DSSP assignments
  • Reduction
  • E (extended strand), B (β-bridge) -> Strand
  • H (α-helix), G (3-10 helix) -> Helix
  • Others -> Loop
  • Data set
  • CB513 set
  • 7-fold cross-validation
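The reduction from DSSP codes to three states is a simple lookup. The sketch below also shows one way to form seven folds; the slides do not say how the CB513 proteins were assigned to folds, so the round-robin split in `k_folds` is an assumption for illustration only.

```python
DSSP_TO_THREE_STATE = {
    "E": "Strand",  # extended strand
    "B": "Strand",  # beta-bridge
    "H": "Helix",   # alpha-helix
    "G": "Helix",   # 3-10 helix
}


def reduce_dssp(states):
    """Map each DSSP code to Helix, Strand, or Loop (everything else)."""
    return [DSSP_TO_THREE_STATE.get(s, "Loop") for s in states]


def k_folds(items, k=7):
    """Deal the items into k folds round-robin (an illustrative split, not the original one)."""
    return [items[i::k] for i in range(k)]


reduced = reduce_dssp("HGEBTS")
folds = k_folds(list(range(10)))
```

Each fold would serve once as the test set while the remaining six are used for training.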

27
2-level Algorithm
  • Sequence to Structure List
  • Find the first rule that matches the data point
  • Assign the output of that rule
  • A frame of 9 residues is input
  • Output Secondary Structure
  • Structure to Structure List
  • After all predictions are made, check for
    possible improvements
  • A frame of 19 secondary structures is input
  • Output Secondary Structure
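The two levels can be chained as in the sketch below. The frame widths of 9 and 19 come from the slide; the two classifiers are passed in as plain functions, and the toy classifiers in the usage example are hypothetical stand-ins for learned decision lists.

```python
PAD = "NAN"  # padding token for positions beyond the termini


def windows(items, width):
    """Return one frame of `width` items centred on each position, padded at the ends."""
    half = (width - 1) // 2
    padded = [PAD] * half + list(items) + [PAD] * half
    return [padded[i:i + width] for i in range(len(items))]


def two_level_predict(sequence, seq_to_struct, struct_to_struct):
    """First predict from residue frames, then refine from frames of those predictions."""
    first = [seq_to_struct(frame) for frame in windows(sequence, 9)]
    return [struct_to_struct(frame) for frame in windows(first, 19)]


# Toy stand-ins: call "a" helix, then smooth a label whose two neighbours agree.
def toy_level1(frame):
    return "H" if frame[4] == "a" else "L"


def toy_level2(frame):
    left, centre, right = frame[8], frame[9], frame[10]
    return left if left == right and left != PAD else centre


result = two_level_predict("aaala", toy_level1, toy_level2)
```

In this toy run the first level yields H, H, H, L, H, and the second level replaces the isolated L with H because both of its neighbours are helix.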

28
GDL
[Diagram: GDL pipeline. Labels: Sequence, PSI-Blast, GDL, GDL, Secondary Structure. Values shown: 6.67, 60.48, 62.54 / 69.21]
29
GDL Performance
  • Performance of first level decision list
  • Without homologs: 60.48% (29 to 66 rules)
  • With homologs: 66.36% (46 to 68 rules)
  • Performance of second level decision list
  • Without homologs: 62.54% (18 to 116 rules)
  • With homologs: 69.21% (16 to 40 rules)

30
GDL Performance
  • Performance at 20 rules at both steps
  • Without homologs: 62.15%
  • With homologs: 69.08%
  • Possible to make a back-of-the-envelope structure
    prediction using our model

31
GDL/PHD/GORV
32
Discussion - Why GDL?
  • Amazingly simple models
  • With as few as 20 rules in the first level and as
    few as 20 rules in the second
  • Rules (Models) are human-readable
  • Biological rules may be inferred
  • Second level decision list may be used as a
    filter for other algorithms

33
GDL Rules
  • The first three rules of the sequence-to-structure
    level
  • 58.86% accuracy (of 66.36%)
  • First Rule
  • Everything -> Loop
  • Default rule

34
GDL Rule 2
35
GDL Rule 3
36
A Performance Bound Claim
  • Using only sequence information, the highest
    achievable performance has an upper bound
  • The lower bound
  • 43%, with everything assigned as loop
  • 49%, with every residue assigned the most
    probable structure
  • The upper bound
  • 75%, with non-homologous data

37
A Performance Bound Claim
  • Bound is calculated by
  • Taking only the exact sequence matches in the
    training and testing sets
  • Assign the most frequent value of that frame in the
    training set as the guess
  • Compare with actual value
  • A bound for non-homologous training and testing
    sets
  • A bound for carefully selected frame size
  • Not too short (assignments would be almost
    random)
  • Not too long (only unique frames will be
    available)
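The bound calculation described above can be sketched as a lookup table built from the training frames. This is a reading of the slide, not the presenter's code: frames are assumed to be hashable (strings or tuples), test frames with no exact match in the training set are skipped, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def exact_match_bound(train_frames, train_labels, test_frames, test_labels):
    """Score test frames that also occur in training, guessing their most frequent training label."""
    seen = defaultdict(Counter)
    for frame, label in zip(train_frames, train_labels):
        seen[frame][label] += 1

    matched = correct = 0
    for frame, label in zip(test_frames, test_labels):
        if frame in seen:  # only exact sequence matches are scored
            matched += 1
            guess = seen[frame].most_common(1)[0][0]
            correct += guess == label
    accuracy = 100.0 * correct / matched if matched else None
    return accuracy, matched


accuracy, matched = exact_match_bound(
    ["abc", "abc", "abc", "xyz"], ["H", "H", "L", "E"],
    ["abc", "abc", "xyz", "qqq"], ["H", "L", "E", "H"],
)
```

The frame-size trade-off shows up directly in `matched`: long frames leave few test frames with an exact match in the training set, while short frames match often but carry little information about the label.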

38
Upper Bound with Homologs
39
Upper Bound without Homologs
40
Tertiary Structure Prediction
  • Predictions based on backbone dihedral angles
  • Phi and Psi angles fully define the tertiary
    structure
  • Goal
  • Discover the right level of granularity

41
Data Set Selection
  • PDB-Select
  • A set of non-homologous proteins of high
    resolution (Hobohm & Sander, 1994)
  • Data representation
  • Frames of 9 residues
  • Residue names plus residue properties
  • Hydrophobicity, polarity, volume, charge etc.
  • Train/Validation/Test

42
Data Discretization
  • Phi/Psi angles are continuous
  • We need a discrete representation to predict them
    in a decision list
  • Split the (-180, 180) region into bins
  • Split the Ramachandran plot into bins
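One simple way to discretize the angles is equal-width binning, sketched below. Equal-width bins and the function names are assumptions; the slides do not state how the bins were chosen, and the number of bins is the granularity parameter to be explored.

```python
def angle_bin(angle, n_bins):
    """Index of the equal-width bin containing an angle given in degrees."""
    width = 360.0 / n_bins
    shifted = (angle + 180.0) % 360.0  # map to [0, 360); +180 wraps around to -180
    return min(int(shifted // width), n_bins - 1)  # guard against rounding at the edge


def ramachandran_bin(phi, psi, n_bins):
    """Cell of an n_bins x n_bins grid over the Ramachandran plot."""
    return angle_bin(phi, n_bins), angle_bin(psi, n_bins)


cell = ramachandran_bin(-60.0, 120.0, 6)  # 60-degree bins
```

With six bins per axis, `cell` is (2, 5). Each cell, or each single-angle bin, then serves as a discrete class label for the decision list.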

43
Ramachandran Plot (1)
44
Ramachandran Plot (2)
Karplus, 1996
45
How to Predict?
  • Predictions using sequence information
  • No homology information
  • Predicted angles may be incorporated
  • Upper bounds will be given
  • Accuracy
  • Percent of correct estimates
  • RMSD of phi and psi angles
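Both metrics can be sketched in a few lines. The slides do not say how the angular RMSD was computed; the version below assumes differences are wrapped so that 170 and -170 degrees count as 20 degrees apart, and the function names are illustrative.

```python
import math


def angular_difference(a, b):
    """Signed difference between two angles in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def angle_rmsd(predicted, actual):
    """Root-mean-square deviation between predicted and actual angles, in degrees."""
    diffs = [angular_difference(p, a) for p, a in zip(predicted, actual)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def bin_accuracy(predicted_bins, actual_bins):
    """Percentage of residues whose predicted angle bin is the correct one."""
    correct = sum(p == a for p, a in zip(predicted_bins, actual_bins))
    return 100.0 * correct / len(actual_bins)


rmsd = angle_rmsd([170.0, 10.0], [-170.0, 10.0])
accuracy = bin_accuracy([1, 2, 3, 4], [1, 2, 0, 4])
```

Accuracy rewards only exact bin hits, while RMSD also reflects how far a wrong prediction is from the true angle, so the two can rank binning schemes differently.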

46
Using Predicted Angles
47
Performance Accuracy
48
Performance RMSD
49
Performance Backbone RMSD
50
Performance Input Features
51
Performance Real Prediction
52
Future Work
  • For tertiary structure predictions
  • The two-leveled approach may be applied to
    tertiary structure predictions
  • Homology information may be incorporated
  • For secondary structure predictions
  • Should find better homologues and better
    representations
  • Incorporating sequence and homology information
    in the structure to structure part may be an
    option
  • For both predictions
  • A reliability index for predicted structure