The Domain Structure of Proteins: Prediction and Organization

1
The Domain Structure of Proteins: Prediction and
Organization
  • Golan Yona
  • Dept. of Computer Science
  • Cornell University
  • (joint work with Niranjan Nagarajan)

2
PDB 1a8y (367 aa long):
MKIIRIETSRIAVPLTKPFKTALRTVYTA
ESVIVRITYDSGAVGWGEAPPTLVITGDSM
3
The domain structure of a protein
  • A domain is considered the fundamental unit of
    protein structure, folding, function, evolution
    and design.
  • Compact
  • Stable
  • Folds independently?
  • Has a specific function

4
A protein is a combination of domains
[Figure: Protein1, Protein2 and Protein3, each drawn as a combination of domains]
5
Any signals that might indicate domain boundaries?
  • A very weak signal, if any, in the sequence
  • Usually domain delineation is done based on
    structure
  • Best methods available: manual!
  • But structural information is sparse

6
Definitions and assumptions
  • Domain: a continuous sequence that corresponds to
    an elemental building block of protein folds.
  • A subsequence that is likely to be stable as an
    independent folding unit.
  • Was formed as an independent unit, and later was
    combined with others to perform more complex functions.
  • There are traces of the autonomous units

7
First step..
  • Gather data: database search
  • Histogram of matches is informative but noisy
  • Sources of noise: mutations, insertions, deletions,
    conflicting evidence

[Figure: histogram of matches along the sequence]
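
The histogram of matches can be sketched as a per-position count of how many database hits cover each residue of the query. This is a minimal illustration, not the authors' code: the function name and the simplified `(start, end)` hit format are assumptions made here.

```python
def participation_histogram(query_length, hits):
    """Count, for every query position, how many database hits cover it.

    `hits` is a list of (start, end) pairs, 1-based and inclusive, giving the
    query range aligned to each hit (a hypothetical, simplified hit format).
    """
    # Difference array: +1 where a hit starts, -1 just past where it ends.
    delta = [0] * (query_length + 2)
    for start, end in hits:
        delta[start] += 1
        delta[end + 1] -= 1
    counts, running = [], 0
    for pos in range(1, query_length + 1):
        running += delta[pos]
        counts.append(running)
    return counts

# Toy data: two hits cover only the first half, one covers only the second half.
hits = [(1, 5), (1, 5), (1, 10), (6, 10)]
histogram = participation_histogram(10, hits)
print(histogram)
```

In this toy case the count drops between positions 5 and 6, which is the kind of step a domain boundary might produce. Real histograms are much noisier, as the slide notes.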
8
Previous methods
  • Methods based on the use of similarity searches
    and knowledge of sequence termini to delineate
    domain boundaries using heuristics/rules (MKDOM,
    Domainer, DIVCLUS, DOMO).
  • Methods that rely on expert knowledge of protein
    families to construct models like HMMs to
    identify other members of the family (Pfam,
    TIGRFAMs, SMART).
  • Methods that try to infer domain boundaries by
    using sequence information to predict tertiary
    structure first (SnapDRAGON, Rigden's covariance
    analysis).
  • Methods that use multiple alignments to predict
    domain boundaries (PASS, Domaination).
  • Others (e.g. CSA and DGS guess based on size)

9
How do you evaluate the different methods?
  • No universal measures
  • A variety of qualitative and quantitative
    evaluation criteria, external resources and
    manual analysis are used to verify domain
    boundaries

10
Method outline
  • Source/test data: SCOP
  • Processed data: alignments
  • Learning system
  • Domain-information-content scores
  • Neural network (NN)
  • Probabilistic model
  • Evaluation
  • Reference: "A Multi-Expert System for the Automatic
    Detection of Protein Domains from Sequence
    Information", Niranjan Nagarajan and Golan Yona,
    in the proceedings of RECOMB 2003

11
Overview
[Figure: pipeline overview. Labeled components: Intron Boundaries, DNA Data, Seed Sequence, BLAST Search, Sequence Participation, Multiple Alignment, Secondary Structure, Entropy, Neural Network, Correlation, Contact Profile, Physio-Chemical Properties, Final Predictions]
12
The source/test data set
  • PDB structures with their partitions into domains
    as defined in SCOP
  • Example: 1ctf, domain 1: 1-76, domain 2: 77-123
  • Remove sequences shorter than 40 aa and almost
    identical entries

13
Alignments
  • Search each query against a database of 1
    million non-redundant sequences
  • Remove fragments first
  • Two-phase alignment procedure
  • First phase: BLAST
  • Second phase: multiple-iteration PSI-BLAST
  • Select one representative from each group of
    similar proteins
  • Remove proteins that are less than 90% covered
    (missing information)
  • Number of domains ranges from 1 to 7
  • Final set: 605 multi-domain proteins and 576
    single-domain proteins (1/4)

14
The domain-information-content of an alignment
column
  • Measures that are believed to reflect
    structural properties of proteins
  • A total of 20 measures
  • Conservation measures
  • Consistency and correlation measures
  • Measures of structural flexibility
  • Residue type based measures
  • Predicted secondary structure information
  • Intron-exon data

15
Conservation measures
  • Entropy: some positions are more conserved than
    others
  • Class entropy: some positions have a preference
    towards a class of amino acids (similar
    physio-chemical properties)
  • Evolutionary pressure (span): sum of pairwise
    similarities
  • Motivation: consider the mutual similarity of
    amino acids
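
As a rough illustration of the first two conservation measures, the sketch below computes the Shannon entropy of an alignment column over residues and over residue classes. The class grouping in `AA_CLASSES` is an assumption made for this example (the slides do not list the classes used), and gaps are simply ignored.

```python
import math
from collections import Counter

# Illustrative grouping by physico-chemical class (an assumption, not from the slides).
AA_CLASSES = {
    "hydrophobic": "AVLIMC",
    "aromatic": "FWY",
    "polar": "STNQ",
    "positive": "KRH",
    "negative": "DE",
    "special": "GP",
}
AA_TO_CLASS = {aa: name for name, members in AA_CLASSES.items() for aa in members}

def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def column_entropy(column):
    """Entropy over the residues of one alignment column, ignoring gaps ('-')."""
    return shannon_entropy([aa for aa in column if aa != "-"])

def class_entropy(column):
    """Entropy over residue classes, so substitutions within a class cost nothing."""
    return shannon_entropy([AA_TO_CLASS[aa] for aa in column if aa in AA_TO_CLASS])

column = "LLII"  # toy column: two leucines, two isoleucines
print(column_entropy(column), class_entropy(column))
```

The toy column is variable at the residue level but fully conserved at the class level, which is the distinction the class-entropy bullet draws.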

16
Consistency and correlation measures
  • All appearances of a domain should maintain its
    integrity
  • Consistency: difference in sequence counts
  • Asymmetric correlation: consistency of individual
    sequences
  • Symmetric correlation: reinforcement by missing
    sequences
  • Measures are averaged over a window

17
Consistency and correlation measures cont.
  • Sequence termination: a strong but elusive signal
  • Fragments
  • Premature halt in alignment
  • Loosely aligned
  • Product of left and right termination scores,
    given c sequences that terminate at a position
    with e-values e1, e2, e3, ..., ec

19
Measures of structural flexibility
  • Indel entropy: variability indicates structural
    flexibility (likely to occur near domain
    boundaries)
  • Correlated mutations: indicative of contacts
    (contact profiles)

20
Contact profile
21
Residue type based measures
  • Hydrophobic vs. hydrophilic
  • Cysteines and prolines
  • Classes of amino acids

Predicted secondary structures
  • Helices and strands are rigid
  • Loops are more abundant near domain boundaries

22
Intron-exon data
  • Exon boundaries are expected to coincide with
    domain boundaries

[Figure: Protein1, Protein2 and Protein3 with segments numbered 1, 2 and 3]
23
Score refinement and normalization
  • Smoothing using a window w (optimized)
  • Unification to a single scale: z-score over all
    positions
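
The two refinement steps can be sketched as a moving average followed by a z-score transform. This is an illustration under assumptions: here `w` is treated as the number of positions on each side of the center, the window is truncated at the sequence ends, and the population standard deviation is used. The slides do not specify these details.

```python
import statistics

def smooth(scores, w):
    """Moving average with w positions on each side, truncated at the ends."""
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - w): i + w + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def zscores(scores):
    """Rescale scores to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return [0.0] * len(scores)  # constant score: nothing to rescale
    return [(s - mean) / sd for s in scores]

raw = [0, 0, 3, 0, 0]          # toy per-position scores for one measure
smoothed = smooth(raw, 1)
normalized = zscores(smoothed)
print(smoothed, normalized)
```

Putting every measure on the same z-score scale is what allows the different scores to be fed jointly to a single learning system.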

24
Maximizing the information content of scores
  • Opt for the most distinct distributions of domain
    positions vs. boundary positions
  • Affected by the parameters w (smoothing factor)
    and x (boundary window size)
  • Use the Jensen-Shannon divergence measure
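
A minimal sketch of the Jensen-Shannon divergence between two discrete distributions, for example the binned score histograms of domain positions and of boundary positions. Equal weights for the two distributions are assumed here; the slides do not say which weighting was used.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence with equal weights (0 = identical, 1 bit = disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy histograms over three score bins (made-up numbers).
domain_hist = [0.7, 0.2, 0.1]
boundary_hist = [0.1, 0.2, 0.7]
print(jensen_shannon(domain_hist, boundary_hist))
```

In this setting one would recompute the two histograms for each candidate pair (w, x) and keep the pair that gives the largest divergence.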

25
Examples
26
  • Even measures with identical distributions may be
    informative in a multivariate model
  • To simplify the model, only the top 12 are selected

27
The learning system
  • A neural network is trained to model effectively
    the complex decision boundary surface
  • Correctly predicts 94% of domain positions and
    88% of the transitions in the test set
  • Also tried mapping from multiple positions (local
    input neighborhood) to single/multiple outputs

28
Overview
[Figure: pipeline overview, repeated from slide 11]
29
Hypothesis evaluation
  • Simple model: refine predictions
  • A significant fraction of the positions in a window
    centered at x should be predicted as transitions
  • Order transitions by their quality (depth of the
    minima) and reject all transitions that are
    within 30 residues of already predicted
    transitions
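
The ordering-and-rejection rule can be sketched as a greedy filter. The `(position, depth)` candidate format and the function name are assumptions made for this illustration; only the 30-residue exclusion comes from the slide.

```python
def select_transitions(candidates, min_separation=30):
    """Greedily keep the deepest minima, rejecting any candidate that lies
    within `min_separation` residues of a transition that was already kept.

    `candidates` is a list of (position, depth) pairs, where a larger depth
    means a better-quality minimum (a hypothetical input format).
    """
    accepted = []
    for position, _depth in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(abs(position - kept) > min_separation for kept in accepted):
            accepted.append(position)
    return sorted(accepted)

# Toy candidates: position 110 is only 10 residues from the deeper minimum at 100.
candidates = [(100, 0.9), (110, 0.8), (200, 0.5), (131, 0.4)]
selected = select_transitions(candidates)
print(selected)
```

Here position 110 is dropped because a deeper minimum at 100 was accepted first, while 131 survives because it is 31 residues away.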

30
The domain generator model
  • Multiple hypotheses: find the best one
  • Assume a model: a random generator that moves
    repeatedly between a domain state and a linker
    state and emits one domain or transition at a
    time according to different source probability
    distributions.
  • Total probability is the product

31
Formally..
  • [Figure: sequence S partitioned into D1, D2, ..., Dn]
  • We are given a sequence S (multiple alignment) of
    length L and a possible partition into n domains
    D = D1, D2, ..., Dn of lengths l1, l2, ..., ln (NN output)
  • Find the partition that maximizes the
    posterior probability P(D|S)
  • Maximize the product of the likelihood and the
    prior
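
The step from the posterior to the product of likelihood and prior is Bayes' rule. It is written out here as a standard identity rather than as a formula copied from the slides. The evidence term P(S) does not depend on the partition D, so it drops out of the maximization:

```latex
\hat{D} \;=\; \arg\max_{D} P(D \mid S)
        \;=\; \arg\max_{D} \frac{P(S \mid D)\, P(D)}{P(S)}
        \;=\; \arg\max_{D} P(S \mid D)\, P(D)
```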

32
Calculating the prior P(D)
  • For an arbitrary protein of length L, what is the
    probability of observing D?
  • Approximate using a simplified model: given the
    length of the protein, the generator selects the
    number of domains first and then selects the
    length of one domain at a time, considering the
    domains that were already generated.

33
The prior probabilities
  • Approximate P0(li|L) by P0(li) normalized to the
    relevant range.
  • P0(li|L) is derived based on experimental data

34
The prior probabilities (cont.)
  • Calculate Prob(n|L) = Prob(n,L)/P(L)
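
One way to estimate this conditional probability from data is by counting, since Prob(n,L)/P(L) reduces to count(n,L)/count(L). The sketch below does that with proteins binned by length. The binning, the bin size and the `(length, n_domains)` training format are assumptions made for illustration; the slides do not describe how the estimate was obtained.

```python
from collections import Counter

def domain_count_prior(training, bin_size=50):
    """Build an empirical estimate of Prob(n | L) from (length, n_domains) pairs.

    Lengths are grouped into bins of `bin_size` residues (an assumption) so
    that each bin holds enough proteins to count.
    """
    joint = Counter((length // bin_size, n) for length, n in training)
    marginal = Counter(length // bin_size for length, _ in training)

    def prob(n, length):
        b = length // bin_size
        if marginal[b] == 0:
            return 0.0  # no training proteins of comparable length
        return joint[(b, n)] / marginal[b]

    return prob

# Made-up training set of (protein length, number of domains).
training = [(120, 1), (130, 1), (140, 2), (310, 3)]
prior = domain_count_prior(training)
print(prior(1, 125), prior(2, 125), prior(3, 320))
```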

35
The likelihood
  • Use probabilities of observed scores considering
    the two different sources
  • The model D partitions the sequence S into n
    domains and n-1 transitions D1, T1, D2, T2, ..., Tn-1, Dn
    that correspond to the subsequences
    s1, t1, s2, t2, ..., tn-1, sn
  • Assume domains are independent of each other
    (additional test can be used)

36
The likelihood (cont.)
  • Each term P(si|Di) and P(tj|Tj) is a product over
    the probabilities of the individual positions;
    each one is estimated by the joint probability
    distribution of the 12 features
  • How to estimate this probability? (The independence
    assumption does not hold)

38
Likelihood of individual position
  • Given k random variables X1, X2, ..., Xk, estimate
    their joint probability distribution
  • Use first-order dependencies
  • For each pair, calculate the distance between the
    joint probability distribution and the product of the
    marginal distributions
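
The slide does not say which distance is used. One common choice, assumed here, is the Kullback-Leibler divergence between the joint distribution and the product of the marginals, which is the mutual information of the pair. The sketch works on discrete values, so continuous feature scores would first have to be binned (also an assumption).

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """KL divergence (bits) between the empirical joint distribution of
    (x, y) and the product of its marginals; 0 means no dependency observed."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Toy binned features: the second copies the first, the third is unrelated to it.
f1 = [0, 0, 1, 1]
f2 = [0, 0, 1, 1]
f3 = [0, 1, 0, 1]
print(mutual_information(f1, f2), mutual_information(f1, f3))
```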

39
  • Sort all pairs based on their dependency, and
    pick the most dependent one (denoted by Y1, Y2)
    and start the expansion
  • Select the next one based on the strongest
    dependency with variables that are already in the
    expansion

40
  • Denote by Z = PILLAR(Y) the random variable that Y
    is most dependent on
  • Of all possible dependencies involving Y3, pick
    P(Y3|Z) and add it to the expansion
  • Proceed until you exhaust all variables
  • Maximize support, minimize error
  • The expansion is different for domain and
    transition regions
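
The greedy construction described on the last two slides can be sketched as follows. The function name, the `dep` callback and the toy scores are assumptions for illustration. The result is read as a product of first-order terms, for example P(A) P(B|A) P(C|B) P(D|C).

```python
def first_order_expansion(variables, dep):
    """Greedily order variables into a first-order expansion.

    `dep(a, b)` returns a symmetric dependency score for a pair of variables.
    Returns a list of (variable, pillar) pairs; the pillar is the variable it
    is conditioned on, or None for the first variable in the expansion.
    """
    pairs = [(a, b) for i, a in enumerate(variables) for b in variables[i + 1:]]
    y1, y2 = max(pairs, key=lambda p: dep(*p))      # most dependent pair starts it
    expansion = [(y1, None), (y2, y1)]
    included = [y1, y2]
    remaining = [v for v in variables if v not in included]
    while remaining:
        # Strongest dependency between a new variable and one already included.
        y, z = max(((v, u) for v in remaining for u in included),
                   key=lambda p: dep(*p))
        expansion.append((y, z))
        included.append(y)
        remaining.remove(y)
    return expansion

# Made-up pairwise dependency scores for four hypothetical features.
scores = {
    frozenset("AB"): 0.9, frozenset("AC"): 0.1, frozenset("AD"): 0.2,
    frozenset("BC"): 0.5, frozenset("BD"): 0.05, frozenset("CD"): 0.7,
}
expansion = first_order_expansion(list("ABCD"), lambda a, b: scores[frozenset((a, b))])
print(expansion)
```

Because the dependency scores are estimated separately from domain positions and from transition positions, running this twice would generally give two different expansions, in line with the last bullet.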

41
Finally..
  • Enumerate all possible hypotheses, calculate the
    posterior probability for each one, and output
    the one that maximizes the probability
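
A minimal sketch of the exhaustive search, assuming a hypothesis is a subset of the candidate transition points and that the score is computed in log space. The `log_likelihood` and `log_prior` callbacks and the toy numbers are placeholders invented for this example, not the model described in the slides.

```python
from itertools import combinations

def best_partition(candidates, log_likelihood, log_prior):
    """Enumerate every subset of candidate transitions and return the one
    with the highest log-posterior (up to the constant log P(S))."""
    best, best_score = None, float("-inf")
    for k in range(len(candidates) + 1):
        for subset in combinations(sorted(candidates), k):
            score = log_likelihood(subset) + log_prior(subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score

# Toy scoring (made-up numbers): each transition has a likelihood gain,
# and the prior charges a fixed cost per transition.
gain = {75: 3.0, 150: 0.5, 270: 2.0}
best, score = best_partition(
    list(gain),
    log_likelihood=lambda t: sum(gain[p] for p in t),
    log_prior=lambda t: -1.0 * len(t),
)
print(best, score)
```

The number of subsets grows as 2^k in the number of candidates, so this is practical only because the earlier filtering step leaves few candidate transitions per protein.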

42
Summary of results
  • Distance accuracy: average distance of the
    predicted transitions from their associated SCOP
    transition points.
  • Distance sensitivity: average distance of SCOP
    transitions from their associated predicted
    transition points.
  • Selectivity: percentage of correct predictions
    (within 10 residues of SCOP transitions)
  • Coverage: percentage of correctly identified SCOP
    transitions (within 10 residues of predicted
    transitions)
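
The four measures can be sketched for a single protein as below. "Associated" is taken to mean "nearest", and a transition point is taken to be the last residue of a domain; both are assumptions. The numbers are read off the 2gep example in this deck.

```python
def nearest_distance(point, others):
    """Distance from `point` to the closest element of `others`."""
    return min(abs(point - o) for o in others)

def evaluate(predicted, reference, tolerance=10):
    """Return the four summary measures for one protein.

    Assumes both lists are non-empty; "associated" is read as "nearest".
    """
    pred_d = [nearest_distance(p, reference) for p in predicted]
    ref_d = [nearest_distance(r, predicted) for r in reference]
    return {
        "distance_accuracy": sum(pred_d) / len(pred_d),
        "distance_sensitivity": sum(ref_d) / len(ref_d),
        "selectivity": 100.0 * sum(d <= tolerance for d in pred_d) / len(pred_d),
        "coverage": 100.0 * sum(d <= tolerance for d in ref_d) / len(ref_d),
    }

# 2gep: SCOP domains end at 72, 272, 352; predicted domains end at 75, 270, 352.
scop = [72, 272, 352]
predicted = [75, 270, 352]
result = evaluate(predicted, scop)
print(result)
```

For this example all three predicted transitions fall within 10 residues of a SCOP transition, so selectivity and coverage are both 100%.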

43
Examples
  • PDB ID: 2gep
  • Domain definition: 8-72, 73-272, 273-352,
    353-497
  • Predicted domains: 1-75, 76-270, 271-352,
    353-497
  • Pfam definition: 1-67, 273-345, 356-425

44
Examples
  • PDB ID: 1b6s, chain D
  • Domain definition: 1-78, 79-276, 277-355
  • Predicted domains: 1-73, 74-271, 272-355
  • Pfam definition: 30-167

45
Examples
  • PDB ID: 1acc
  • Domain definition: 14-735
  • Predicted domains: 1-158, 159-583, 584-735
  • Pfam definition: 103-544

46
Conclusions
  • A method for predicting the domain structure of a
    protein from sequence information alone
  • Protein/DNA data, multiple features, optimization
    based on information-theoretic principles, a learning
    system, and final prediction using the
    domain-generator model (with confidence values).
  • Exhaustive hypothesis evaluation
  • Fully automatic and fast
  • Performs very well even compared to the best
    manual and semi-manual methods out there (also on
    CATH data)
  • Dare we say it can be used to verify domain
    assignments based on structural data
  • Possible improvements: other learning systems, more
    features

47
Acknowledgments
  • Niranjan Nagarajan
  • SCOP
  • CATH
  • PSI-BLAST
  • Pfam
  • InterPro
  • NSF