Identifying functional residues of proteins from sequence info


Slides: 12
Provided by: andrewwill

Transcript and Presenter's Notes



1
  • Identifying functional residues of proteins from
    sequence info
  • Using MSA (multiple sequence alignment)
  • - search for remote homologs using HMMs or
    profiles
  • Remote homologs with no known structure
  • Given a large, diverse superfamily
  • protein may evolve different function or subtype
  • different substrate specificity or activity
  • proteins with similar fold but different
    function
  • Past methods used phylogenetic trees
  • map unknown protein to one of the branches of
    the tree produced
  • but may have diverged too long ago to be clearly
    identified
  • co-evolution of multiple features
  • possible convergent evolution of molecular
    function at aa level

2
  • Other methodologies
  • Analysis/prediction of subtype from sequence
    alignments
  • characterization of aa residues, looking for
    significant substitutions
  • gathering sequences into subgroups, comparing
    each subgroup
  • Principal component analysis (Casari et al, 1995)
  • looks for functional residues conserved in
    protein families
  • Evolutionary Trace (Lichtarge et al)
  • Phylogenetic Inference (Sjolander et al)

3
  • Goal: identify regions conferring sub-family
    specificity
  • Secondary goal: predict subtypes of orphan
    sequences
  • Input to algorithm
  • multiple sequence alignment (MSA) of sequences
    in a protein family
  • classification of subfamilies of sequences from
    above MSA
  • For the given subtypes (or subfamilies) provided
  • get the MSA subalignment for each subfamily
  • build an HMM profile for each sub-family MSA
  • Rationale: generate pseudocounts and account for
    statistical bias
  • For each subalignment profile
  • The profile value for amino acid x at position i
    in subfamily j is the probability of finding x
    at position i in subfamily j; summed over all
    amino acids at a given position, the values equal 1
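
The per-subfamily profiles described above could be sketched as follows. This is only an illustration: the slide builds HMM profiles, whereas this sketch swaps in plain column frequencies with an add-one pseudocount. The names column_profile and subfamily_profiles, the pseudocount value, and the toy alignment are assumptions, not part of the original method.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column, pseudocount=1.0):
    """Return {amino acid: probability} for one alignment column.

    Gaps and unknown symbols are ignored. Every amino acid receives
    `pseudocount` extra counts so that no probability is zero.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def subfamily_profiles(msa, labels, pseudocount=1.0):
    """Map each subfamily label to a list of per-position profiles."""
    profiles = {}
    for label in sorted(set(labels)):
        # Subalignment: the rows of the MSA that belong to this subfamily.
        rows = [seq for seq, lab in zip(msa, labels) if lab == label]
        profiles[label] = [column_profile(col, pseudocount) for col in zip(*rows)]
    return profiles

# Toy alignment (made up): two subfamilies that differ at positions 0 and 2.
msa = ["ACD", "ACE", "GCD", "GCE"]
labels = ["s1", "s1", "s2", "s2"]
profiles = subfamily_profiles(msa, labels)
```

Each entry profiles[j][i][x] then plays the role of the profile value for amino acid x at position i in subfamily j, and the values at one position sum to 1.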

4
  • Relative Entropy
  • measure of distance between two probability
    distributions
  • Relative entropy produces a value ≥ 0 (a value
    of 0 for two identical distributions)
  • computed for each position i in a subfamily s
  • For each position, an RE value for a subfamily s
    vs s-bar (all other subfamilies)
  • Cumulative Relative Entropy
  • given a set of relative entropies for each
    subfamily for each position
  • combine them to produce a CRE for a given
    position i in the MSA across all subfamilies
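
The formulas for this slide did not survive in the transcript, so the following is a sketch under assumptions: relative entropy is taken in its usual Kullback-Leibler form, and the cumulative value is assumed to be the sum of the per-subfamily relative entropies at a position. The function names and the two-letter toy distributions are illustrative only.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence: sum over x of p[x] * log(p[x] / q[x]).

    Assumes p and q share the same keys and that q[x] > 0 wherever
    p[x] > 0 (pseudocounts in the profiles guarantee this).
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def cumulative_relative_entropy(per_subfamily):
    """Sum RE(s vs s-bar) over all subfamilies for one position.

    `per_subfamily` is a list of (p_s, p_not_s) pairs of distributions.
    Summation is an assumption; the slide does not state the formula.
    """
    return sum(relative_entropy(p, q) for p, q in per_subfamily)

# Toy distributions over a two-letter alphabet (made up).
same = {"A": 0.5, "G": 0.5}
s1 = {"A": 0.9, "G": 0.1}
s2 = {"A": 0.1, "G": 0.9}

re_same = relative_entropy(same, same)                   # identical distributions
cre = cumulative_relative_entropy([(s1, s2), (s2, s1)])  # strongly differing ones
```

Identical distributions give a relative entropy of zero, while a position whose amino acid usage differs between subfamilies gives a clearly positive cumulative value.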

5
  • Given this set of cumulative relative entropy
    measures
  • one for each position in the MSA, take the Z
    score
  • Standard statistical measure: the number of std
    devs above/below the mean
  • tells you which residue positions vary strongly
    in aa distribution between families
  • empirically, Z > 3 correlates with functional
    residue
  • For position i, which amino acid is dominant in a
    given subfamily
  • find probability of observing aa x at position i
    in subfamily s vs not-s
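
The Z-score step can be illustrated with a short sketch. The cumulative relative entropy values below are made up, the helper name z_scores is hypothetical, and the population standard deviation is an assumption (the slide does not say which estimator was used).

```python
import statistics

def z_scores(values):
    """Number of standard deviations each value lies above or below the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumed)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Made-up CRE values: 19 unremarkable positions and one that stands out.
cre_per_position = [0.1] * 19 + [5.0]
z = z_scores(cre_per_position)

# Positions above the empirical cutoff of 3 quoted on the slide.
candidates = [i for i, score in enumerate(z) if score > 3]
```

Only the last position exceeds the cutoff here, so it would be the single candidate functional position in this toy example.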

6
Subfamily data
  • What exactly constitutes a family or subfamily?
  • not always clear
  • automated tree generation could not separate
    data into clear subfamilies
  • use of PFAM alignments and SWISSPROT data
  • Subfamilies are not clearly defined in databases
  • divided proteins from PFAM database into
    subfamilies based on SWISSPROT data
  • keyword search limited to enzymatic activity
    string in SWISSPROT
  • put into groups, then checked for obvious
    mistakes
  • also eliminated divisions easily discernable by
    sequence comparison
  • 62 groupings from 42 alignments remained
  • randomly pick 11 to produce 42 groups over 42
    alignments

7
Subfamilies
  • Four very large families to test their results on
  • nucleotidyl cyclases
  • eukaryotic protein kinases
  • lactate/malate dehydrogenases
  • trypsin-like serine proteases
  • Nucleotidyl cyclases
  • membrane-attached or cytosolic, cyclize (GTP →
    cGMP) or (ATP → cAMP)
  • found residues 1018, 938, which correlate with
    previous results
  • also identified residues which have not been
    tested experimentally
  • Protein kinases
  • phosphorylate serine/threonine or tyrosine
    residues
  • compared with experimental results, some ser/thr
    vs tyr kinase differences were not detected
  • inconsistency (no conservation) within the
    subfamily
  • residues which were common to both ser/thr and
    tyr kinases

8
Subfamilies (cont)
  • Lactate/Malate Dehydrogenases
  • common to a very wide variety of organisms-
    highly divergent
  • results mostly as expected- but a few residues
    identified outside of active site
  • Serine Proteases
  • cut protein backbone- differing specificity as
    to where (what aa precedes cut)
  • specificity pocket determines where protease can
    bind
  • identified 2 out of 3 of experimentally determined
    pocket residues
  • (third had a low z-score because of tolerance in
    one protein family)
  • also identified a few residues outside of the
    active site

9
Prediction of Protein Subfamily
  • Sequence Similarity
  • straight similarity with other sequences
    (ignoring gaps)
  • BLAST
  • database search, assign to nearest subfamily
    with best alignment
  • HMM method
  • align sequence of sub-type to all HMMs of
    subfamilies and assign it to best alignment
  • will attempt to do iterative optimization of
    match
  • Profile method
  • take original HMM, and probability profile
  • Sub-profile method
  • only use residues in above formula that have a
    positive Z-score
  • to reduce noise, restrict to values that have
    above average positive relative entropy

10
Casari, et al. (1995) A method to predict
functional residues in proteins
  • Input a multiple-sequence alignment
  • each sequence is converted to a vector of size
    20l, where l is the length of the alignment
  • Generation of an N x 20l matrix
  • one sequence produces a vector of dimension
    20l
  • N sequences produce N vectors of dimension
    20l
  • Use Principal Component Analysis
  • get the covariance matrix- tells you how factors
    are correlated to one another
  • eliminate covariance by finding
    eigenvectors/eigenvalues of covariance matrix
  • largest eigenvalues and corresponding
    eigenvectors give you principal components
  • i.e. the largest factors determining distribution
    of your dataset
  • they take the three largest (the largest of
    which represents consensus sequence)
  • project their 20l dimensional data onto those 3
    dimensions
  • this can be used to predict a protein subfamily
    for a given protein

11
General Weirdness
  • Construction of a comparison matrix
  • multiply the matrix by its transpose
  • solve for eigenvectors and eigenvalues as before
  • Columns of f represent amino acid values and
    positions
  • becomes possible to examine individual amino
    acid residues and positions
  • plotted on graph, shows residue correlation to
    type of protein subfamily
  • does this actually work?