Predicting cellular localization - PowerPoint PPT Presentation

Provided by: KimmenSj2
Transcript and Presenter's Notes

1
Predicting cellular localization
  • Bioe C144/C244
  • Fall 2010

2
Eukaryotic protein localization
3
Why localize?
  • Subcellular localization is a key functional
    characteristic of proteins.
  • To co-operate for a common physiological function
    (metabolic pathway, signal transduction cascade,
    structural associate etc.), proteins must be
    localized in the same cellular compartment.

http://mendel.imp.univie.ac.at/CELL_LOC/
4
Correct Localization is required for
pathway/complex formation
  • A set of many co-operating proteins is
    responsible for a physiological function
    (metabolic pathway, signal transduction cascade,
    structural associate etc.).
  • Subcellular localization is an essential
    characteristic for this level.
  • For proper functioning, the protein has to be
    translocated to the correct intra- or
    extracellular compartments in a soluble form or
    attached to a membrane.

http://mendel.imp.univie.ac.at/CELL_LOC/
5
Computer-Aided Approaches for the Assignment of
Subcellular Localization
  • Automatic, computer-aided selection methods are
    clearly the only way to identify interesting,
    attractive target proteins in the haystack of
    new gene sequence data.
  • One of the helpful decision criteria is the
    probable subcellular localization of the gene
    products.
  • For example, in a search for virulence factors of
    pathogenic bacteria or easily accessible entry
    points for pharmaceutical drugs, extracellular
    proteins are good candidates.

http://mendel.imp.univie.ac.at/CELL_LOC/
6
Computer-Aided Approaches for the Assignment of
Subcellular Localization
  • For a primary screening of gene sequences, the
    first step is a general classification into
    intracellular, extracellular, membrane-related
    (both with transmembrane regions and with lipid
    anchors) and viral proteins.
  • In the case of eukaryotes, the intracellular
    location should be further resolved by
    organelle (mitochondrion, chloroplast,
    endoplasmic reticulum and Golgi apparatus,
    nucleus).

http://mendel.imp.univie.ac.at/CELL_LOC/
7
Predicting subcellular localization by homology
with characterized proteins
  • Subcellular localization can often be assigned by
    searching for homologous sequences.
  • This is an easy task for a few new proteins but
    very difficult for thousands of sequences
    contained in new genomes.
  • Even with the most advanced retrieval systems and
    relying on the well-annotated SWISS-PROT, it is
    impossible to get exhaustive classifications with
    respect to subcellular localization.

http://mendel.imp.univie.ac.at/CELL_LOC/
8
Prediction method 2: analysis of sequence
properties
  • First attempts to classify proteins with respect
    to cellular localization based on amino acid
    sequence properties: Nishikawa and Ooi (J. Biochem.,
    1982)
  • amino acid composition, disulphide bonds, the
    secondary structural class related to function
    and localization
  • Early results were promising, but based on a
    small sample.

http://mendel.imp.univie.ac.at/CELL_LOC/
9
Prediction by signal peptide detection
  • Some proteins have sequence signals that
    determine their translocation to organelles or
    outside the cell
  • Claros et al. Curr.Op.Struct.Biol. (1997).
  • These patterns are not clear cut, especially for
    the intracellular organelle targeting peptides
  • prediction accuracy is limited
  • Nielsen et al. Prot.Eng. (1997) v.10, 1
  • Combinations of compositional and signal sequence
    analyses have been used in expert systems for the
    prediction of cellular localization
  • Nakai and Kanehisa, Genomics (1992)
  • In general not systematic and not rigorously
    tested

http://mendel.imp.univie.ac.at/CELL_LOC/
10
Extracting information from sequence
  • Signal peptides: short sequences in the protein
    used to target the protein to specific cellular
    compartments.
  • Signal patches (clusters of amino acids in close
    proximity in 3D structure, but distant in primary
    sequence) are also found.
  • Examination of amino acids at the structure surface
    can be particularly helpful: different amino acids
    show subtle preferences for different
    environments.

11
Trans-membrane helix prediction
12
(No Transcript)
13
Helical membrane proteins
  • Key components in cell-cell signalling
  • Mediate transport of ions and solutes across
    membrane
  • Crucial for recognition of self
  • Major class of drug targets
  • More than 50% of prescription drugs act on GPCRs
    (G-protein coupled receptors)
  • Multi-billion dollar industry

14
Many predicted, few known
  • Solved structures available for very few membrane
    proteins
  • Predicted 10K helical membrane proteins in human
    genome (25% of genome!)

Chen and Rost, 2002
15
Helical membrane proteins challenge bioinformatics
  • Very little info about 3D structures
  • Very hard to crystallize
  • Hardly traceable by nuclear magnetic resonance
    (NMR) spectroscopy
  • Relatively easy to identify (rough) location of
    helices through low-resolution experiments
  • C-terminal fusion with indicator proteins
  • Antibody binding

Chen and Rost, 2002
16
Concepts for predicting TM helix location and
topology
  • Hydrophobicity scales provide simple criteria for
    prediction
  • TM helices are predominantly non-polar
  • TM helix length between 12-35 aa
  • Globular regions between membrane helices
    typically shorter than 60 aa
  • Positive-inside rule (von Heijne)
  • Connecting loop regions on inside have more
    positive charge than loop regions on outside

Chen and Rost, 2002
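
The positive-inside rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not TopPred or any published method: it assumes the TM helix positions are already known, counts only lysine (K) and arginine (R) as positive residues, and ignores loop length. The function names and the half-open (start, end) interval convention are hypothetical choices made for this example.

```python
def loops_between(seq_len, helices):
    """Return half-open (start, end) intervals of the non-helix regions,
    ordered from N-terminus to C-terminus."""
    loops, prev = [], 0
    for start, end in sorted(helices):
        loops.append((prev, start))
        prev = end
    loops.append((prev, seq_len))
    return loops


def predict_n_terminus(seq, helices):
    """Guess whether the N-terminus is cytoplasmic ("in") or not ("out").

    Loops alternate sides of the membrane, so loops 0, 2, 4, ... share a
    side with the N-terminus. The side carrying more K/R residues is
    called the inside (assumption: K and R are the only positive residues
    counted, and every loop counts regardless of length).
    """
    same_side = other_side = 0
    for idx, (start, end) in enumerate(loops_between(len(seq), helices)):
        positives = sum(seq[start:end].count(aa) for aa in "KR")
        if idx % 2 == 0:
            same_side += positives
        else:
            other_side += positives
    if same_side == other_side:
        return "undetermined"
    return "in" if same_side > other_side else "out"


# Toy two-helix protein: both terminal loops are rich in K/R.
toy = "MKRKK" + "L" * 20 + "DDSGE" + "L" * 20 + "RKKRR"
print(predict_n_terminus(toy, [(5, 25), (30, 50)]))
```

Real topology predictors combine this charge bias with the hydrophobicity signal rather than treating helix positions as given.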
17
Hydrophobicity scales
  • Kyte and Doolittle (20 yrs ago)
  • Hydropathy scale, moving window approach
  • Window of 19 residues discriminated best between
    membrane and globular
  • Other work equally successful
  • Drawback: methods fail to discriminate between
    membrane regions and highly hydrophobic globular
    segments

Chen and Rost, 2002
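
As a rough sketch of the moving-window approach described above, the Python below averages Kyte-Doolittle hydropathy values over a 19-residue window. The scale values are the published Kyte-Doolittle numbers; the 1.6 cutoff is the value commonly quoted for that scale and should be treated as an adjustable assumption. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (one value per amino acid).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence.

    Entry i covers residues i .. i + window - 1. Sequences shorter than
    the window give an empty profile.
    """
    scores = [KD[aa] for aa in seq.upper()]
    if len(scores) < window:
        return []
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        # Slide the window one residue to the right.
        total += scores[i] - scores[i - window]
        profile.append(total / window)
    return profile


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows whose mean hydropathy reaches the cutoff
    (threshold is an assumption, not a calibrated value)."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h >= threshold]


toy = "D" * 10 + "L" * 19 + "K" * 10
print(candidate_tm_windows(toy))
```

The drawback named on the slide applies directly: a buried hydrophobic stretch in a globular protein would pass this cutoff just as a true membrane helix does.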
18
Other clues
  • Amino acid preferences for membrane and
    non-membrane proteins
  • Training data for methods derived from proteins
    identified as containing TM helices, as well as
    other secondary structure types
  • Higher accuracy

Chen and Rost, 2002
19
Including topology helps
  • TopPred (von Heijne, 1992)
  • Topology prediction, using hydrophobicity
    analysis, possible topologies ranked by
    positive-inside rule
  • SOSUI (Hirokawa et al, 1998)
  • Combined KD hydropathy, amphiphilicity, relative
    and net charges, protein length

Chen and Rost, 2002
20
Including homology helps
  • Alignment of homologs known to help secondary
    structure prediction (Rost and Sander, 1993)
  • Note: for 20-30% of proteins in any genome, no
    identifiable homologs can be found!
  • PHDhtm: first method using homology info for
    membrane prediction
  • Uses neural networks, dynamic programming (DP), multiple alignment
  • one of the most accurate prediction methods

Chen and Rost, 2002
21
Including homology helps
  • TMAP (Persson and Argos, 1996)
  • Derived amino acid propensities from known TMs
  • 4-residue caps of membrane helices
  • 21 residue TM segments
  • Found at outside of membrane: N D G F P W Y V
  • Found mostly inside: A R C K
  • Used these propensities to improve prediction

Chen and Rost, 2002
22
Grammatical rules
  • TMHMM pioneered building models of predicted
    membrane proteins in one consistent methodology
  • Sonnhammer et al 1998, Krogh et al 2001
  • Similar concept implemented in HMMTOP
  • Tusnady and Simon, 1998
  • MEMSAT similar to HMMTOP
  • Jones et al, 1994

Chen and Rost, 2002
23
Topology questions
  • The topology of a TM protein indicates its
    orientation with respect to the membrane
  • which regions are outside (extracellular) and
    which are cytoplasmic
  • Predicted topologies turn out to be wrong roughly
    as often as they're correct

Chen and Rost, 2002
24
Sequence information aiding TM recognition
  • Hydrophobic stretches (for lipid bilayer)
  • Positive inside rule
  • Von Heijne 1986, 1994
  • Abundance of positively charged residues
  • Improved predictions through use of
  • sliding windows
  • Multiple alignment
  • Neural networks

Chen and Rost, 2002
25
Errors in TM prediction
  • Under-prediction (False negative)
  • Over-prediction (False positive)
  • False merge
  • two adjacent helices predicted to be one helix
  • False split
  • One long helix predicted to be two
  • Inexact placement of helices

Chen and Rost, 2002
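
The error types on this slide can be made concrete by comparing observed and predicted helix intervals. The sketch below is illustrative only: the half-open (start, end) convention, the function names, and the 3-residue minimum overlap are assumptions for this example, not the criteria used in any particular benchmark. Inexact placement is not scored here.

```python
def overlap(a, b):
    """Number of residues shared by two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def classify_errors(observed, predicted, min_overlap=3):
    """Sort helix-level disagreements into the four categories on the slide.

    min_overlap is an assumed cutoff: two helices "match" when they share
    at least that many residues.
    """
    errors = {
        "under_prediction": [],  # observed helix with no prediction
        "over_prediction": [],   # predicted helix with no observed partner
        "false_merge": [],       # one prediction spanning several observed helices
        "false_split": [],       # one observed helix covered by several predictions
    }
    for obs in observed:
        hits = [p for p in predicted if overlap(obs, p) >= min_overlap]
        if not hits:
            errors["under_prediction"].append(obs)
        elif len(hits) > 1:
            errors["false_split"].append(obs)
    for pred in predicted:
        hits = [o for o in observed if overlap(o, pred) >= min_overlap]
        if not hits:
            errors["over_prediction"].append(pred)
        elif len(hits) > 1:
            errors["false_merge"].append(pred)
    return errors


observed = [(10, 30), (40, 60), (80, 100), (120, 145)]
predicted = [(12, 58), (121, 131), (134, 144), (200, 220)]
print(classify_errors(observed, predicted))
```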
26
Prediction accuracy (1)
  • Performance accuracy overestimated significantly!
  • developers have overrated their methods by
    15-50% (Chen et al., unpublished)
  • Why do developers overestimate their method
    accuracy?
  • Validation performed on proteins closely related
    to training sequences (and thus not indicative of
    performance on novel sequences)

Chen and Rost, 2002
27
Prediction accuracy (2)
  • Membrane helices are not entirely conserved
    across species
  • Implies that even related proteins may have
    different topologies (number of TM helices, orientation)
    and perform different cellular functions
  • N.B. There is no indication that the authors
    meant to imply that proteins that are globally
    alignable have differences in their TM domain
    locations or numbers
  • Measures of accuracy of prediction not comparable
    across methods, due to lack of standard benchmark
  • Benchmark dataset now available at EBI

Chen and Rost, 2002
28
Chen et al findings
  • Most TM methods get the number of helices right
    for most membrane proteins
  • 86% of TMH residues predicted by best methods
  • 70-75% of proteins get all TM helices predicted
    correctly by top methods
  • Topology correct for only half of all proteins

Chen and Rost, 2002
29
Prediction accuracy (4)
  • Some papers have claimed that simple
    hydrophobicity scales are as accurate as more
    sophisticated methods
  • Chen et al disagree

Chen and Rost, 2002
30
Prediction accuracy (5)
  • All methods confuse membrane helices with signal
    peptides
  • Best separation provided by ALOM2 (Nakai and
    Kanehisa)
  • Optimized to sort proteins into classes of
    sub-cellular localization

Since Rost's paper, the Phobius server was
developed to integrate TM and signal peptide
prediction: http://www.ebi.ac.uk/Tools/phobius/index.html
Chen and Rost, 2002
31
Prediction accuracy (6)
  • Most methods wrongly predict membrane helices in
    globular proteins
  • Most methods overestimate their ability to
    distinguish between globular and membrane
    proteins

Chen and Rost, 2002
32
Emerging and future developments
  • Improved prediction by averaging over many
    methods (i.e., consensus approaches)
  • Promponas and colleagues: CoPreTHi combined 7
    methods, requiring 3 to agree
  • Nilsson et al, 2000, used 5 methods
  • Accuracy correlated with number of methods
    agreeing

Chen and Rost, 2002
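
A per-residue vote is one simple way to picture the consensus idea above. The sketch below is a simplification and not the actual CoPreTHi algorithm: it assumes each method reports a string with "H" for a membrane-helix residue and "-" otherwise, and it calls a residue TM when at least min_agree methods say so. The default of 3 mirrors the "requiring 3 to agree" figure on the slide; the function name and input format are hypothetical.

```python
def consensus_tm(predictions, min_agree=3):
    """Combine per-residue predictions from several methods.

    predictions: list of equal-length strings, "H" = TM helix residue,
    any other character = not TM. Returns a string in the same format.
    """
    lengths = {len(p) for p in predictions}
    if len(lengths) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    consensus = []
    for column in zip(*predictions):
        votes = sum(1 for call in column if call == "H")
        consensus.append("H" if votes >= min_agree else "-")
    return "".join(consensus)


# Seven hypothetical method outputs for an 8-residue toy sequence.
methods = [
    "--HHHH--",
    "-HHHH---",
    "--HHH---",
    "------HH",
    "--------",
    "---HH---",
    "--------",
]
print(consensus_tm(methods))
```

The number of agreeing methods per residue could also be kept as a confidence value, in line with the observation that accuracy correlates with agreement.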
33
Emerging and future developments
Chen and Rost, 2002
  • Amphiphilic (aka amphipathic) alpha helix
    identification can improve prediction
  • Helical-membrane and signal peptide predictions
    must be combined explicitly
  • Best signal peptide prediction tool is SignalP
    (Nielsen et al 1997)
  • PSORT, HMMTOP and TMHMM integrate these
    predictions
  • More thorough combination is still missing

Except, of course, for Phobius, released since
this paper
34
Emerging and future developments
  • Databases of TM proteins being produced and
    curated
  • Membrane-specific substitution matrices improve
    database search for TM proteins
  • Current substitution matrices based on globular
    proteins
  • Henikoff and Henikoff have a membrane-helix-specific
    substitution matrix, PHAT

Chen and Rost, 2002
35
Sequence conservation in TM domains
  • Residues on helix-helix interface tend to be more
    conserved than those facing the lipid bilayer
  • Conservation in TM helices greater than
    structurally variable regions but not as
    significant as enzyme active sites and other
    functionally critical regions (KS observation)

36
More data from structural studies of TM proteins
  • Solved membrane protein structures have also
    shown that helical propensities are different in
    the membrane.
  • Glycine and proline, which are thought to be
    helix-breakers in soluble proteins, occur in the
    transmembrane helices of cytochrome c oxidase
  • Tsukihara et al, 1995.
  • Studying known structures has revealed that
    aromatic residues are often in the bilayer
    interface, possibly anchoring the transmembrane
    helix in the bilayer
  • Pawagi et al, 1994.

37
More data from structural studies
  • Serine and threonine can satisfy hydrogen bond
    donors and acceptors by hydrogen bonding to
    backbone carbonyls, making membrane localization
    favorable (Engelman et al 1986)
  • Analysis of solved membrane proteins shows TM
    length ranges from 14-36 aa (varying due to
    variations in lipid bilayer width)
  • Canonical alpha helix prediction methods derived
    from soluble proteins are not as effective at
    predicting TM-located helices

38
TMHMM provides a grammar to parse sequences into
subregions
39
TMHMM author findings
  • TMHMM correctly predicts 97-98% of the
    transmembrane helices.
  • TMHMM can discriminate between soluble and
    membrane proteins with both specificity and
    sensitivity better than 99%
  • although the accuracy drops when signal peptides
    are present
  • This high degree of accuracy allowed authors to
    predict reliably integral membrane proteins in a
    large collection of genomes.
  • Based on these predictions, authors estimate that
    20-30% of all genes in most genomes encode
    membrane proteins
  • which is in agreement with previous estimates.
  • Proteins with Nin-Cin topologies are strongly
    preferred in all examined organisms
  • except Caenorhabditis elegans, where the large
    number of 7TM receptors increases the counts for
    Nout-Cin topologies.

40
Aspects of model
  • Specialized modeling of various regions
  • Helix caps
  • Middle of helix
  • Regions close to membrane
  • Globular domains (all modeled identically)
  • TM amino acid stats derived from known TM domains
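
To illustrate how a hidden Markov model can act as a grammar that parses a sequence into subregions, here is a toy Viterbi decoder. It is a sketch under heavy assumptions and is not TMHMM's architecture: it has only four states (inside loop "i", outside loop "o", and two membrane states so that helices must alternate direction), residues are reduced to three classes, and every probability is invented for illustration.

```python
import math

STATES = ("i", "Mio", "o", "Moi")
# Allowed transitions encode the grammar: i -> helix -> o -> helix -> i ...
TRANS = {
    "i": {"i": 0.9, "Mio": 0.1},
    "Mio": {"Mio": 0.9, "o": 0.1},
    "o": {"o": 0.9, "Moi": 0.1},
    "Moi": {"Moi": 0.9, "i": 0.1},
}
START = {"i": 0.5, "o": 0.5}  # a protein starts in a loop (assumption)
MEMBRANE = {"hydrophobic": 0.85, "positive": 0.01, "other": 0.14}
EMIT = {  # invented numbers; inside loops are enriched in K/R
    "i": {"hydrophobic": 0.2, "positive": 0.3, "other": 0.5},
    "o": {"hydrophobic": 0.2, "positive": 0.05, "other": 0.75},
    "Mio": MEMBRANE,
    "Moi": MEMBRANE,
}
NEG_INF = float("-inf")


def residue_class(aa):
    """Collapse the 20 amino acids into three crude classes."""
    if aa in "AILMFVWC":
        return "hydrophobic"
    if aa in "KR":
        return "positive"
    return "other"


def viterbi(seq):
    """Most probable state path, returned as a string of i / M / o labels."""
    obs = [residue_class(aa) for aa in seq.upper()]
    if not obs:
        return ""
    score = {
        s: (math.log(START[s]) if s in START else NEG_INF)
        + math.log(EMIT[s][obs[0]])
        for s in STATES
    }
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev, best = None, NEG_INF
            for p in STATES:
                if s in TRANS[p] and score[p] > NEG_INF:
                    candidate = score[p] + math.log(TRANS[p][s])
                    if candidate > best:
                        best_prev, best = p, candidate
            pointer[s] = best_prev
            if best_prev is None:
                new_score[s] = NEG_INF  # state unreachable at this position
            else:
                new_score[s] = best + math.log(EMIT[s][symbol])
        score = new_score
        backpointers.append(pointer)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return "".join("M" if s.startswith("M") else s for s in path)


print(viterbi("MKRKKS" + "L" * 20 + "DSGNDT"))
```

Because helix boundaries and topology fall out of a single decoding pass, a model of this kind predicts both in one consistent framework, which is the point made on the earlier "Grammatical rules" slide.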

41
Training data
42
Signal peptide prediction
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
Chloroplast transit peptides are hard to detect
47
(No Transcript)
48
(No Transcript)
49
A plant GPCR??
Arabidopsis Thaliana GCR2
50
only one Arabidopsis putative GPCR protein
(GCR1) has been characterized in plants (17-20),
and no ligand has been defined for any plant GPCR
51
Transmembrane structure prediction suggests that
GCR2 is a membrane protein with seven
transmembrane helices
-Despite bold claim of 7TM GPCR, only two
prediction servers used, no confidence values
indicated, and figure ended up in Supplemental
Material!
DAS
TMPred
52
(No Transcript)
53
Liu et al predicted GCR2 as a seven-transmembrane
protein (7TM), using the TMpred and DAS
programs, but did not report score thresholds to
evaluate the confidence of these predictions.
TMpred and DAS are known to erroneously predict
transmembrane helices within soluble proteins
(55% and 83% false positive rates, respectively)
GCR2 alignment with LanC superfamily (non 7-TM
GPCR)
54
(No Transcript)
55
(No Transcript)
56
We initially predicted that GCR2 was a
seven-transmembrane protein using TMpred and
DAS software programs (2). We further used 12
distinct software programs to predict the
topological structure of GCR2 and found that 9 of
them (TMHMM, SOSUI, and DAS-TMfilter
excluded) showed that GCR2 is a transmembrane
protein with various numbers of transmembrane
domains. TMHMM has underpredicted
transmembrane domains in many instances (3), and
the only other reported GPCR in Arabidopsis, GCR1
(4), was predicted to be a three-transmembrane
protein by SOSUI. In addition, about 14% of
known transmembrane proteins (established by
crystal structure or biochemical evidence) cannot
be correctly predicted by available software (3).
Thus, computational prediction of membrane
proteins is not yet a mature science and mainly
serves to generate hypotheses for experimental
testing.
57
(No Transcript)
58
Discussion
Based on the evidence presented, who do you think
is right? TM prediction and validation are
challenging for bioinformaticians
and experimentalists alike!
59
TM/Signal Peptide/Localization prediction servers
  • Phobius http://phobius.sbc.su.se
  • -combined topology and signal peptide prediction
  • TMHMM http://www.cbs.dtu.dk/services/TMHMM/
  • -TM helix prediction
  • TargetP http://www.cbs.dtu.dk/services/TargetP/
  • -subcellular localization of eukaryotic proteins
  • SignalP http://www.cbs.dtu.dk/services/SignalP/
  • -predicts the presence and location of signal
    peptide cleavage sites

60
Summary points
  • Protein localization is a critical aspect of
    protein function
  • Methods for predicting localization can
    over-estimate their expected accuracy
  • Datasets used in validation typically differ from
    one method to the next, so results are not
    comparable
  • Expect false positive and false negative
    predictions
  • Consensus prediction using various types of
    information and predictions is your best bet for
    improving accuracy