Title: Protein Subcellular Localization
1Protein Subcellular Localization
- Shan Sundararaj
- ss23_at_ualberta.ca
- June 24, 2005
2Why is Localization Important?
- Function is dependent on context
- Co-localization of proteins of related function
- Valuable annotation for new proteins
- Design of proteins with specific targets
- Drug targeting
- Accessibility
- Membrane-bound > cytoplasmic > nuclear
3Bacteria
- Gram Positive (3-4 states): cytoplasm, cytoplasmic membrane, cell wall, extracellular
- Gram Negative (5 states): cytoplasm, cytoplasmic membrane, periplasm, outer membrane, extracellular
4Eukaryotic Cell
- Compartmentalized
- Diverse range of specific organelles
- Plants: chloroplasts, chromoplasts, other plastids
- Muscle: sarcoplasm
- Various endosomes, vesicles
(modified from Voet & Voet, Biochemistry, Wiley-VCH 1992)
5Yet more categories
Chloroplast
Mitochondrion
Yeast specific
6Localization signaling
- Proteins must have intrinsic signals for their localization: a cellular address
- E.g. N-terminal signal sequences
321 Nuclear Inner Membrane Lane
Nucleus, Intracellular County
Eukaryotic Cell CL34V3M3
7Localization signaling
- Some signals are easily recognizable
- Signal peptidase cleavage site, consensus sequence for secretion → extracellular
- Address printed neatly, postal code
- Others are difficult to understand
- Outer membrane β-barrel proteins, no consensus sequence, few sequence restraints
- Sloppy address, different kind of code that we don't understand yet
8Experimental determination
- Since we don't fully understand the language of proteins, our knowledge must often come from inference
- Predicting localization is like sorting mail based only on examples of where some mail has gone before
- Important to have good data sets of proteins with known localizations
9Datasets
- Organelle DB (http://organelledb.lsi.umich.edu/)
- 25095 eukaryotic proteins from subcellular proteomics studies
- DBSubLoc (http://www.bioinfo.tsinghua.edu.cn/guotao/download.html)
- Combines Swiss-Prot and PIR annotations (64051 proteins)
- PSORTdb (http://db.psort.org/)
- Bacterial. 1591 Gram ve proteins, 574 Gram ve proteins
- SignalP (http://www.cbs.dtu.dk/ftp/signalp/)
- 940 plant and 2738 human proteins
- Yeast Protein Localization Server (http://bioinfo.mbb.yale.edu/genome/localize/)
- 2956 yeast proteins
10Experimental Methods
- Electron microscopy
- GFP tagging / fluorescence microscopy
- Subcellular fractionation and detection
- Western blotting
- Mass spectrometry
11Fluorescence Microscopy
- Tag gene at either the 3′ or 5′ end
- Using GFP (or RFP, YFP, CFP, etc.)
- Using an epitope tag and a fluorescently labeled antibody
- Be careful not to remove signal peptides!
- Also use a subcellular-specific marker or stain
- Visualize with confocal fluorescence microscopy
and analyze images for co-localization
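One common way to put a number on co-localization is the Pearson correlation between the pixel intensities of the two channels. The following is a minimal sketch in Python, assuming the GFP and marker channels have already been extracted as equal-length lists of intensities; the function name and the toy values are hypothetical and not taken from any particular imaging package.

```python
import math

def pearson_colocalization(channel_a, channel_b):
    """Pearson correlation between two channels' pixel intensities."""
    if len(channel_a) != len(channel_b) or len(channel_a) < 2:
        raise ValueError("channels must have the same length (at least 2 pixels)")
    n = len(channel_a)
    mean_a = sum(channel_a) / n
    mean_b = sum(channel_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(channel_a, channel_b))
    var_a = sum((a - mean_a) ** 2 for a in channel_a)
    var_b = sum((b - mean_b) ** 2 for b in channel_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant channel")
    return cov / math.sqrt(var_a * var_b)

# Made-up intensities for illustration only.
gfp = [10, 200, 180, 15, 12, 220]
rfp_marker = [12, 190, 170, 20, 10, 210]   # bright where GFP is bright
rfp_other = [200, 10, 15, 180, 190, 12]    # bright where GFP is dark

print(pearson_colocalization(gfp, rfp_marker))  # close to +1: co-localized
print(pearson_colocalization(gfp, rfp_other))   # close to -1: mutually exclusive
```

Values near +1 suggest the tagged protein and the compartment marker occupy the same pixels; in practice the images would first need background subtraction and thresholding.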
12Confirmation by Co-localization (GFP/RFP merging)
13High-Throughput Experiments
- Kumar et al., Genes Dev 2002, 16:707-719
- Epitope-tagged >60% of ORFs, visualized with fluorescently labeled antibody
- 2744 localizations (44% of S. cerevisiae genes)
- Huh et al., Nature 2003, 425:686-691
- GFP tagged all ORFs, RFP tagged compartments
- 4156 localizations (75% of S. cerevisiae genes)
- Combined, now nearly 87% of yeast proteins have a localization annotation
14Subcellular Fractionation
- Fractionate cells into organelles and other compartments using differential solubilization and centrifugation
- Once fractionated, take compartment of interest and separate proteins
- 2D gel or chromatography
- Identify separated proteins
- Mass spectrometry for high-throughput
- Western blot for specific proteins
15High-Throughput Experiments
- Lopez-Campistrous et al., Mol Cell Proteomics, 2005
- Subcellular fractionation of E. coli, 2D-gel separation, MS-MS
- 2,160 localizations to cytoplasm, inner membrane, periplasm, and outer membrane
16Predictions from known data
- Enough experimental data exists to build highly
accurate computational predictors of localization
17Computational methods for predicting localization
- Motif based methods
- Expert rule based methods
- Artificial Neural Nets (ANN)
- Hidden Markov Models (HMM)
- Naïve Bayes (NB)
- Support Vector Machines (SVM)
- Combination of above methods
18Predictions from known data
- Different information used for predictions
- Sequence motifs
- N-terminal secretory signal peptides, mitochondrial targeting peptide, chloroplast transit peptide
- C-terminal peroxisome import signal, ER retention signal
- Mid-sequence nuclear localization signals
- Amino acid composition
- AA frequency, dipeptide composition
- Homology
- Sequence comparison to proteins of known localization
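The sequence-derived features listed above can be sketched in a few lines of Python. This is a minimal illustration, not how any specific predictor computes them: the function names are hypothetical, the toy sequence is made up, and the two C-terminal motifs (KDEL for ER retention, SKL for peroxisome import) are only the canonical textbook examples; real tools use broader patterns.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Fraction of each overlapping residue pair that occurs in seq."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}

# Canonical examples only (assumption: exact match at the C-terminus).
C_TERMINAL_MOTIFS = {
    "KDEL": "ER retention signal",
    "SKL": "peroxisome import signal",
}

def c_terminal_signals(seq):
    """Labels of the example motifs found at the very end of seq."""
    return [label for motif, label in C_TERMINAL_MOTIFS.items()
            if seq.endswith(motif)]

toy_protein = "MKTAYIAKQRSKL"  # made-up sequence
print(aa_composition(toy_protein)["K"])
print(dipeptide_composition(toy_protein))
print(c_terminal_signals(toy_protein))
```

Composition vectors like these are the kind of fixed-length input that SVMs or neural networks can be trained on, while motif hits act as direct rules.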
19The PSORT Family
- PSORT: plant sequences
- Expert rule-based system
- PSORT II: eukaryotic sequences
- Probabilistic tree
- iPSORT: eukaryotic N-term. signal sequences
- ANN
- PSORT-B: bacterial sequences
- WoLF PSORT: eukaryotic
- Updated (2005) version of PSORT II
20PSORT-B http://www.psort.org/psortb/
21PSORT-B - methods
- Signal peptides: Non-cytoplasmic
- AA composition/patterns
- SVMs trained for each location vs. all other locations
- Transmembrane helices: Inner membrane
- HMMTOP
- PROSITE motifs: all localizations
- Outer membrane motifs: Outer membrane
- Homology to proteins of known localization
- SCL-BLAST
Integration with a Bayesian network
22PSORT-B results
- SeqID: Unannotated_bacterial2
- Analysis Report
- CMSVM-       Unknown       No details
- CytoSVM-     Cytoplasmic   No details
- ECSVM-       Unknown       No details
- HMMTOP-      Unknown       No internal helices found
- Motif-       Unknown       No motifs found
- OMPMotif-    Unknown       No motifs found
- OMSVM-       Unknown       No details
- PPSVM-       Unknown       No details
- Profile-     Unknown       No matches to profiles found
- SCL-BLAST-   Cytoplasmic   matched 118438 Cyto. protein
- SCL-BLASTe-  Unknown       No matches against database
- Signal-      Unknown       No signal peptide detected
- Localization Scores
- Cytoplasmic           9.97
- CytoplasmicMembrane   0.01
- Periplasmic           0.01
- OuterMembrane         0.00
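A report like this is typically read by taking the highest-scoring site and accepting it only if the score is high enough. The sketch below shows that reading in Python. The cutoff of 7.5 is an assumption for illustration (the slides do not state a threshold), and the function name is hypothetical.

```python
def call_localization(scores, cutoff=7.5):
    """Return (site, score) for the top-scoring site.

    If the best score is below the cutoff (an assumed value, not given
    in the slides), the site is reported as 'Unknown'.
    """
    site, score = max(scores.items(), key=lambda item: item[1])
    if score < cutoff:
        return "Unknown", score
    return site, score

# Localization scores copied from the report above.
scores = {
    "Cytoplasmic": 9.97,
    "CytoplasmicMembrane": 0.01,
    "Periplasmic": 0.01,
    "OuterMembrane": 0.00,
}
print(call_localization(scores))  # ('Cytoplasmic', 9.97)
```

Reporting "Unknown" below the cutoff trades coverage for precision: fewer proteins get a call, but the calls that are made are more reliable.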
23Proteome Analyst http://www.cs.ualberta.ca/bioinfo/PA/Sub/
24Proteome Analyst Feature Extraction
- TOP 3 Homologs
- AFP1_ARATH
- AFP1_BRANA
- AFP2_ARATH
- KW
- Plant defense; Fungicide; Signal; Multigene Family; Pyrrolidone carboxylic acid
- DR InterPro
- IPR002118; IPR003614
- CC Subcellular location
- Secreted
- Token Set
Plant defense, Fungicide, Signal, Multigene Family, Pyrrolidone carboxylic acid, IPR002118, IPR003614, Secreted
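The feature-extraction step can be sketched as collecting annotation fields from the top homologs into one set of tokens. This is a minimal Python illustration under stated assumptions: the annotations are assumed to be already parsed into dictionaries keyed by field name, the function name is hypothetical, and taking the union across homologs is a guess at how the tokens are merged.

```python
FIELDS = ("KW", "DR InterPro", "CC Subcellular location")

def build_token_set(homolog_annotations):
    """Union of keyword, InterPro and location tokens over all homologs."""
    tokens = set()
    for annotation in homolog_annotations:
        for field in FIELDS:
            tokens.update(annotation.get(field, []))
    return tokens

# Annotation values taken from the slide; attached here to a single
# homolog entry purely for illustration.
homologs = [
    {
        "KW": ["Plant defense", "Fungicide", "Signal",
               "Multigene Family", "Pyrrolidone carboxylic acid"],
        "DR InterPro": ["IPR002118", "IPR003614"],
        "CC Subcellular location": ["Secreted"],
    },
]
print(sorted(build_token_set(homologs)))
```

The resulting set is what a token-based classifier would receive as input for the query protein.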
25PASub - Results
[Figure: bar chart of the contribution of each token (feature), log scale]
26PASub - Interpretation
- Bars represent -log probability, so a little difference is a lot!
- Naïve Bayes chosen as classifier because of transparency of method
- Each token gives a probability that can be summed and shown graphically
- Neural network actually has higher recall
- Can change token set, ask to explain with different features
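The transparency argument can be made concrete with a small Naïve Bayes sketch in which each token contributes one log probability and the contributions are simply summed. This is an illustration only, not Proteome Analyst's implementation: the training examples are invented, the function names and the smoothing parameter `alpha` are hypothetical, and only tokens present in the query are scored (absent tokens are ignored for simplicity).

```python
import math
from collections import Counter, defaultdict

def train(examples, alpha=1.0):
    """examples: list of (token_set, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    token_counts = defaultdict(Counter)
    for tokens, label in examples:
        token_counts[label].update(tokens)
    return {"labels": label_counts, "tokens": token_counts, "alpha": alpha}

def explain(model, tokens, label):
    """Per-token log P(token | label), Laplace-smoothed."""
    n = model["labels"][label]
    a = model["alpha"]
    return {t: math.log((model["tokens"][label][t] + a) / (n + 2 * a))
            for t in tokens}

def score(model, tokens, label):
    """Log prior plus the sum of the per-token contributions."""
    prior = math.log(model["labels"][label] / sum(model["labels"].values()))
    return prior + sum(explain(model, tokens, label).values())

def predict(model, tokens):
    return max(model["labels"], key=lambda label: score(model, tokens, label))

# Invented toy training data.
examples = [
    ({"Signal", "Secreted"}, "extracellular"),
    ({"Signal", "Fungicide"}, "extracellular"),
    ({"Kinase", "ATP-binding"}, "cytoplasm"),
    ({"Ribosomal protein"}, "cytoplasm"),
]
model = train(examples)
query = {"Signal", "Plant defense"}

print(predict(model, query))
for token, logp in sorted(explain(model, query, "extracellular").items()):
    print(token, round(-logp, 3))  # -log probability, as plotted in the bars
```

Because the total is a plain sum, each bar in the explanation corresponds to exactly one token, which is what makes the prediction easy to inspect.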
27Save Time: Pre-computed Genomes
- PSORTDB
- http://db.psort.org
- Browse, search, BLAST, download
- 103 Gram ve bacteria, 45 Gram ve bacteria
- Proteome Analyst (PA-GOSUB)
- http://www.cs.ualberta.ca/bioinfo/PA/GOSUB/
- Browse, search, BLAST, download
- 15 bacterial and 8 eukaryotic
28Summary
- The data set of experimentally validated protein localizations is ever increasing, especially with high-throughput methods
- Many localization signals are still unknown, except for simple sequence motifs
- Prediction methods are very accurate, especially for bacteria and using machine learning techniques, but many motifs and other signals have yet to be discovered
29Future Directions
- Predict proteins with multiple localization sites, or with localization that changes over time
- Integrate structural information into localization prediction