Title: Shan Sundararaj
1Protein Subcellular Localization
- Shan Sundararaj
- University of Alberta
- Edmonton, AB
- ss23_at_ualberta.ca
2Why is Localization Important?
- Function is dependent on context
- Co-localization of proteins of related function
- Valuable annotation for new proteins
- Design of proteins with specific targets
- Drug targeting
- Accessibility
- Membrane-bound > cytoplasmic > nuclear
3Why is Localization Important?
- 1974 Nobel Prize in Physiology/Medicine
- George Palade
- for discoveries concerning the structural and
functional organization of the cell
- 1999 Nobel Prize in Physiology/Medicine
- Günter Blobel
- for the discovery that proteins have intrinsic
signals that govern their transport and
localization in the cell
4Bacteria
Gram Positive (3-4 states)
- cytoplasm
- cytoplasmic membrane
- cell wall
- Extracellular
Gram Negative (5 states)
- cytoplasm
- cytoplasmic membrane
- periplasm
- outer membrane
- Extracellular
5Eukaryotic Cell
- Compartmentalized
- Diverse range of specific organelles
- Plants: chloroplasts, chromoplasts, other
plastids
- Muscle: sarcoplasm
- Various endosomes, vesicles
(modified from Voet & Voet, Biochemistry,
Wiley-VCH 1992)
6Yet more categories
Chloroplast
Mitochondrion
Yeast specific
7Level of Annotation
- As simple as two states
- membrane protein vs. non-membrane protein
- secreted protein vs. non-secreted protein
- Gross compartments
- cytoplasm, inner membrane, periplasm, cell wall,
outer membrane, extracellular
- nucleus, mitochondria, peroxisome, vacuole
- Fine compartments
- Mitochondrial matrix, bud neck, spindle pole
- Any of 1425 GO cellular compartments
8Localization signaling
- Proteins must have intrinsic signals for their
localization: a cellular address
- E.g. N-terminal signal sequences
321 Nuclear Inner Membrane Lane
Nucleus, Intracellular County
Eukaryotic Cell CL34V3M3
9Localization signaling
- Some signals are easily recognizable
- Signal peptidase cleavage site, consensus
sequence for secretion → extracellular
- Address printed neatly, postal code
- Others are difficult to understand
- Outer membrane β-barrel proteins, no consensus
sequence, few sequence restraints
- Sloppy address, different kind of code that we
don't understand yet
10Experimental determination
- Since we don't fully understand the language of
proteins, our knowledge must often come from
inference
- Predicting localization is like sorting mail
based only on examples of where some mail has
gone before
- Important to have good data sets of proteins with
known localizations
11Datasets
- Organelle_DB (http://organelledb.lsi.umich.edu/)
- 25095 eukaryotic proteins from subcellular
proteomics studies
- DBSubLoc (http://www.bioinfo.tsinghua.edu.cn/guotao/download.html)
- Combines SwissProt and PIR annotations (64051
proteins)
- PSORTDB (http://db.psort.org/)
- Bacterial. 1591 Gram -ve proteins, 574 Gram +ve
proteins
- SignalP (http://www.cbs.dtu.dk/ftp/signalp/)
- 940 plant and 2738 human proteins
- YPL (http://bioinfo.mbb.yale.edu/genome/localize/)
- 2956 yeast proteins
12Experimental Methods
- Electron microscopy
- GFP tagging / fluorescence microscopy
- Subcellular fractionation + detection
- Western blotting
- Mass spectrometry
13Electron Microscopy
- Highest resolution, can work at the level of a
single protein complex
- Immunolabel proteins of interest in conjunction
with colloidal gold, and visualize
- Combined with electron tomography, can even
visualize unlabeled complexes
(from Koster and Klumperman, Nat Rev Mol Cell
Biol, Sep 2003, S6-10)
14Fluorescence Microscopy
- Tag gene at either 3' or 5' end
- Using GFP (or RFP, YFP, CFP, etc.)
- Using an epitope tag and a fluorescently labeled
antibody
- Careful of removing signal peptides!
- Also use a subcellular-specific marker or stain
- Visualize with confocal fluorescence microscopy
and analyze images for co-localization
15Specific co-labeling (yeast)
- Early Golgi: Cop1
- Endosome: Snf7
- ER to Golgi: Sec13
- Golgi apparatus: Anp1
- Late Golgi: Chc1
- Lipid particle: Erg6
- Mitochondrion: MitoTracker
- Nucleus: DAPI
- Nucleolus: Sik1
- Nuclear periphery: Nic96
- Peroxisome: Pex3
- Vacuole: FM4-64
Nuclear-specific DAPI staining
16Subcellular Fractionation
- Tissue homogenate
- Centrifuge at 1000 g
- Pellet: unbroken cells, nuclei, chloroplasts
- Transfer supernatant, centrifuge at 10,000 g
- Pellet: mitochondria
- Transfer supernatant, centrifuge at 100,000 g
- Pellet: microsomal fraction (ER, Golgi,
lysosomes, peroxisomes)
- Supernatant: cytosol, soluble enzymes
17Detergent Fractionation
- Cells
- Extraction with Digitonin/EDTA
- Supernatant: cytoplasmic fraction
- Pellet: extraction with Triton X-100/EDTA
- Organelle membranes
- Extraction with SDS/EDTA
- Nuclear
- Cytoskeletal (in SDS)
18Fractionation → Identification
- Once fractionated, take compartment of interest
and separate proteins
- 2D gel or chromatography
- Identify separated proteins
- Mass spectrometry for high-throughput
- Western blot for specific proteins
19Fractionation in proteomics
20High-Throughput Experiments
- Kumar et al., Genes Dev 2002, 16:707-719
- Epitope-tagged >60% of ORFs, visualized with
fluorescently labeled antibody
- 2744 localizations (44% of S. cerevisiae genes)
- Huh et al., Nature 2003, 425:686-691
- GFP tagged all ORFs, RFP tagged compartments
- 4156 localizations (75% of S. cerevisiae genes)
- Combined, now nearly 87% of yeast proteins have a
localization annotation
21High-Throughput Experiments
- Lopez-Campistrous et al., Mol Cell Proteomics,
2005
- Subcellular fractionation of E. coli, 2D-gel
separation, MS-MS
- 2,160 localizations to cytoplasm, inner membrane,
periplasm, and outer membrane
22Predictions from known data
- Enough experimental data exists to build highly
accurate computational predictors of localization
23Predictions from known data
- Different information used for predictions
- Sequence motifs
- N-terminal secretory signal peptides,
mitochondrial targeting peptide, chloroplast
transit peptide
- C-terminal peroxisome import signal, ER
retention signal
- Mid-sequence nuclear localization signals
- Amino acid composition
- AA frequency, dipeptide composition
- Homology
- Sequence comparison to proteins of known
localization
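The amino acid composition features listed above can be computed directly from a sequence. The following is a minimal sketch, not the feature code of any particular predictor; the function names and the choice to ignore non-standard residues are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq):
    """Fraction of each standard residue in seq (non-standard letters ignored)."""
    counts = Counter(res for res in seq.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered residue pairs among adjacent positions."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(valid)
    total = len(valid)
    return {a + b: (counts[a + b] / total if total else 0.0)
            for a, b in product(AMINO_ACIDS, repeat=2)}
```

The two dictionaries (20 and 400 values) can be concatenated into one fixed-length feature vector, which is the form most classifiers expect.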
24N-terminal signal peptides
- Common structure of signal peptides
- positively charged n-region, followed by a
hydrophobic h-region and a neutral but polar
c-region.
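The n-region / h-region description above suggests a very rough screen: look for a positive charge near the start, followed by a hydrophobic stretch. The sketch below uses the Kyte-Doolittle hydropathy scale; the window size, the cutoff of 2.0, and the function names are illustrative assumptions, and this heuristic is far cruder than trained predictors such as SignalP.

```python
# Kyte-Doolittle hydropathy values for the 20 standard residues.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def max_hydrophobic_window(seq, window=8):
    """Return (best mean hydropathy, start index) over all windows of seq."""
    best = (float("-inf"), -1)
    for i in range(len(seq) - window + 1):
        mean = sum(KD.get(r, 0.0) for r in seq[i:i + window]) / window
        if mean > best[0]:
            best = (mean, i)
    return best


def looks_like_signal_peptide(seq, n_len=6, scan=30, window=8, cutoff=2.0):
    """Crude check: K/R in the first few residues, then a hydrophobic stretch.

    All thresholds are assumptions chosen for illustration only.
    """
    n_term = seq[:scan].upper()
    has_positive_n_region = any(r in "KR" for r in n_term[1:n_len])
    score, start = max_hydrophobic_window(n_term, window)
    return has_positive_n_region and start >= 1 and score >= cutoff
```

The c-region and cleavage site are not modeled here; a real predictor scores those positions as well.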
25N-terminal signal peptides
26More work to do
- Multiple bacterial secretion pathways
- C-terminal signal peptides
- Internal mitochondrial transit peptides
- Structural aspects of targeting
- Gene re-localization
- Still a lot to discover in how signaling works!
27Computational methods for predicting localization
- Expert rule based methods
- Artificial Neural Nets (ANN)
- Hidden Markov Models (HMM)
- Naïve Bayes (NB)
- Support Vector Machines (SVM)
- Combination of above methods
28Naïve Bayes
- Assumption
- Features are conditionally independent, given
class labels
- Structure
- 1 level tree
- Class labels: root
- Features: leaf nodes
- Prediction
- $\mathrm{class}(f) = \arg\max_{c} P(C=c)\,P(F=f \mid C=c)$
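The prediction rule above can be sketched in a few lines. Under the independence assumption, P(F=f | C=c) factorizes into a product over individual features, so the classifier sums log probabilities. This is a minimal sketch over sets of annotation tokens with add-one smoothing; the training examples, token names, and function names are hypothetical and do not come from any real dataset or tool.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """examples: list of (set_of_tokens, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "token_counts": token_counts,
            "vocab": vocab, "alpha": alpha, "n": len(examples)}


def log_posteriors(model, tokens):
    """Unnormalized log P(C=c) + sum of log P(token | C=c) for each class."""
    v = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for c, n_c in model["class_counts"].items():
        total = sum(model["token_counts"][c].values())
        score = math.log(n_c / model["n"])
        for t in tokens:
            if t in model["vocab"]:  # tokens never seen in training are skipped
                score += math.log(
                    (model["token_counts"][c][t] + alpha) / (total + alpha * v))
        scores[c] = score
    return scores


def classify(model, tokens):
    """argmax over classes of the log posterior."""
    scores = log_posteriors(model, tokens)
    return max(scores, key=scores.get)


# Hypothetical toy training set, for illustration only.
toy_examples = [
    ({"Signal", "Fungicide"}, "extracellular"),
    ({"Signal", "Glycoprotein"}, "extracellular"),
    ({"Kinase", "ATP-binding"}, "cytoplasm"),
    ({"Ribosomal protein"}, "cytoplasm"),
]
toy_model = train_naive_bayes(toy_examples)
```

Because each token adds one term to the sum, the per-token terms can be reported individually, which is what makes the method easy to explain.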
29Artificial Neural Network
- Excellent for modeling non-linear input/output
relationships
- Robust to noise in training data
- Widely used in bioinformatics
30Support Vector Machines
- Input vectors are separated into positive vs.
negative instances
- Map to new feature space
- Find hyperplane that best separates the two
classes by distance (largest margin to the
nearest instances)
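The hyperplane search can be illustrated with a linear SVM trained by sub-gradient descent on the hinge loss. This is a simplified sketch: it omits the mapping to a new feature space (the kernel step), the learning rate and regularization values are arbitrary assumptions, and the 2-D toy points are hypothetical rather than real protein features.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on hinge loss with L2 regularization.

    X: list of feature vectors, y: labels in {+1, -1}.
    Returns (weights, bias) defining the hyperplane w.x + b = 0.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # Point is inside the margin or misclassified: push it out.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
toy_X = [(1.0, 1.0), (2.0, 2.0), (2.0, 1.0),
         (-1.0, -1.0), (-2.0, -2.0), (-1.0, -2.0)]
toy_y = [1, 1, 1, -1, -1, -1]
svm_w, svm_b = train_linear_svm(toy_X, toy_y)
```

For more than two localizations, one such classifier can be trained per location against all others.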
31Evaluating Predictors - Precision
[Figure: Predicted vs. True]
- # of proteins correctly labeled as cyt divided
by the total # of proteins labeled as cyt
- How often the label is correct
- If there are 90 proteins correctly labeled as
cyt, and 10 proteins incorrectly labeled as
cyt, then the precision is 90/100 = 0.90.
32Evaluating Predictors - Sensitivity
[Figure: Predicted vs. True]
- # of proteins correctly labeled as cytoplasmic
divided by the total # of proteins that are
cytoplasmic
- How many of the true results were retrieved
(also called recall or true positive rate)
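Both measures follow from three counts for a given localization: true positives, false positives, and false negatives. The sketch below reproduces the 90/100 precision example from the previous slide; the false-negative count of 30 used for the sensitivity example is a hypothetical number, and the function names are chosen for illustration.

```python
def confusion_counts(true_labels, predicted_labels, target):
    """Return (tp, fp, fn) for one localization class."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == target and truth == target:
            tp += 1
        elif pred == target:
            fp += 1
        elif truth == target:
            fn += 1
    return tp, fp, fn


def precision(tp, fp):
    """Fraction of proteins given the label that truly have it."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Fraction of proteins truly in the class that were given the label."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Slide example: 90 correct and 10 incorrect "cyt" labels.
cyt_precision = precision(90, 10)
# Hypothetical: 30 cytoplasmic proteins were missed by the predictor.
cyt_sensitivity = sensitivity(90, 30)
```

A predictor can trade one measure for the other, so both are normally reported together.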
33Predictions from known data
- Different information used for predictions
- Sequence motifs
- N-terminal secretory signal peptides,
mitochondrial targeting peptide, chloroplast
transit peptide
- C-terminal peroxisome import signal, ER
retention signal
- Mid-sequence nuclear localization signals
- Amino acid composition
- AA frequency, dipeptide composition,
hydrophobicity
- Homology
- Sequence comparison to proteins of known
localization
34TargetP, SignalP, P... http://www.cbs.dtu.dk/services/
- Sequence-based methods
- TargetP (85-90% recall)
- Predicts mitochondria/chloroplast/secreted
- Contains SignalP and ChloroP
- LipoP
- lipoproteins and signal peptides in Gram negative
bacteria
- SecretomeP
- non-classical secretion in eukaryotes
35SignalP result
- Common structure of signal peptides
- positively charged n-region, followed by a
hydrophobic h-region and a neutral but polar
c-region.
Cleavage site
Prediction: Signal peptide
Signal peptide probability: 0.945
Signal anchor probability: 0.000
Max cleavage site probability: 0.723 between pos.
28 and 29
36Organellar Prediction
- Predotar (http://www.inra.fr/predotar/) (80%
recall)
- Mitochondrial and plastid sequences: N-terminal
sequences
- MitoPred (http://mitopred.sdsc.edu/) (82% recall)
- Mitochondrial: PFAM domains, AA composition
- MitoProteome (http://www.mitoproteome.org/)
- Database of experimentally predicted human
mitochondrial proteins
- MitoP (http://ihg.gsf.de/mitop2/)
- Combines data from multiple experimental and
computational sources to give a consensus score
for each mitochondrial protein in yeast and
human
37The PSORT Family
- PSORT: plant sequences
- Expert rule-based system
- PSORT II: eukaryotic sequences
- Probabilistic tree
- iPSORT: eukaryotic N-term. signal sequences
- ANN
- PSORT-B: bacterial sequences
- WoLF PSORT: eukaryotic
- Updated (2005) version of PSORTII
38PSORT-B http://www.psort.org/psortb/
39PSORT-B - methods
- Signal peptides: Non-cytoplasmic
- AA composition/patterns
- SVMs trained for each location vs. all other
locations
- Transmembrane helices: Inner membrane
- HMMTOP
- PROSITE motifs: all localizations
- Outer membrane motifs: Outer membrane
- Homology to proteins of known localization
- SCL-BLAST
Integration with a Bayesian network
40PSORT-B results
- SeqID: Unannotated_bacterial2
- Analysis Report:
- CMSVM- Unknown No details
- CytoSVM- Cytoplasmic No details
- ECSVM- Unknown No details
- HMMTOP- Unknown No internal helices found
- Motif- Unknown No motifs found
- OMPMotif- Unknown No motifs found
- OMSVM- Unknown No details
- PPSVM- Unknown No details
- Profile- Unknown No matches to profiles found
- SCL-BLAST- Cytoplasmic matched 118438 Cyto. protein
- SCL-BLASTe- Unknown No matches against database
- Signal- Unknown No signal peptide detected
- Localization Scores:
- Cytoplasmic 9.97
- CytoplasmicMembrane 0.01
- Periplasmic 0.01
- OuterMembrane 0.00
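The localization scores at the end of such a report can be read programmatically to pick a final call. The sketch below parses lines shaped like the four score lines above and reports the top site only if it clears a cutoff. The 7.5 cutoff and the function names are assumptions made for illustration, and the parser handles only this simplified line layout, not the full PSORT-B output format.

```python
def parse_localization_scores(lines):
    """Parse lines such as '- Cytoplasmic 9.97' into {site: score}."""
    scores = {}
    for line in lines:
        parts = line.strip().lstrip("- ").split()
        if len(parts) != 2:
            continue  # not a 'site score' line
        try:
            scores[parts[0]] = float(parts[1])
        except ValueError:
            continue  # second field was not a number
    return scores


def final_prediction(scores, cutoff=7.5):
    """Return the top-scoring site, or 'Unknown' if it is below the cutoff."""
    if not scores:
        return "Unknown"
    site, best = max(scores.items(), key=lambda kv: kv[1])
    return site if best >= cutoff else "Unknown"


report_lines = [
    "- Cytoplasmic 9.97",
    "- CytoplasmicMembrane 0.01",
    "- Periplasmic 0.01",
    "- OuterMembrane 0.00",
]
report_scores = parse_localization_scores(report_lines)
```

Returning "Unknown" below the cutoff favors precision over recall: the predictor abstains rather than guess.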
41Proteome Analyst http://www.cs.ualberta.ca/bioinfo/PA/Sub/
42Proteome Analyst - Method
43Proteome Analyst - Feature Extraction
44Proteome Analyst Feature Extraction
- TOP 3 Homologs
- AFP1_ARATH
- AFP1_BRANA
- AFP2_ARATH
- KW
- Plant defense; Fungicide; Signal; Multigene
family; Pyrrolidone carboxylic acid
- DR InterPro
- IPR002118; IPR003614
- CC Subcellular location
- Secreted
- Token Set
{Plant defense, Fungicide, Signal, Multigene
family, Pyrrolidone carboxylic acid, IPR002118,
IPR003614, Secreted}
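Building the token set amounts to taking the union of selected annotation fields across the top homologs. The sketch below assumes each homolog's annotations have already been parsed into a dictionary; that record layout, the field names, and the placeholder homolog records are hypothetical, and the slide does not say how the listed values are split across the three homologs.

```python
def extract_tokens(homolog_annotations,
                   fields=("KW", "InterPro", "subcellular_location")):
    """Union of annotation values from the chosen fields of each homolog."""
    tokens = set()
    for entry in homolog_annotations:
        for field in fields:
            tokens.update(entry.get(field, []))
    return tokens


# Hypothetical records; the annotation values are those shown on the slide.
example_homologs = [
    {"KW": ["Plant defense", "Fungicide", "Signal"],
     "InterPro": ["IPR002118", "IPR003614"],
     "subcellular_location": ["Secreted"]},
    {"KW": ["Plant defense", "Multigene family",
            "Pyrrolidone carboxylic acid"],
     "InterPro": ["IPR003614"],
     "subcellular_location": ["Secreted"]},
]
example_tokens = extract_tokens(example_homologs)
```

Using a set means a token shared by several homologs counts once, so the classifier sees presence or absence rather than frequency.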
45PASub - Results
Contribution of each token
Log scale
Features
46PASub - Interpretation
- Bars represent -log probability, so a little
difference is a lot!
- Naïve Bayes chosen as classifier because of
transparency of method
- Each token gives a log probability that can be
summed and shown graphically
- Neural network actually has higher recall
- Can change token set, ask to explain with
different features
47Save Time: Pre-computed Genomes
- PSORTDB
- http://db.psort.org
- Browse, search, BLAST, download
- 103 Gram -ve bacteria, 45 Gram +ve bacteria
- Proteome Analyst (PA-GOSUB)
- http://www.cs.ualberta.ca/bioinfo/PA/GOSUB/
- Browse, search, BLAST, download
- 15 bacterial and 8 eukaryotic