Title: Protein Subcellular Localization Prediction of Eukaryotes using a Knowledge-based Approach
1Protein Subcellular Localization Prediction of
Eukaryotes using a Knowledge-based Approach
- Hsin-Nan Lin, Ching-Tai Chen, Ting-Yi Sung,
- Shinn-Ying Ho, Wen-Lian Hsu
- Bioinformatics Program, TIGP (Taiwan
International Graduate Program), Academia Sinica,
Taiwan
2Outline
- Introduction
- Methods
- Construction of the knowledge base SPKB
- KnowPredsite: a localization prediction method using SPKB
- Dataset
- Prediction Results
- Conclusions
3Protein Subcellular Localization (PSL)
- Given a protein, determine its subcellular compartment
- e.g., mitochondria, cytoplasm, nucleus, etc.
- It is important because it
- Helps elucidate protein functions
- Helps identify potential diagnostic and drug targets
- Wet-lab experiments are
- Time-consuming
- Labor-intensive
- Computational prediction is needed.
4Existing PSL Predictors
- Use various features
- Features extracted from literature or public databases
- Phylogenetic profiling
- Compartment-specific features
- Main problems of many predictors
- They predict only a limited number of locations
- Limited to subsets of proteomes, e.g., those containing signal peptide sequences
- Designed for specific species
- Designed for single-localized protein sequences
- Up to 35% of proteins move between different cellular compartments
5Outline
- Introduction
- Methods
- Concept behind the method
- Construction of the knowledge base SPKB
- KnowPredsite: a localization prediction method using SPKB
- Dataset
- Prediction Results
- Conclusions
6Basic Concept Behind the Method
Transitivity Relations
7A New Protein Feature: Similar Peptide
- High Scoring Pair (HSP)
- A significant local pairwise alignment of two proteins found by PSI-BLAST
- Interchangeable Amino Acid Pair
- A pair of amino acids with a positive score in the BLOSUM62 matrix
- Similarity Level
- The number of interchangeable amino acid pairs within a sliding window
- Similar Peptide
- Represents possible sequence variation
- An n-gram peptide fragment from a similar protein
8An Example of High Scoring Pairs (HSP)
MYKKILY
MY KIL
MYSKILL
Window size = 7, pairwise similarity = 5
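The pairwise similarity in this example can be reproduced with a short sketch. It counts interchangeable amino acid pairs between two ungapped windows of equal length. The set `POSITIVE_PAIRS` is a hand-transcribed list of the off-diagonal BLOSUM62 pairs with a positive score and should be treated as an assumption to verify against an authoritative copy of the matrix. The function names are hypothetical and not taken from the authors' implementation.

```python
# Off-diagonal amino acid pairs with a positive BLOSUM62 score.
# Hand-transcribed for illustration (an assumption): verify against an
# authoritative copy of the matrix before relying on it.
POSITIVE_PAIRS = {
    frozenset(pair) for pair in (
        "AS", "RQ", "RK", "ND", "NH", "NS", "DE", "QE", "QK", "EK", "HY",
        "IL", "IM", "IV", "LM", "LV", "MV", "FW", "FY", "ST", "WY",
    )
}


def is_interchangeable(a: str, b: str) -> bool:
    """True if the pair scores positive; identical standard residues always do."""
    return a == b or frozenset((a, b)) in POSITIVE_PAIRS


def pairwise_similarity(window_a: str, window_b: str) -> int:
    """Count interchangeable pairs between two equal-length, ungapped windows."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    return sum(is_interchangeable(a, b) for a, b in zip(window_a, window_b))


print(pairwise_similarity("MYKKILY", "MYSKILL"))  # 5, matching the slide
```

In this example the five counted positions are all identities (M, Y, K, I, L); the K/S and Y/L pairs do not score positive and are not counted.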
9Construction of Similar Peptide Knowledge Base (SPKB)
- Start from a protein with an annotated localization site (here CYT)
- Take an HSP from PSI-BLAST between this protein and a protein from the NCBI NR database
MENIKKE
ME  KK
MEAVKKS
- Similarity level = 4, pairwise similarity = 5
- If pairwise similarity ≥ similarity level, MEAVKKS is a similar peptide and inherits CYT
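The construction step above can be sketched as a sliding window over one ungapped HSP. This is a minimal illustration under stated assumptions, not the authors' implementation: the SPKB is modeled as a plain dictionary from peptide to a set of sites, gapped alignments are not handled, and `add_hsp_to_spkb` and `toy_interchangeable` are hypothetical names. The toy predicate covers only the residue pairs needed for the slide's example and stands in for a full BLOSUM62 lookup.

```python
from typing import Callable, Dict, Set


def add_hsp_to_spkb(
    spkb: Dict[str, Set[str]],
    annotated_seg: str,
    similar_seg: str,
    sites: Set[str],
    interchangeable: Callable[[str, str], bool],
    window: int = 7,
    similarity_level: int = 4,
) -> None:
    """Slide a window over an ungapped HSP and record similar peptides.

    annotated_seg comes from the protein with known sites; similar_seg is
    the aligned segment of the database protein.
    """
    if len(annotated_seg) != len(similar_seg):
        raise ValueError("HSP segments must have the same length")
    for start in range(len(annotated_seg) - window + 1):
        a = annotated_seg[start:start + window]
        b = similar_seg[start:start + window]
        score = sum(interchangeable(x, y) for x, y in zip(a, b))
        if score >= similarity_level:
            # The similar peptide inherits the annotated localization sites.
            spkb.setdefault(b, set()).update(sites)


def toy_interchangeable(x: str, y: str) -> bool:
    # Stand-in predicate (assumption): identities plus the I/V pair only.
    return x == y or {x, y} == {"I", "V"}


spkb: Dict[str, Set[str]] = {}
add_hsp_to_spkb(spkb, "MENIKKE", "MEAVKKS", {"CYT"}, toy_interchangeable)
print(spkb)  # {'MEAVKKS': {'CYT'}}
```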
10A similar peptide entry for protein subcellular
localization
Peptide MYSKILL
SPKB
11KnowPredsite: a localization prediction method using SPKB
12Blast-hit Method
- Serves as a baseline approach
- Use BLAST to find the most similar sequence
- Inherit its localization annotation
- E-value threshold: 10^-3
- If there is no hit, no annotation is inherited
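The baseline can be sketched in a few lines. The sketch assumes that BLAST hits have already been parsed into `(subject_id, e_value)` pairs, that "most similar" means the lowest E-value, and that the threshold is inclusive; none of these details are stated on the slide. The function name and the protein identifiers are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

E_VALUE_CUTOFF = 1e-3  # threshold from the slide


def blast_hit_predict(
    hits: List[Tuple[str, float]],
    annotations: Dict[str, Set[str]],
    cutoff: float = E_VALUE_CUTOFF,
) -> Optional[Set[str]]:
    """Inherit the sites of the best significant hit, or return None."""
    significant = [
        (subject_id, e_value)
        for subject_id, e_value in hits
        if e_value <= cutoff and subject_id in annotations
    ]
    if not significant:
        return None  # no hit, so no annotation is inherited
    best_id, _ = min(significant, key=lambda hit: hit[1])
    return annotations[best_id]


# Hypothetical hits and annotations, for illustration only.
annotations = {"P1": {"CYT"}, "P2": {"NUC"}, "P3": {"MIT"}}
hits = [("P1", 1e-5), ("P2", 1e-30), ("P3", 0.5)]
print(blast_hit_predict(hits, annotations))  # {'NUC'}
```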
13Evaluation
14Outline
- Introduction
- Methods
- Concept behind the method
- Construction of the knowledge base SPKB
- KnowPredsite: a localization prediction method using SPKB
- Dataset
- Prediction Results
- Conclusions
15Dataset
- From the ngLOC method
- 25,887 single-localized proteins
- 2,169 multi-localized proteins
- 1,923 different species
- 10 different subcellular locations
- CYT (cytoplasm)
- CSK (cytoskeleton)
- END (endoplasmic reticulum)
- EXC (extracellular)
- GOL (Golgi apparatus)
- LYS (lysosome)
- MIT (mitochondria)
- NUC (nucleus)
- PLA (plasma membrane)
- POX (peroxisome)
16Outline
- Introduction
- Methods
- Concept behind the method
- Construction of the knowledge base SPKB
- KnowPredsite: a localization prediction method using SPKB
- Dataset
- Prediction Results
- Conclusions
17Finding the Best Similarity Level and Window Size
Leave-one-out cross-validation is performed. Overall accuracy (%) by similarity level, for window sizes 7 (w7) and 8 (w8):
Similarity level  0     1     2     3     4     5     6     7     8
w7                91.2  91.2  91.3  91.4  91.5  91.8  92.0  91.6  -
w8                91.4  91.4  91.4  91.4  91.4  91.5  91.6  91.7  90.9
18Prediction Performance for Single-localized
Proteins
Overall accuracy (%) for single-localized proteins
Method                                         Top 1  Top 2  Top 3  Top 4
KnowPredsite (leave-one-out cross-validation)  92.0   95.7   96.8   98.1
KnowPredsite (ten-fold cross-validation)       91.7   95.4   96.6   97.9
ngLOC                                          88.8   92.2   94.5   96.3
Blast-hit                                      86.0   -      -      -
19Prediction Performance for Multi-localized
Proteins
Overall accuracy (%) for multi-localized proteins
Criterion           Method                        Top 1  Top 2  Top 3  Top 4
At least 1 correct  KnowPredsite (leave-one-out)  90.8   96.4   98.2   98.9
At least 1 correct  KnowPredsite (ten-fold)       90.1   96.1   98.1   98.9
At least 1 correct  ngLOC                         81.9   92.0   96.1   97.4
At least 1 correct  Blast-hit                     78.8   -      -      -
Both correct        KnowPredsite (leave-one-out)  -      74.3   83.3   88.7
Both correct        KnowPredsite (ten-fold)       -      72.1   82.2   87.5
Both correct        ngLOC                         -      59.7   73.8   83.2
Both correct        Blast-hit                     -      45.7   -      -
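The two criteria in this table can be made concrete with a small sketch. It assumes that a prediction is a dictionary of per-site confidence scores and that the top-k sites are the k highest-scoring ones. The function names and the example scores are hypothetical.

```python
from typing import Dict, List, Set


def top_k_sites(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k sites with the highest confidence scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def at_least_one_correct(scores: Dict[str, float], true_sites: Set[str], k: int) -> bool:
    """True if any annotated site appears among the top-k predictions."""
    return bool(set(top_k_sites(scores, k)) & true_sites)


def both_correct(scores: Dict[str, float], true_sites: Set[str], k: int) -> bool:
    """True if every annotated site appears among the top-k predictions."""
    return true_sites <= set(top_k_sites(scores, k))


# Hypothetical scores for a protein annotated with NUC and MIT.
scores = {"CYT": 60.0, "NUC": 30.0, "MIT": 10.0}
truth = {"NUC", "MIT"}
print(at_least_one_correct(scores, truth, 1))  # False
print(at_least_one_correct(scores, truth, 2))  # True
print(both_correct(scores, truth, 2))          # False
print(both_correct(scores, truth, 3))          # True
```

A protein with two annotated sites cannot satisfy the "both correct" criterion with a single prediction, which is why that criterion is only meaningful for k of at least 2.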
20Site-Specific Prediction Performance
Site i  Occurrence in the dataset (%)  Precision (%)  Accuracy_i (%)  MCC_i
CYT     11.1                           75.7           84.4            0.774
CSK     1.0                            81.1           52.0            0.645
END     3.6                            92.9           84.1            0.88
EXC     29.1                           98.5           93.9            0.946
GOL     1.1                            79.1           70.9            0.746
LYS     0.6                            87.2           81.9            0.844
MIT     9.4                            96.7           86.9            0.907
NUC     18.0                           87.3           93.8            0.884
PLA     25.2                           94.4           96.4            0.938
POX     0.8                            87.3           85.1            0.861
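For readers who want to recompute such per-site figures, the sketch below derives precision, a per-site accuracy, and the Matthews correlation coefficient (MCC) from one-vs-rest confusion counts for a single site. The deck's own metric definitions are not preserved in this transcript, so treating Accuracy_i as the fraction of proteins at site i that are predicted correctly (recall) is an assumption. The MCC formula is the standard one, and the counts are invented for illustration.

```python
import math
from typing import Dict


def site_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Per-site metrics from one-vs-rest confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Assumption: Accuracy_i is taken to be recall for site i.
    accuracy_i = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "accuracy_i": accuracy_i, "mcc": mcc}


# Invented counts: 10 proteins at the site, 8 of them recovered.
metrics = site_metrics(tp=8, fp=2, fn=2, tn=88)
print(metrics)
```

MCC takes all four counts into account, so it is less sensitive than precision or recall alone to how rare a site is in the dataset.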
21Multi-localized Confidence Score (MLCS)
- We follow King and Guda's method; for a protein t:
- MLCS(t) = (CS_1 + CS_2) - (CS_1^2 - CS_2^2)/100
- CS_1: highest confidence score among all the sites
- CS_2: second-highest confidence score among all the sites
- Best MLCS threshold for KnowPredsite: 20
- 86.3% of multi-localized proteins have MLCS > 20
- 82.8% of single-localized proteins have MLCS < 20
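The score can be checked numerically with a short sketch. It applies the formula to the two highest confidence scores and compares the result with the threshold of 20. The function names are hypothetical, and the input scores are those listed for EF1A2_RABIT in the case study.

```python
from typing import Dict


def mlcs(scores: Dict[str, float]) -> float:
    """Multi-localized confidence score from the two highest site scores."""
    ranked = sorted(scores.values(), reverse=True)
    cs1, cs2 = ranked[0], ranked[1]
    return (cs1 + cs2) - (cs1 ** 2 - cs2 ** 2) / 100


def looks_multi_localized(scores: Dict[str, float], threshold: float = 20.0) -> bool:
    """Flag a protein as multi-localized when MLCS exceeds the threshold."""
    return mlcs(scores) > threshold


# Confidence scores for EF1A2_RABIT from the case study.
ef1a2_rabit = {
    "CYT": 95.45, "CSK": 0, "END": 0, "EXC": 1.45, "GOL": 0,
    "LYS": 0, "MIT": 0.04, "NUC": 2.97, "PLA": 0.05, "POX": 0,
}
print(round(mlcs(ef1a2_rabit), 2))         # 7.4
print(looks_multi_localized(ef1a2_rabit))  # False
```

For MCA3_MOUSE (CS_1 = 95.46, CS_2 = 93.59) the same formula gives about 185.5, while the case study reports an MLCS of 100. This suggests the reported value is capped at 100, a step this sketch does not apply.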
22Case Study: EF1A2_RABIT
Query        CYT    CSK    END    EXC    GOL    LYS    MIT    NUC    PLA    POX    MLCS
EF1A2_RABIT  95.45  0      0      1.45   0      0      0.04   2.97   0.05   0      7.40
Template     CYT    CSK    END    EXC    GOL    LYS    MIT    NUC    PLA    POX    SI
EF1A2_RAT    0      0      0      0      0      0      0      2.94   0      0      99.78
EF1A_CHICK   2.77   0      0      0      0      0      0      0      0      0      92.22
EF1A1_HUMAN  2.75   0      0      0      0      0      0      0      0      0      92.22
EF1A1_RAT    2.75   0      0      0      0      0      0      0      0      0      92.22
EF1A0_XENLA  2.69   0      0      0      0      0      0      0      0      0      90.06
EF1A_BRARE   2.64   0      0      0      0      0      0      0      0      0      90.06
EF1A2_XENLA  2.64   0      0      0      0      0      0      0      0      0      88.79
EF1A3_XENLA  2.60   0      0      0      0      0      0      0      0      0      88.55
Annotations: Swiss-Prot NUC; Gene Ontology CYT, NUC
23Case Study: MCA3_MOUSE
Query        CYT    CSK    END    EXC    GOL    LYS    MIT    NUC    PLA    POX    MLCS
MCA3_MOUSE   95.46  0.3    0.27   0.36   0.2    0.01   1.13   93.59  1.82   0.22   100
Template     CYT    CSK    END    EXC    GOL    LYS    MIT    NUC    PLA    POX    SI
MCA3_HUMAN   89.16  0      0      0      0      0      0      89.16  0      0      88.51
EF1G1_YEAST  2.74   0      0      0      0      0      0      2.47   0      0      8.67
EF1G2_YEAST  0.49   0      0      0      0      0      0.49   0      0      0      8.50
GSTA_PLEPL   0.35   0      0      0      0      0      0      0      0      0      15.86
SYEC_YEAST   0.16   0      0      0      0      0      0      0      0      0      3.86
CCNA1_MOUSE  0      0.15   0      0      0      0      0      0      0      0      7.36
NU155_RAT    0.14   0      0      0      0      0      0      0.14   0      0      3.17
GCYB2_HUMAN  0.14   0      0      0      0      0      0      0      0      0      4.86
24Outline
- Introduction
- Methods
- Concept behind the method
- Construction of the knowledge base SPKB
- KnowPredsite: a localization prediction method using SPKB
- Dataset
- Prediction Results
- Conclusions
25Conclusions
- We proposed a sequence-based prediction method, KnowPredsite, which
- Predicts single-localized proteins with 92.0% accuracy
- Predicts multi-localized proteins with 74.3% accuracy
- Is suitable for proteome-wide prediction
- 25,887 single-localized proteins
- 2,169 multi-localized proteins
- Provides interpretable prediction results through template proteins
- KnowPredsite can be easily improved by incrementally expanding the SPKB
26 Thank You.
27Multi-localized Confidence Score
- TP: a multi-localized protein with MLCS > 20
- TN: a single-localized protein with MLCS < 20
28Site-Specific Comparison
ngLOC performs better in CYT and LYS