Title: Protein Subcellular Localization Prediction of Eukaryotes using a Knowledge-based Approach


1
Protein Subcellular Localization Prediction of
Eukaryotes using a Knowledge-based Approach
  • Hsin-Nan Lin, Ching-Tai Chen, Ting-Yi Sung,
  • Shinn-Ying Ho, Wen-Lian Hsu
  • Bioinformatics Program, TIGP (Taiwan
    International Graduate Program), Academia Sinica,
    Taiwan

2
Outline
  • Introduction
  • Methods
  • Construction of the knowledge base SPKB
  • KnowPredsite: a localization prediction method
    using SPKB
  • Dataset
  • Prediction Results
  • Conclusions

3
Protein Subcellular Localization (PSL)
  • Given a protein, determine its subcellular
    compartment
  • Mitochondria, cytoplasm, nucleus, etc.
  • It is important because it
  • Helps elucidate protein functions
  • Helps identify potential diagnostic and drug targets
  • Wet-lab experiment
  • Time consuming
  • Labor intensive
  • Computational prediction is needed.

4
Existing PSL Predictors
  • Using various features
  • Features extracted from literature or public
    databases
  • Phylogenetic profiling
  • Compartment-specific features
  • Main problem of many predictors
  • They only predict a limited number of locations
  • Limited to subsets of proteomes, e.g., those
    containing signal peptide sequences
  • Designed for specific species
  • Designed for single-localized protein sequences
  • Up to 35% of proteins move between different
    cellular compartments

5
Outline
  • Introduction
  • Methods
  • Concept behind the method
  • Construction of the knowledge base SPKB
  • KnowPredsite: a localization prediction method
    using SPKB
  • Dataset
  • Prediction Results
  • Conclusions

6
Basic Concept Behind the Method
Transitivity Relations
7
A New Protein Feature: Similar Peptide
  • High Scoring Pair (HSP)
  • A significant local pairwise alignment of two
    proteins by PSI-BLAST
  • Interchangeable Amino Acid Pair
  • A pair of amino acids with a positive score in the BLOSUM62 matrix
  • Similarity Level
  • The number of interchangeable amino acid pairs
    within a sliding window
  • Similar Peptide
  • Represents possible sequence variation.
  • An n-gram peptide fragment from a similar protein

8
An Example of High Scoring Pairs (HSP)
  MYKKILY
  MY KIL
  MYSKILL
Window size = 7; pairwise similarity = 5
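
The similarity count in the example above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the set of positive off-diagonal BLOSUM62 pairs was transcribed by hand for this sketch and should be checked against an authoritative copy of the matrix, and the function names are hypothetical.

```python
# Positive off-diagonal BLOSUM62 pairs, transcribed by hand for this sketch
# (an assumption: verify against an authoritative copy of the matrix).
_POSITIVE_PAIRS = {
    "AS", "RQ", "RK", "ND", "NH", "NS", "DE", "QE", "QK", "EK", "HY",
    "IL", "IM", "IV", "LM", "LV", "MV", "FW", "FY", "ST", "WY",
}
INTERCHANGEABLE = {frozenset(pair) for pair in _POSITIVE_PAIRS}


def is_interchangeable(a: str, b: str) -> bool:
    """True if the pair scores positive in BLOSUM62 (identities always do)."""
    return a == b or frozenset((a, b)) in INTERCHANGEABLE


def pairwise_similarity(first: str, second: str) -> int:
    """Count interchangeable pairs between two equal-length, ungapped windows."""
    if len(first) != len(second):
        raise ValueError("windows must have the same length")
    return sum(is_interchangeable(a, b) for a, b in zip(first, second))


print(pairwise_similarity("MYKKILY", "MYSKILL"))  # 5, matching the slide
```

Positions 3 (K/S) and 7 (Y/L) do not score positive, so five of the seven positions count toward the similarity.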
9
Construction of Similar Peptide Knowledge Base
(SPKB)
A protein with annotated localization site: CYT
HSP from PSI-BLAST against a protein from the NCBI NR database:
  MENIKKE   (annotated protein)
  ME +KK
  MEAVKKS   (NR protein)
If pairwise similarity ≥ similarity level, MEAVKKS is a similar peptide and inherits CYT.
Similarity level = 4; pairwise similarity = 5
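
The construction step above can be sketched as a sliding window over an ungapped HSP. This is an illustration under stated assumptions, not the authors' code: the function names and the dictionary layout are hypothetical, and the default similarity function counts identical positions only, as a stand-in for the BLOSUM62-based count described earlier.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


def identity_count(a: str, b: str) -> int:
    """Stand-in similarity (assumption): counts identical positions only."""
    return sum(x == y for x, y in zip(a, b))


def add_similar_peptides(
    spkb: Dict[str, Set[str]],
    annotated_seg: str,
    hit_seg: str,
    sites: Set[str],
    window: int = 7,
    level: int = 4,
    similarity: Callable[[str, str], int] = identity_count,
) -> None:
    """Slide a window over an ungapped HSP and record qualifying hit peptides."""
    for i in range(len(annotated_seg) - window + 1):
        a = annotated_seg[i:i + window]
        h = hit_seg[i:i + window]
        if similarity(a, h) >= level:
            spkb[h] |= sites  # the similar peptide inherits the annotation


spkb: Dict[str, Set[str]] = defaultdict(set)
add_similar_peptides(spkb, "MENIKKE", "MEAVKKS", {"CYT"})
print(dict(spkb))  # {'MEAVKKS': {'CYT'}}
```

Passing a BLOSUM62-aware function as `similarity` would reproduce the pairwise similarity of 5 shown on the slide; with the identity-only stand-in the window scores 4, which still meets the similarity level of 4.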
10
A similar peptide entry for protein subcellular
localization
Peptide MYSKILL
SPKB
11
KnowPredsite: a localization prediction method
using SPKB
12
Blast-hit Method
  • Serves as a baseline approach
  • Use Blast to find the most similar sequence
  • Inherit the localization annotation
  • E-value threshold: 10^-3
  • If there is no hit, no annotation will be
    inherited
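
The baseline above amounts to a filter followed by a best-hit lookup. The sketch below assumes the BLAST hits have already been parsed into tuples, treats "most similar" as "lowest E-value", and uses made-up hit identifiers; none of these details come from the slides.

```python
from typing import List, Optional, Set, Tuple

Hit = Tuple[str, float, Set[str]]  # (hit id, E-value, annotated sites)


def blast_hit_predict(hits: List[Hit], max_evalue: float = 1e-3) -> Optional[Set[str]]:
    """Inherit the annotation of the best significant hit, or None if no hit."""
    significant = [hit for hit in hits if hit[1] <= max_evalue]
    if not significant:
        return None  # no hit: no annotation is inherited
    best = min(significant, key=lambda hit: hit[1])  # assumption: lowest E-value
    return best[2]


# Hypothetical hits for illustration only.
hits = [("hit_a", 2e-5, {"NUC"}), ("hit_b", 1e-40, {"CYT"}), ("hit_c", 0.5, {"MIT"})]
print(blast_hit_predict(hits))                     # {'CYT'}
print(blast_hit_predict([("hit_c", 0.5, {"MIT"})]))  # None
```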

13
Evaluation
14
Outline
  • Introduction
  • Methods
  • Concept behind the method
  • Construction of the knowledge base SPKB
  • KnowPredsite: a localization prediction method
    using SPKB
  • Dataset
  • Prediction Results
  • Conclusions

15
Dataset
  • From ngLOC method
  • 25,887 single-localized proteins
  • 2,169 multi-localized proteins
  • 1,923 different species
  • 10 different subcellular locations
  • CYT (cytoplasm)
  • CSK (cytoskeleton)
  • END (endoplasmic reticulum)
  • EXC (extracellular)
  • GOL (Golgi apparatus)
  • LYS (lysosome)
  • MIT (mitochondria)
  • NUC (nucleus)
  • PLA (plasma membrane)
  • POX (peroxisome)

16
Outline
  • Introduction
  • Methods
  • Concept behind the method
  • Construction of the knowledge base SPKB
  • KnowPredsite: a localization prediction method
    using SPKB
  • Dataset
  • Prediction Results
  • Conclusions

17
Finding the Best Similarity Level and Window Size
Leave-one-out cross validation is performed.
Overall accuracy (%) by similarity level and window size (w):
Similarity level     0     1     2     3     4     5     6     7     8
w = 7             91.2  91.2  91.3  91.4  91.5  91.8  92.0  91.6     -
w = 8             91.4  91.4  91.4  91.4  91.4  91.5  91.6  91.7  90.9
18
Prediction Performance for Single-localized
Proteins
Overall accuracy (%) for single-localized proteins
Method                                         Top 1  Top 2  Top 3  Top 4
KnowPredsite (leave-one-out cross validation)   92.0   95.7   96.8   98.1
KnowPredsite (ten-fold cross validation)        91.7   95.4   96.6   97.9
ngLOC                                           88.8   92.2   94.5   96.3
Blast-hit                                       86.0      -      -      -
19
Prediction Performance for Multi-localized
Proteins
Overall accuracy (%) for multi-localized proteins
Criterion           Method                                         Top 1  Top 2  Top 3  Top 4
At least 1 correct  KnowPredsite (leave-one-out cross validation)   90.8   96.4   98.2   98.9
At least 1 correct  KnowPredsite (ten-fold cross validation)        90.1   96.1   98.1   98.9
At least 1 correct  ngLOC                                           81.9   92.0   96.1   97.4
At least 1 correct  Blast-hit                                       78.8      -      -      -
Both correct        KnowPredsite (leave-one-out cross validation)      -   74.3   83.3   88.7
Both correct        KnowPredsite (ten-fold cross validation)           -   72.1   82.2   87.5
Both correct        ngLOC                                              -   59.7   73.8   83.2
Both correct        Blast-hit                                          -   45.7      -      -
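
The "Top k", "at least 1 correct", and "both correct" criteria can be made concrete with a short sketch. It assumes each protein comes with a ranked list of predicted sites and a set of true sites; the function name and the toy data are hypothetical and not taken from the dataset.

```python
from typing import Sequence, Set, Tuple


def top_k_rates(
    ranked: Sequence[Sequence[str]], truth: Sequence[Set[str]], k: int
) -> Tuple[float, float]:
    """Return (at-least-one-correct %, all-correct %) over the k top-ranked sites."""
    at_least_one = 0
    all_correct = 0
    for sites, true_sites in zip(ranked, truth):
        top = set(sites[:k])
        at_least_one += bool(top & true_sites)  # any true site in the top k
        all_correct += true_sites <= top        # every true site in the top k
    n = len(truth)
    return 100.0 * at_least_one / n, 100.0 * all_correct / n


# Toy data: two proteins, each localized to two sites.
ranked = [["CYT", "NUC", "MIT"], ["PLA", "EXC", "GOL"]]
truth = [{"CYT", "NUC"}, {"EXC", "GOL"}]
for k in (1, 2, 3):
    print(k, top_k_rates(ranked, truth, k))
# 1 (50.0, 0.0)
# 2 (100.0, 50.0)
# 3 (100.0, 100.0)
```

For a single-localized protein the true set has one element, so both rates coincide and reduce to ordinary top-k accuracy.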
20
Site-Specific Prediction Performance
Site  Occurrence in dataset (%)  Precision (%)  Accuracy (%)    MCC
CYT                        11.1           75.7          84.4  0.774
CSK                         1.0           81.1          52.0  0.645
END                         3.6           92.9          84.1  0.88
EXC                        29.1           98.5          93.9  0.946
GOL                         1.1           79.1          70.9  0.746
LYS                         0.6           87.2          81.9  0.844
MIT                         9.4           96.7          86.9  0.907
NUC                        18.0           87.3          93.8  0.884
PLA                        25.2           94.4          96.4  0.938
POX                         0.8           87.3          85.1  0.861
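
The per-site columns can be computed from one-versus-rest confusion counts. The sketch below uses the standard definitions of precision and the Matthews correlation coefficient; treating the site-specific accuracy as the fraction of proteins at that site that are predicted correctly is an assumption, since the slide does not define it, and the counts are made up.

```python
import math
from typing import Tuple


def site_metrics(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (precision %, site accuracy %, MCC) for one site, one-vs-rest."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    # Assumption: site accuracy = correctly predicted / all proteins at the site.
    accuracy = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, accuracy, mcc


# Hypothetical counts for illustration only.
precision, accuracy, mcc = site_metrics(tp=80, fp=20, fn=20, tn=880)
print(precision, accuracy, round(mcc, 3))  # 80.0 80.0 0.778
```

MCC accounts for true negatives, which is why it is reported alongside precision and accuracy for rare sites such as CSK or LYS.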
21
Multi-localized Confidence Score (MLCS)
  • We follow King and Guda's method; for a protein t,
  • MLCS(t) = (CS_1 + CS_2) - (CS_1^2 - CS_2^2)/100
  • CS_1: highest confidence score among all the sites
  • CS_2: second-highest confidence score among all the
    sites
  • Best MLCS threshold for KnowPredsite: 20
  • 86.3% of multi-localized proteins have MLCS > 20
  • 82.8% of single-localized proteins have MLCS < 20
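
As a worked check, the score can be computed by reading the formula as MLCS(t) = (CS1 + CS2) - (CS1^2 - CS2^2)/100, which reproduces the 7.40 reported for EF1A2_RABIT in the case study. The cap at 100 is an assumption: the uncapped value for MCA3_MOUSE exceeds 100, while the slide reports 100. Function names are hypothetical.

```python
from typing import Mapping


def mlcs(scores: Mapping[str, float], cap: float = 100.0) -> float:
    """Multi-localized confidence score from per-site confidence scores."""
    cs1, cs2 = sorted(scores.values(), reverse=True)[:2]
    raw = (cs1 + cs2) - (cs1 ** 2 - cs2 ** 2) / 100.0
    return min(raw, cap)  # assumption: the score is capped at 100


def is_predicted_multi_localized(scores: Mapping[str, float], threshold: float = 20.0) -> bool:
    """Apply the MLCS threshold of 20 given on the slide."""
    return mlcs(scores) > threshold


# Confidence scores copied from the two case-study slides.
ef1a2_rabit = {"CYT": 95.45, "CSK": 0, "END": 0, "EXC": 1.45, "GOL": 0,
               "LYS": 0, "MIT": 0.04, "NUC": 2.97, "PLA": 0.05, "POX": 0}
mca3_mouse = {"CYT": 95.46, "CSK": 0.3, "END": 0.27, "EXC": 0.36, "GOL": 0.2,
              "LYS": 0.01, "MIT": 1.13, "NUC": 93.59, "PLA": 1.82, "POX": 0.22}

print(round(mlcs(ef1a2_rabit), 2))  # 7.4
print(mlcs(mca3_mouse))             # 100.0
```

The score is high only when the two top confidence scores are both large and close to each other, which is the pattern expected for a protein found in two compartments.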

22
Case Study EF1A2_RABIT
Query          CYT    CSK    END    EXC    GOL    LYS    MIT    NUC    PLA    POX   MLCS
EF1A2_RABIT  95.45      0      0   1.45      0      0   0.04   2.97   0.05      0   7.40

Template       CYT    CSK    END    EXC    GOL    LYS    MIT    NUC    PLA    POX     SI
EF1A2_RAT        0      0      0      0      0      0      0   2.94      0      0  99.78
EF1A_CHICK    2.77      0      0      0      0      0      0      0      0      0  92.22
EF1A1_HUMAN   2.75      0      0      0      0      0      0      0      0      0  92.22
EF1A1_RAT     2.75      0      0      0      0      0      0      0      0      0  92.22
EF1A0_XENLA   2.69      0      0      0      0      0      0      0      0      0  90.06
EF1A_BRARE    2.64      0      0      0      0      0      0      0      0      0  90.06
EF1A2_XENLA   2.64      0      0      0      0      0      0      0      0      0  88.79
EF1A3_XENLA   2.60      0      0      0      0      0      0      0      0      0  88.55
Swiss-Prot annotation: NUC. Gene Ontology annotation: CYT, NUC.
23
Case Study MCA3_MOUSE
Query          CYT    CSK    END    EXC    GOL    LYS    MIT    NUC    PLA    POX   MLCS
MCA3_MOUSE   95.46    0.3   0.27   0.36    0.2   0.01   1.13  93.59   1.82   0.22    100

Template       CYT    CSK    END    EXC    GOL    LYS    MIT    NUC    PLA    POX     SI
MCA3_HUMAN   89.16      0      0      0      0      0      0  89.16      0      0  88.51
EF1G1_YEAST   2.74      0      0      0      0      0      0   2.47      0      0   8.67
EF1G2_YEAST   0.49      0      0      0      0      0   0.49      0      0      0   8.50
GSTA_PLEPL    0.35      0      0      0      0      0      0      0      0      0  15.86
SYEC_YEAST    0.16      0      0      0      0      0      0      0      0      0   3.86
CCNA1_MOUSE      0   0.15      0      0      0      0      0      0      0      0   7.36
NU155_RAT     0.14      0      0      0      0      0      0   0.14      0      0   3.17
GCYB2_HUMAN   0.14      0      0      0      0      0      0      0      0      0   4.86
24
Outline
  • Introduction
  • Methods
  • Concept behind the method
  • Construction of the knowledge base SPKB
  • KnowPredsite: a localization prediction method
    using SPKB
  • Dataset
  • Prediction Results
  • Conclusions

25
Conclusions
  • We proposed a sequence-based prediction method,
    KnowPredsite, which
  • Predicts single-localized proteins with 92%
    accuracy
  • Predicts multi-localized proteins with 74.3%
    accuracy
  • Is suitable for proteome-wide prediction
  • 25,887 single-localized proteins
  • 2,169 multi-localized proteins
  • Provides interpretable prediction results through
    template proteins
  • KnowPredsite can be easily improved by
    incrementally expanding the SPKB

26
Thank You.
27
Multi-localized Confidence Score
  • TP: a multi-localized protein with MLCS > 20
  • TN: a single-localized protein with MLCS < 20

28
Site-Specific Comparison
ngLOC performs better in CYT and LYS