Title: Protein Classification Using Proteome Analyst
1 Protein Classification Using Proteome Analyst
- MITACS Bioinformatics Seminar
- Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell, Gary Van Domselaar, Chris Upton, Alona Fyshe, David Meeuwis.
2 Outline
- Mathematical Background
- Naïve Bayes Classification
- Protein Classification
- Proteome Analyst walkthrough
- Proteome Analyst's Explain Feature
3 Mathematical Background
- Conditional Probability
- $P(A \mid B) = P(A, B) / P(B)$
- Bayes Rule
- $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$
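The two identities on this slide can be checked numerically. The following is a minimal sketch; the joint probabilities are made-up illustrative numbers, not data from the talk.

```python
# Hypothetical joint distribution over two binary events A and B.
# All four numbers are illustrative assumptions.
joint = {
    (True, True): 0.2,    # P(A, B)
    (True, False): 0.1,   # P(A, not B)
    (False, True): 0.2,   # P(not A, B)
    (False, False): 0.5,  # P(not A, not B)
}

p_a = sum(p for (a, _), p in joint.items() if a)  # marginal P(A)
p_b = sum(p for (_, b), p in joint.items() if b)  # marginal P(B)

# Conditional probability: P(A|B) = P(A,B) / P(B)
p_a_given_b = joint[(True, True)] / p_b
# Same definition with the roles swapped: P(B|A) = P(A,B) / P(A)
p_b_given_a = joint[(True, True)] / p_a

# Bayes rule: P(A|B) = P(B|A) P(A) / P(B)
bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, bayes)
```

Both routes give the same value for P(A|B), which is the point of Bayes rule: the conditional can be reversed when the marginals are known.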
4 Mathematical Background
- Conditional Independence
- $P(A \mid B, C) = P(A \mid C)$
- C = Class (Hypothesis), F = Feature (Evidence)
- So
- $P(C, F_1, \ldots, F_n)$
- $= P(C)\,P(F_1 \mid C)\,P(F_2 \mid C, F_1) \cdots P(F_n \mid C, F_1, \ldots, F_{n-1})$ (chain rule)
- $= P(C) \prod_{i=1}^{n} P(F_i \mid C)$ (assuming conditional independence of Features given Class)
5 Naïve Bayes
- Naïve Bayes classifiers
- Are based on standard Bayesian statistical theory
- Assume conditionally independent evidence
- Are explainable
6 Naïve Bayes
- Classification/Analysis (Common Use)
- Given a set of evidence without a known class, compute the probability for each class
- $P(C \mid F_1, \ldots, F_n)$ for each C
- Training (Seldom Done)
- Given a set of training data with known classes, we estimate the parameters of the classifier
- $P(C) \prod_{i=1}^{n} P(F_i \mid C)$
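The two uses described on this slide can be sketched in a few lines. This is an illustrative sketch only, not Proteome Analyst's implementation: it assumes binary (present/absent) features, Laplace smoothing with a constant `alpha`, and hypothetical class and feature names.

```python
from collections import Counter, defaultdict

def train(examples, vocabulary, alpha=1.0):
    """Estimate P(C) and P(F_i present | C) from (features, class) pairs.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    class_counts = Counter(c for _, c in examples)
    feature_counts = defaultdict(Counter)
    for features, c in examples:
        feature_counts[c].update(set(features))
    total = len(examples)
    prior = {c: n / total for c, n in class_counts.items()}
    likelihood = {
        c: {f: (feature_counts[c][f] + alpha) / (class_counts[c] + 2 * alpha)
            for f in vocabulary}
        for c in class_counts
    }
    return prior, likelihood

def posterior(features, prior, likelihood):
    """Return P(C | F_1..F_n) for every class via the naive Bayes product."""
    present = set(features)
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for f, p in likelihood[c].items():
            score *= p if f in present else (1.0 - p)
        scores[c] = score
    z = sum(scores.values())  # normalise so the posteriors sum to 1
    return {c: s / z for c, s in scores.items()}

# Hypothetical training data: (feature set, class label).
examples = [
    ({"F1", "F3"}, "nucleus"),
    ({"F1", "F5"}, "nucleus"),
    ({"F6", "F7"}, "cytoplasm"),
    ({"F6"}, "cytoplasm"),
]
vocabulary = {"F1", "F3", "F5", "F6", "F7"}

prior, likelihood = train(examples, vocabulary)
post = posterior({"F1", "F3"}, prior, likelihood)
best = max(post, key=post.get)
print(best, post)
```

Training runs once over labelled data; classification reuses the stored parameters for each new set of evidence.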
7 Proteome Analyst Process
[Diagram]
- Training Process: Sequences, Classes → PA Tools → Features, Classes → ML → Classifier
- Analysis Process: Sequences → PA Tools → Features → Classifier → Classes, Explain
10 Analysis Process (Classification)
- INPUT
- sequences
- Extract Features
- sequences → features
- Classifier
- features → classes, explanation
- OUTPUT
- Classes
- explanation
11 Analysis: INPUT
- >ABC1_YEAST
- MVTNMVKLRNLRRLYCSSRLLRTIQNGRSSV
- >ABP1_YEAST
- MALEPIDYTTHSREIDAEYLKIVRGSDPD
- >ACE1_YEAST
- MVVINGVKYACETCIRGHRAAQCTHTDG
- .
- .
- .
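Input of this kind can be read with a small FASTA parser. The sketch below assumes the standard FASTA convention in which a header line starts with `>` and the following lines hold the sequence; the function name `parse_fasta` is hypothetical, and the sequences are the truncated fragments shown on the slide.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

fasta_text = """>ABC1_YEAST
MVTNMVKLRNLRRLYCSSRLLRTIQNGRSSV
>ABP1_YEAST
MALEPIDYTTHSREIDAEYLKIVRGSDPD
"""

records = list(parse_fasta(fasta_text))
for name, seq in records:
    print(name, len(seq))
```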
12 Analysis: Feature Extraction
- sequences → features (Evidence)
Black Box
13 Feature Extraction
Protein 1 - F1 F3 F5 F9
Protein 2 - F6 F7
Protein 3 - F2 F4
Protein 4 - F8
Protein 5 - F1 F3 F5 F2 F10 F4 F9
Protein 6 - F3 F5 F6
Protein 7 - F4 F6 F8 F9
Protein 8 - F1 F3 F5
Protein 9 - F1 F3 F5 F9
14 Feature Extraction
- Black Box (significantly similar proteins)
15 Feature Extraction
- Black Box (significantly similar proteins)
- Features (evidence) for the input sequence: F1, F3, F5, F6, F7
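One way to read this step is as a union of the features of the significantly similar proteins. The sketch below rests on two assumptions that the slides do not state: that features are combined by set union, and that Proteins 2, 6 and 8 are the similar ones. That choice is hypothetical, picked because its union reproduces the feature list above. The feature sets are copied from the table on the Feature Extraction slides.

```python
# Feature sets of the database proteins, as listed on the slide.
database = {
    "Protein 1": {"F1", "F3", "F5", "F9"},
    "Protein 2": {"F6", "F7"},
    "Protein 3": {"F2", "F4"},
    "Protein 4": {"F8"},
    "Protein 5": {"F1", "F3", "F5", "F2", "F10", "F4", "F9"},
    "Protein 6": {"F3", "F5", "F6"},
    "Protein 7": {"F4", "F6", "F8", "F9"},
    "Protein 8": {"F1", "F3", "F5"},
    "Protein 9": {"F1", "F3", "F5", "F9"},
}

def extract_features(similar_ids, database):
    """Union the feature sets of the proteins judged similar to the query."""
    features = set()
    for protein_id in similar_ids:
        features |= database[protein_id]
    return features

# Assumed similarity hits (hypothetical; not given on the slide).
similar = ["Protein 2", "Protein 6", "Protein 8"]
features = extract_features(similar, database)
ordered = sorted(features, key=lambda f: int(f[1:]))
print(ordered)  # ['F1', 'F3', 'F5', 'F6', 'F7']
```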
16 Analysis: Classification
- features → classes
- Select the most probable class based on the evidence gathered in feature extraction
- $\max_j P(C_j \mid F_1, \ldots, F_n)$
17 Training
- INPUT
- sequences, classes
- Extract Features
- sequences → features
- ML Algorithm
- features, classes → Classifier
- OUTPUT
- Classifier
18 Training: INPUT
- >Cellular processes<YCB9_YEASTgi00140351
- MESQQLSNYPHISHGSACASVTSKEVHTNQDPLDVSAS
- >Translation<PEPQ_ECOLIgi00417465
- MESLASLYKNHIATLQERTRDALARFKLDALLIHSGELFN
- >Regulatory functions<SWI4_YEASTgi00730855
- MPFDVLISNQKDNTNHQNITPISKSVLLAPHSNHPVIEIAT
- .
- .
- .
19 Training: INPUT
- >Cellular processes<YCB9_YEASTgi00140351
- MESQQLSNYPHISHGSACASVTSKEVHTNQDPLDVSAS
- >Translation<PEPQ_ECOLIgi00417465
- MESLASLYKNHIATLQERTRDALARFKLDALLIHSGELFN
- >Regulatory functions<SWI4_YEASTgi00730855
- MPFDVLISNQKDNTNHQNITPISKSVLLAPHSNHPVIEIAT
- .
- .
- .
- Callout: classes (the label between > and < in each header)
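Separating the class label from the rest of a training header can be sketched with a regular expression. This assumes each header has the form `>class label<identifier`; the pattern and the function name `split_header` are hypothetical, and the identifier is kept exactly as it appears in the slide text.

```python
import re

# Assumed header layout: ">" class label "<" identifier.
HEADER = re.compile(r"^>(?P<label>[^<]+)<\s*(?P<ident>.+)$")

def split_header(line):
    """Return (class label, identifier) from one training header line."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    return match.group("label").strip(), match.group("ident").strip()

label, ident = split_header(">Translation<PEPQ_ECOLIgi00417465")
print(label, ident)
```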
20 Training: Feature Extraction
Black Box
21 Training: OUTPUT
- Final Output
- Features, classes → Classifier
22 Results: Subcellular localization
23 Proteome Analyst Walkthrough
24 Subcellular Locations: Animal
[Figure labels] Golgi, Nucleus, Extracellular, Mitochondrion, Cytoplasm, Plasma Membrane, Lysosome, Peroxisome, Endoplasmic Reticulum, Secreted
http://www.probes.com/handbook/figures/0908.html
25 Step 1: Obtain the translated sequence (FASTA format)
>PEX6_HUMAN Q13608 Homo sapiens (Human) cytoplasmic
MALAVLRVLEPFPTETPPLAVLLPPGGPWPAAELGLVL
ALRPAGESPAGPALLVAALEGPDAGTEEQGPGPPQLLVSRALLRLLALGS
GAWVRARAVRRPPALGWALLGTSLGPGLGPRVGPLLVRRGETLPVPGPRV
LETRPALQGLLGPGTRLAVTELRGRARLCPESGDSSRPPPPPVVSSFAVS
GTVRRLQGVLGGTGDSLGVSRSCLRGLGLFQGEWVWVAQARESSNTSQPH
LARVQVLEPRWDLSDRLGPGSGPLGEPLADGLALVPATLAFNLGCDPLEM
GELRIQRYLEGSIAPEDKGSCSLLPGPPFARELHIEIVSSPHYSTNGNYD
GVLYRHFQIPRVVQEGDVLCVPTIGQVEILEGSPEKLPRWREMFFKVKKT
VGEAPDGPASAYLADTTHTSLYMVGSTLSPVPWLPSEESTLWSSLSPPGL
EALVSELCAVLKPRLQPGGALLTGTSSVLLRGPPGCGKTTVVAAACSHLG
LHLLKVPCSSLCAESSGAVETKLQAIFSRARRCRPAVLLLTAVDLLGRDR
DGLGEDARVMAVLRHLLLNEDPLNSCPPLMVVATTSRAQDLPADVQTAFP
HELEVPALSEGQRLSILRALTAHLPLG..
- Example
- PEX6, a protein from the Swiss-Prot database (accession Q13608)
- 980 amino acids long
- Human peroxisome assembly factor
- Cytoplasmic
26 Step 1: Continued
- Peroxisome
- Cell structure that rids the body of toxic substances in the liver, kidneys, and brain
- Zellweger's Syndrome
- Can be caused by a single amino acid mutation in PEX6
- Caused by interference in peroxisome manufacturing so that few or none are made
- Symptoms
- Enlarged liver
- High levels of iron and copper in the blood
- Vision disturbances
- Facial abnormalities
- Early death, often within 6 months of onset
27 Step 2: Log in to Proteome Analyst
28 Step 3: Upload the Sequence
29 Step 4: Process the sequence
- Processed using Proteome Analyst's built-in classifiers
- Subcellular Classifier
- Trained on PA's Animal dataset
- Different datasets used for different phylogenetic families
30 Step 5: View Results
31 Feature Extraction
- Features collected from the Black Box
- repeat, disease mutation, cytoplasmic, atp-binding, membrane protein, ipr003959, ipr003593, ipr003960, peroxisome
- InterPro is a database of protein families.
- IPR003593 - AAA ATPase
- IPR003959 - AAA ATPase, central region
- IPR003960 - AAA-protein subdomain
- ATPase - an enzyme that uses ATP as a source of energy to drive some other process, e.g. peroxisome biogenesis
- AAA - 'A'TPases 'A'ssociated with diverse cellular 'A'ctivities
- It has been proposed that these proteins act as ATP-dependent protein clamps
32 Classification
33 Explain
- $P(C) \prod_i P(E_i \mid C)$
- Factors $P(E_i \mid C)$ are multiplied to get the probabilities for each class
- Explain graph is on a log scale so the contributions become additive and easily visualized
- $\log\bigl(P(C) \prod_i P(E_i \mid C)\bigr)$
- $= \log P(C) + \sum_i \log P(E_i \mid C)$
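The additive decomposition behind the Explain graph can be illustrated for a single class. This is a sketch under assumptions: the prior and the per-feature probabilities are hypothetical numbers, and `explain` is an illustrative helper, not part of Proteome Analyst.

```python
import math

def explain(prior, likelihoods):
    """Return (log prior, per-feature log terms, their sum) for one class."""
    log_prior = math.log(prior)
    terms = {e: math.log(p) for e, p in likelihoods.items()}
    total = log_prior + sum(terms.values())
    return log_prior, terms, total

# Hypothetical P(C) and P(E_i | C) values for one class.
prior = 0.1
likelihoods = {"peroxisome": 0.8, "atp-binding": 0.3, "repeat": 0.05}

log_prior, terms, total = explain(prior, likelihoods)

# The sum of log terms equals the log of the original product.
product = prior * math.prod(likelihoods.values())
print(total, math.log(product))
for feature, contribution in terms.items():
    print(feature, round(contribution, 3))
```

Because each feature contributes one additive term, the terms can be drawn as stacked bar segments and compared across classes.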
34 Explain
- PA produces interpretable, transparent results
35 Explain: Champion vs. Contender
36 Questions?
http://www-gap.dcs.st-and.ac.uk/history/PictDisplay/Bayes.html
- Contact: bioinfo_at_cs.ualberta.ca
- http://www.cs.ualberta.ca/bioinfo/PA
- Special Thanks to Thomas Bayes and the PA-Team
37 Classifier Evaluation
- K-fold Cross Validation
- E.g. k = 3
- Confusion Matrix
- 2 classes: positive, negative
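Both evaluation tools can be sketched briefly. The fold assignment (every k-th example goes to the same fold) and the example labels are assumptions of this sketch; the slides do not say how Proteome Analyst builds its folds.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; yield (train, test) index lists."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def confusion_matrix(actual, predicted):
    """Count TP, FN, FP, TN for the labels 'positive' and 'negative'."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == "positive":
            counts["TP" if p == "positive" else "FN"] += 1
        else:
            counts["FP" if p == "positive" else "TN"] += 1
    return counts

# k = 3 over nine examples: each example is tested exactly once.
splits = list(k_fold_indices(9, 3))

# Hypothetical true and predicted labels for one test fold.
actual = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]
cm = confusion_matrix(actual, predicted)
print(splits[0], cm)
```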
38 Information Content and Wrapping
- Compute for each Feature Fj
- $I(F_j) = \sum_i \sum_k P(T_j = V_k, C = C_i) \log \frac{P(T_j = V_k, C = C_i)}{P(T_j = V_k)\,P(C = C_i)}$
- Order Features by I(Fj)
- for (int i = 0; i < 100; i = i + 5)
- Create classifier with the i% of features with the lowest I(F) removed from the training set
- In practice, the best classifiers are created with 60% of features removed
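The ranking and the removal loop can be sketched as follows. This assumes binary features (present or absent), natural logarithms, and hypothetical training examples; training and scoring a classifier on each subset is left out, so the loop only yields the feature subsets a wrapper would evaluate.

```python
import math
from collections import Counter

def information_content(examples, feature):
    """Sum over values v and classes c of P(v, c) log(P(v, c) / (P(v) P(c)))."""
    n = len(examples)
    joint = Counter((feature in feats, c) for feats, c in examples)
    value_counts = Counter(feature in feats for feats, _ in examples)
    class_counts = Counter(c for _, c in examples)
    total = 0.0
    for (v, c), count in joint.items():  # zero-count cells contribute nothing
        p_vc = count / n
        total += p_vc * math.log(p_vc / ((value_counts[v] / n) * (class_counts[c] / n)))
    return total

def feature_subsets(examples, vocabulary, step=5):
    """Yield (percent removed, kept features), dropping the lowest-ranked first."""
    ranked = sorted(vocabulary, key=lambda f: information_content(examples, f))
    for percent in range(0, 100, step):
        drop = len(ranked) * percent // 100
        yield percent, ranked[drop:]

# Hypothetical examples: F1 separates the classes, F2 carries no information.
examples = [
    ({"F1", "F2"}, "a"),
    ({"F1"}, "a"),
    ({"F2"}, "b"),
    (set(), "b"),
]

info_f1 = information_content(examples, "F1")
info_f2 = information_content(examples, "F2")
subsets = dict(feature_subsets(examples, ["F1", "F2"], step=50))
print(info_f1, info_f2, subsets)
```

In a full wrapper, each yielded subset would be used to train a classifier and the subset with the best cross-validated accuracy would be kept.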