Protein Classification Using Proteome Analyst


1
Protein Classification Using Proteome Analyst
  • MITACS Bioinformatics Seminar
  • Duane Szafron, Paul Lu, Russell Greiner, David
    Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner,
    John Anvik, Cam Macdonell, Gary Van Domselaar,
    Chris Upton, Alona Fyshe, David Meeuwis.

2
Outline
  • Mathematical Background
  • Naïve Bayes Classification
  • Protein Classification
  • Proteome Analyst walkthrough
  • Proteome Analyst's Explain Feature

3
Mathematical Background
  • Conditional Probability
  • $P(A \mid B) = P(A, B) / P(B)$
  • Bayes Rule
  • $P(A \mid B) = P(B \mid A)\, P(A) / P(B)$
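
As a small worked instance of Bayes rule, the sketch below plugs in made-up probabilities (the numbers and the function name `bayes_posterior` are illustrative assumptions, not values from the talk) and uses exact fractions so the arithmetic is easy to check by hand.

```python
from fractions import Fraction


def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence_b == 0:
        raise ValueError("P(B) must be non-zero")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical inputs: P(A) = 1/10, P(B|A) = 4/5, P(B) = 1/4.
posterior = bayes_posterior(Fraction(1, 10), Fraction(4, 5), Fraction(1, 4))
print(posterior)  # 8/25
```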

4
Mathematical Background
  • Conditional Independence: $P(A \mid B, C) = P(A \mid C)$
  • C = Class (Hypothesis), F = Feature (Evidence)
  • So
  • $P(C, F_1, \dots, F_n)$
  • $= P(C)\, P(F_1 \mid C)\, P(F_2 \mid C, F_1) \cdots P(F_n \mid C, F_1, \dots, F_{n-1})$ (chain
    rule)
  • $= P(C) \prod_{i=1}^{n} P(F_i \mid C)$
    (assuming conditional independence of
    Features given Class)
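
The step from the chain rule to the product form can be checked numerically. The sketch below builds a tiny joint distribution over one class variable and two binary features from made-up parameters (all numbers and names are assumptions for illustration), then confirms that the chain-rule factor P(F2 | C, F1) equals the naive Bayes factor P(F2 | C) for every setting.

```python
from fractions import Fraction as Fr
from itertools import product

# Hypothetical parameters for two classes and two binary features.
P_C = {"c1": Fr(1, 4), "c2": Fr(3, 4)}
P_F1 = {"c1": Fr(1, 2), "c2": Fr(1, 4)}  # P(F1 = 1 | C)
P_F2 = {"c1": Fr(1, 8), "c2": Fr(1, 2)}  # P(F2 = 1 | C)


def bernoulli(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1 - p_one


# Joint table built from the factored form P(C) P(F1|C) P(F2|C).
joint = {
    (c, f1, f2): P_C[c] * bernoulli(P_F1[c], f1) * bernoulli(P_F2[c], f2)
    for c in P_C
    for f1, f2 in product((0, 1), repeat=2)
}


def p_f2_given_c_f1(c, f1, f2):
    """Chain-rule factor P(F2 | C, F1), read off the joint table."""
    return joint[(c, f1, f2)] / sum(joint[(c, f1, v)] for v in (0, 1))


def p_f2_given_c(c, f2):
    """Naive Bayes factor P(F2 | C), read off the joint table."""
    return sum(joint[(c, v, f2)] for v in (0, 1)) / P_C[c]


# Under conditional independence the two factors coincide everywhere.
factors_match = all(
    p_f2_given_c_f1(c, f1, f2) == p_f2_given_c(c, f2)
    for c in P_C
    for f1, f2 in product((0, 1), repeat=2)
)
print(factors_match)  # True
```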

5
Naïve Bayes
  • Naïve Bayes classifiers
  • Are based on standard Bayesian statistical theory
  • Assume conditionally independent evidence
  • Are explainable

6
Naïve Bayes
  • Classification/Analysis (Common Use)
  • Given a set of evidence without a known class,
    compute the probability of each class
  • $P(C \mid F_1, \dots, F_n)$ for each C
  • Training (Seldom Done)
  • Given a set of training data with known classes,
    we estimate the parameters of the classifier
  • $P(C) \prod_{i=1}^{n} P(F_i \mid C)$
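
A minimal sketch of the training step, estimating P(C) and P(Fi | C) by counting. It treats each feature as present or absent and adds Laplace smoothing; both choices, along with the function name, the class labels, and the toy data, are assumptions made for illustration and are not taken from Proteome Analyst itself.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """Estimate priors P(C) and likelihoods P(feature present | C).

    examples: list of (set_of_features, class_label) pairs.
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        feature_counts[label].update(features)
        vocabulary |= set(features)

    total = len(examples)
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {
        c: {
            f: (feature_counts[c][f] + alpha) / (class_counts[c] + 2 * alpha)
            for f in vocabulary
        }
        for c in class_counts
    }
    return priors, likelihoods


# Toy training data with hypothetical class labels.
examples = [
    ({"F1", "F3"}, "nucleus"),
    ({"F1"}, "nucleus"),
    ({"F6", "F7"}, "cytoplasm"),
    ({"F6"}, "cytoplasm"),
]
priors, likelihoods = train_naive_bayes(examples)
print(priors["nucleus"], likelihoods["nucleus"]["F1"])  # 0.5 0.75
```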

7
Proteome Analyst Process
  • Training Process: Sequences, Classes → PA Tools → Features, Classes → ML → Classifier
  • Analysis Process: Sequences → PA Tools → Features → Classifier → Classes, Explain

10
Analysis Process (Classification)
  • INPUT: sequences
  • Extract Features: sequences → features
  • Classifier: features → classes, explanation
  • OUTPUT: classes, explanation

11
Analysis: INPUT
  • >ABC1_YEAST
  • MVTNMVKLRNLRRLYCSSRLLRTIQNGRSSV
  • >ABP1_YEAST
  • MALEPIDYTTHSREIDAEYLKIVRGSDPD
  • >ACE1_YEAST
  • MVVINGVKYACETCIRGHRAAQCTHTDG
  • .
  • .
  • .

12
Analysis: Feature Extraction
  • sequences → features (Evidence)
  • Sequence → Black Box → Features

13
Feature Extraction


  • Black Box

Protein 1 - F1 F3 F5 F9
Protein 2 - F6 F7
Protein 3 - F2 F4
Protein 4 - F8
Protein 5 - F1 F3 F5 F2 F10 F4 F9
Protein 6 - F3 F5 F6
Protein 7 - F4 F6 F8 F9
Protein 8 - F1 F3 F5
Protein 9 - F1 F3 F5 F9
15
Feature Extraction
  • Black Box (significantly similar proteins)
  • Features (evidence) for the input sequence: F1, F3, F5, F6, F7
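
One way to read the black box is that the input sequence inherits the features of the proteins it is significantly similar to. The sketch below pools features over a list of hits, using the protein-to-feature lists from the Feature Extraction slides. Which proteins were actually flagged as similar is not recoverable from the transcript: Protein 2 and Protein 8 are an assumption, picked only because the union of their features matches the list given for the input sequence. The function name is also hypothetical.

```python
# Protein-to-feature lists as shown on the Feature Extraction slides.
ANNOTATIONS = {
    "Protein 1": ["F1", "F3", "F5", "F9"],
    "Protein 2": ["F6", "F7"],
    "Protein 3": ["F2", "F4"],
    "Protein 4": ["F8"],
    "Protein 5": ["F1", "F3", "F5", "F2", "F10", "F4", "F9"],
    "Protein 6": ["F3", "F5", "F6"],
    "Protein 7": ["F4", "F6", "F8", "F9"],
    "Protein 8": ["F1", "F3", "F5"],
    "Protein 9": ["F1", "F3", "F5", "F9"],
}


def extract_features(similar_hits, annotations):
    """Pool the features of every significantly similar protein."""
    pooled = set()
    for protein in similar_hits:
        pooled.update(annotations.get(protein, ()))
    # Sort by the numeric part so F10 would come after F9.
    return sorted(pooled, key=lambda name: int(name[1:]))


# Assumed hits (not stated in the transcript).
similar_hits = ["Protein 2", "Protein 8"]
features = extract_features(similar_hits, ANNOTATIONS)
print(features)  # ['F1', 'F3', 'F5', 'F6', 'F7']
```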
16
Analysis: Classification
  • features → classes
  • Select the most probable class based on the
    evidence gathered in feature extraction
  • $\max_j P(C_j \mid F_1, \dots, F_n)$
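
The selection step can be sketched as follows: score every class by its prior times the per-feature likelihoods, normalise, and take the maximum. The priors, likelihoods, class names, and the present/absent feature model are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical classifier parameters: P(C) and P(feature present | C).
PRIORS = {"cytoplasm": 0.5, "nucleus": 0.5}
LIKELIHOODS = {
    "cytoplasm": {"F1": 0.8, "F3": 0.6, "F6": 0.1},
    "nucleus": {"F1": 0.2, "F3": 0.3, "F6": 0.7},
}


def classify(observed):
    """Return (most probable class, posterior for every class)."""
    log_scores = {}
    for label, prior in PRIORS.items():
        score = math.log(prior)
        for feature, p_present in LIKELIHOODS[label].items():
            # Absent features contribute 1 - P(feature present | C).
            p = p_present if feature in observed else 1.0 - p_present
            score += math.log(p)
        log_scores[label] = score

    # Normalise in log space to avoid underflow with many features.
    top = max(log_scores.values())
    norm = sum(math.exp(s - top) for s in log_scores.values())
    posteriors = {c: math.exp(s - top) / norm for c, s in log_scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


best, posteriors = classify({"F1", "F3"})
print(best, round(posteriors[best], 2))  # cytoplasm 0.96
```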

17
Training


  • INPUT: sequences, classes
  • Extract Features: sequences → features
  • ML Algorithm: features, classes → Classifier
  • OUTPUT: Classifier

19
Training: INPUT
  • >Cellular processes<YCB9_YEASTgi00140351
  • MESQQLSNYPHISHGSACASVTSKEVHTNQDPLDVSAS
  • >Translation<PEPQ_ECOLIgi00417465
  • MESLASLYKNHIATLQERTRDALARFKLDALLIHSGELFN
  • >Regulatory functions<SWI4_YEASTgi00730855
  • MPFDVLISNQKDNTNHQNITPISKSVLLAPHSNHPVIEIAT
  • .
  • .
  • .
  • classes: the text between > and < in each header
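
A minimal sketch of reading such labelled training input, assuming each header has the form `>class<identifier` and that sequence text may wrap over several lines. The function name `parse_labelled_fasta` is hypothetical, and the sample below reuses two of the truncated sequences from the slide.

```python
def parse_labelled_fasta(text):
    """Parse records whose headers look like '>class<identifier'.

    Returns a list of (class_label, identifier, sequence) tuples.
    """
    records = []
    label = identifier = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the record collected so far.
            if identifier is not None:
                records.append((label, identifier, "".join(chunks)))
            label, _, identifier = line[1:].partition("<")
            label, identifier = label.strip(), identifier.strip()
            chunks = []
        else:
            chunks.append(line)
    if identifier is not None:
        records.append((label, identifier, "".join(chunks)))
    return records


sample = """>Translation<PEPQ_ECOLIgi00417465
MESLASLYKNHIATLQERTR
DALARFKLDALLIHSGELFN
>Regulatory functions<SWI4_YEASTgi00730855
MPFDVLISNQKDNTNHQNITPISKSVLLAPHSNHPVIEIAT
"""
records = parse_labelled_fasta(sample)
print([label for label, _, _ in records])
```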
20
Training: Feature Extraction
  • Same as analysis
  • Sequence → Black Box → Features

21
Training: OUTPUT
  • Final Output
  • Features, classes → Classifier

22
Results: Subcellular localization



23
Proteome Analyst Walkthrough
24
Subcellular Locations: Animal
Golgi, Nucleus, Extracellular, Mitochondrion, Cytoplasm, Plasma Membrane, Lysosome, Peroxisome, Endoplasmic Reticulum, Secreted
http://www.probes.com/handbook/figures/0908.html
25
Step 1: Obtain the translated sequence (FASTA format)
>PEX6_HUMAN|Q13608|Homo sapiens (Human)|cytoplasmic
MALAVLRVLEPFPTETPPLAVLLPPGGPWPAAELGLVL
ALRPAGESPAGPALLVAALEGPDAGTEEQGPGPPQLLVSRALLRLLALGS
GAWVRARAVRRPPALGWALLGTSLGPGLGPRVGPLLVRRGETLPVPGPRV
LETRPALQGLLGPGTRLAVTELRGRARLCPESGDSSRPPPPPVVSSFAVS
GTVRRLQGVLGGTGDSLGVSRSCLRGLGLFQGEWVWVAQARESSNTSQPH
LARVQVLEPRWDLSDRLGPGSGPLGEPLADGLALVPATLAFNLGCDPLEM
GELRIQRYLEGSIAPEDKGSCSLLPGPPFARELHIEIVSSPHYSTNGNYD
GVLYRHFQIPRVVQEGDVLCVPTIGQVEILEGSPEKLPRWREMFFKVKKT
VGEAPDGPASAYLADTTHTSLYMVGSTLSPVPWLPSEESTLWSSLSPPGL
EALVSELCAVLKPRLQPGGALLTGTSSVLLRGPPGCGKTTVVAAACSHLG
LHLLKVPCSSLCAESSGAVETKLQAIFSRARRCRPAVLLLTAVDLLGRDR
DGLGEDARVMAVLRHLLLNEDPLNSCPPLMVVATTSRAQDLPADVQTAFP
HELEVPALSEGQRLSILRALTAHLPLG..
  • Example
  • PEX6, a protein from the Swiss-Prot database
    (accession Q13608)
  • 980 amino acids long
  • Human peroxisome assembly factor
  • Cytoplasmic

26
Step 1 Continued
  • Peroxisome
  • Cell structure that rids the body of toxic
    substances in the liver, kidneys, and brain
  • Zellweger Syndrome
  • Caused by a single amino acid mutation in PEX6
  • Caused by interference in peroxisome
    manufacturing so that few or none are made.
  • Symptoms
  • Enlarged liver
  • High levels of iron and copper in the blood
  • Vision disturbances
  • Facial abnormalities
  • Early death, often within 6 months of onset

27
Step 2: Log in to Proteome Analyst
28
Step 3: Upload the Sequence
29
Step 4: Process the sequence
  • Processed using Proteome Analyst's built-in
    classifiers
  • Subcellular Classifier
  • Trained on PA's Animal dataset
  • Different datasets used for different
    phylogenetic families

30
Step 5: View Results
31
Feature Extraction
  • Features collected from the Black Box
  • repeat, disease mutation, cytoplasmic,
    atp-binding, membrane protein, ipr003959,
    ipr003593, ipr003960, peroxisome
  • InterPro is a database of protein families
  • IPR003593 - AAA ATPase
  • IPR003959 - AAA ATPase, central region
  • IPR003960 - AAA-protein subdomain
  • ATPase - an enzyme that uses ATP as a source of
    energy to drive some other process, e.g.
    peroxisome biogenesis
  • AAA - 'A'TPases 'A'ssociated with diverse
    cellular 'A'ctivities
  • These proteins are proposed to act as
    ATP-dependent protein clamps

32
Classification
33
Explain
  • $P(C) \prod_i P(E_i \mid C)$
  • Factors $P(E_i \mid C)$ are multiplied to get the
    probabilities for each class
  • The Explain graph is on a log scale, so the
    contributions become additive and easily
    visualized
  • $\log\left(P(C) \prod_i P(E_i \mid C)\right)$
  • $= \log P(C) + \sum_i \log P(E_i \mid C)$
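
The additivity that the Explain graph relies on can be illustrated with a short sketch: each factor becomes one log-scale contribution, and the contributions sum to the log of the product. The probabilities and the function name `explain_contributions` are made up for illustration; this is not Proteome Analyst's own code.

```python
import math


def explain_contributions(prior, evidence_likelihoods):
    """Return one log-scale contribution per factor of P(C) * prod P(Ei|C)."""
    contributions = {"prior": math.log(prior)}
    for name, probability in evidence_likelihoods.items():
        contributions[name] = math.log(probability)
    return contributions


# Hypothetical values: P(C) = 0.5, P(E1|C) = 0.8, P(E2|C) = 0.25.
contributions = explain_contributions(0.5, {"E1": 0.8, "E2": 0.25})
log_score = sum(contributions.values())

# The summed bars equal the log of the multiplied factors.
print(math.isclose(log_score, math.log(0.5 * 0.8 * 0.25)))  # True
```

Stacking these per-evidence terms for two classes side by side shows which pieces of evidence favour one class over the other.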

34
Explain
  • PA produces interpretable, transparent results

35
Explain: Champion vs. Contender
36
Questions?
http://www-gap.dcs.st-and.ac.uk/history/PictDisplay/Bayes.html
  • Contact: bioinfo_at_cs.ualberta.ca
  • http://www.cs.ualberta.ca/bioinfo/PA
  • Special Thanks to Thomas Bayes and the PA-Team

37
Classifier Evaluation
  • K-fold Cross Validation
  • E.g. k = 3
  • Confusion Matrix
  • 2 classes: positive, negative
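
Both evaluation ideas fit in a few lines. The sketch below splits item indices into k folds (each fold is held out once) and tallies a two-class confusion matrix. The round-robin fold assignment, the function names, and the toy labels are assumptions of this sketch.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    # Round-robin assignment of item indices to folds.
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = sorted(
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        )
        yield train, test


def confusion_matrix(actual, predicted, positive="positive"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(actual, predicted):
        if truth == positive:
            counts["TP" if guess == positive else "FN"] += 1
        else:
            counts["FP" if guess == positive else "TN"] += 1
    return counts


splits = list(k_fold_splits(6, 3))  # k = 3, as in the slide's example
actual = ["positive", "positive", "negative", "negative"]
predicted = ["positive", "negative", "positive", "negative"]
matrix = confusion_matrix(actual, predicted)
print(splits[0], matrix)
```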

38
Information Content and Wrapping
  • Compute for each Feature Fj
  • $I(F_j) = \sum_i \sum_k P(T_j = V_k, C = C_i) \log \frac{P(T_j = V_k, C = C_i)}{P(T_j = V_k)\, P(C = C_i)}$
  • Order Features by I(Fj)
  • for (int i = 0; i < 100; i = i + 5)
  • Create a classifier with the i% of features with the
    lowest I(F) removed from the training set
  • In practice the best classifiers are created with 60%
    of features removed
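
A rough sketch of the ranking and wrapping loop, under stated assumptions: each feature is treated as a present/absent variable, the information content is the mutual information between that variable and the class (natural log), and the loop only yields the surviving feature sets rather than training and scoring a classifier on each. All names and the toy data are hypothetical.

```python
import math
from collections import Counter


def information_content(examples, feature):
    """Mutual information between presence of `feature` and the class."""
    n = len(examples)
    joint = Counter((feature in feats, label) for feats, label in examples)
    value_counts = Counter(feature in feats for feats, _ in examples)
    class_counts = Counter(label for _, label in examples)
    total = 0.0
    for (value, label), count in joint.items():
        p_joint = count / n
        p_value = value_counts[value] / n
        p_class = class_counts[label] / n
        total += p_joint * math.log(p_joint / (p_value * p_class))
    return total


def wrapper_feature_sets(examples, step=5):
    """Yield (percent_removed, kept_features), dropping lowest I(F) first."""
    features = sorted({f for feats, _ in examples for f in feats})
    ranked = sorted(features, key=lambda f: information_content(examples, f))
    for percent in range(0, 100, step):
        n_removed = len(ranked) * percent // 100
        yield percent, ranked[n_removed:]


# Toy data: F1 determines the class, F2 carries no information about it.
examples = [
    ({"F1", "F2"}, "a"),
    ({"F1"}, "a"),
    ({"F2"}, "b"),
    (set(), "b"),
]
kept_by_percent = dict(wrapper_feature_sets(examples))
print(kept_by_percent[0], kept_by_percent[50])  # ['F2', 'F1'] ['F1']
```

In a full wrapper, each yielded feature set would be used to train a classifier and the best-scoring percentage kept.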