Proteome Analyst - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Proteome Analyst


1
Proteome Analyst
  • Transparent High-throughput Protein Annotation
    Function, Localization and Custom Predictors

2
Proteome Analyst
  • Duane Szafron, Paul Lu, Russell Greiner, David
    Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner,
    John Anvik, Cam Macdonell

3
Proteome Analyst
  • Proteome
  • one of many -omes
  • set of all proteins in an organism
  • Analysis
  • prediction of protein function or localization
    from sequence data

7
Analyze a Protein
  • We have examples of annotated proteins in various
    protein classes.
  • We have more examples of unannotated proteins.
  • What do we do?
  • Find homologues to each protein and assume
    similar function.
  • Find characteristics of each protein that affect
    function.

15
Analyzing Proteins
  • One Protein?
  • Just do it.
  • 5 Proteins?
  • Post-doc familiar with protein classes.
  • 50 Proteins?
  • Grad student.
  • 5000 Proteins?
  • Summer students.

17
Proteome Analyst
  • High-throughput
  • Transparent
  • Prediction of
  • Protein Function
  • Protein Localization
  • Custom Classification

18
Machine Learning Task
  • Training
  • INPUT: sequences, classes
  • OUTPUT: Classifier
  • Analysis
  • INPUT: sequences, Classifier
  • OUTPUT: classes

19
Machine Learning Task
  • Training
  • INPUT: sequences, classes
  • OUTPUT: Classifier
  • Analysis
  • INPUT: sequences, Classifier
  • OUTPUT: classes, explanation

20
Training
  • INPUT
  • sequences, classes
  • PA Tools
  • sequences → features
  • ML Algorithm
  • features, classes → Classifier
  • OUTPUT
  • Classifier
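The training flow on this slide can be read as a composition of two steps: a featurizer that maps each sequence to features, and a learner that maps (features, class) pairs to a classifier. The sketch below shows only that shape. The names `train`, `toy_featurize`, and `toy_learn` are hypothetical stand-ins and are not part of Proteome Analyst.

```python
from typing import Callable, Iterable, List, Set, Tuple, TypeVar

Classifier = TypeVar("Classifier")
Features = Set[str]


def train(
    examples: Iterable[Tuple[str, str]],
    featurize: Callable[[str], Features],
    learn: Callable[[List[Tuple[Features, str]]], Classifier],
) -> Classifier:
    """Map each (sequence, class) pair to (features, class), then fit a classifier."""
    labelled = [(featurize(seq), label) for seq, label in examples]
    return learn(labelled)


# Toy stand-ins so the sketch runs; the slides describe BLAST-based and
# pattern-based tools in this role instead.
def toy_featurize(sequence: str) -> Features:
    """Hypothetical featurizer: the set of residues present in the sequence."""
    return set(sequence)


def toy_learn(labelled: List[Tuple[Features, str]]):
    """Hypothetical learner: remember which features were seen with each class."""
    seen = {}
    for features, label in labelled:
        seen.setdefault(label, set()).update(features)
    return seen


model = train([("MVGS", "A"), ("LLDE", "B")], toy_featurize, toy_learn)
```

Keeping the featurizer and the learner as separate arguments mirrors the slide's point that the two stages are independent of each other.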

22
Training INPUT
  • >class A<Training Seq 1
    MVGSGLLWLALVSCILTQASAVQRGYGN
    PIEASSYGL...
  • >class B<Training Seq 2
    LLDEPFRSTENSAGSQGCDKNMSGWYRF
    VGEGGVRMS...
  • >class B<Training Seq 3
    EVIAYLRDPNCSSILQTEERNWVSVTSP
    VQASACRNI...
  • ...

[Slide labels: classes; protein sequences]
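A minimal sketch of reading this training input, assuming each header line has the form `>class A<Training Seq 1`: a `>`, the class label, a `<`, then the sequence name. That layout is inferred from the transcript and is an assumption. The function name `read_labelled_fasta` is hypothetical, and the sample residues are the truncated lines visible on the slide.

```python
from typing import Iterator, List, Tuple


def read_labelled_fasta(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (class_label, name, sequence) from '>label<name' records."""
    label = name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, name, "".join(chunks)
            # Assumed header layout: '>' + class label + '<' + sequence name.
            label, _, name = line[1:].partition("<")
            label, name, chunks = label.strip(), name.strip(), []
        else:
            # Residue lines are concatenated until the next header.
            chunks.append(line)
    if label is not None:
        yield label, name, "".join(chunks)


records = list(read_labelled_fasta([
    ">class A<Training Seq 1",
    "MVGSGLLWLALVSCILTQASAVQRGYGN",
    "PIEASSYGL",
    ">class B<Training Seq 2",
    "LLDEPFRSTENSAGSQGCDKNMSGWYRF",
]))
```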

24
Training PA Tools
  • sequences → features
  • Homology Tools (BLAST)
  • sequence → homologues
  • homologues → annotations
  • annotations → features

25
Homology Tool
  • sequence → features

[Diagram: sequence + seq DB → BLAST → homologues → retrieve → annotations → parse → features]
26
Homology Tool
  • sequence → features

[Diagram: sequence + seq DB → BLAST → homologues → retrieve → annotations → parse → features]
[Example annotation: DBSOURCE swissprot locus MPPB_NEUCR, ... xrefs (non-sequence databases) ... InterPro IPR001431, ... KEYWORDS Hydrolase Metalloprotease Zinc Mitochondrion Transit peptide Oxidoreductase Electron transport Respiratory chain.]
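The last step of the homology tool, turning a homologue's annotations into features, might look roughly like the sketch below. It assumes the annotation has already been parsed into named fields. The field names, the dictionary layout, and the function `annotation_features` are all hypothetical; only the keyword and InterPro values come from the slide.

```python
import re
from typing import Dict, List, Set


def annotation_features(annotation: Dict[str, List[str]]) -> Set[str]:
    """Turn parsed annotation fields into a flat set of feature tokens."""
    features = set()
    for field, values in annotation.items():
        for value in values:
            # 'Transit peptide' -> 'transit_peptide', so each keyword is one token.
            token = re.sub(r"\s+", "_", value.strip().lower())
            features.add(f"{field}:{token}")
    return features


# Assumed pre-parsed record for one homologue (values taken from the slide).
homologue = {
    "keyword": ["Hydrolase", "Metalloprotease", "Zinc", "Mitochondrion", "Transit peptide"],
    "interpro": ["IPR001431"],
}
features = annotation_features(homologue)
```

Prefixing each token with its field keeps a keyword and a database cross-reference from colliding if they happen to share a name.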
28
Training PA Tools
  • sequences → features
  • Homology Tools (BLAST)
  • sequence → homologues
  • homologues → annotations
  • annotations → features
  • Pattern Tools (Pfam, PROSITE, ...)
  • sequences → motifs
  • motifs → features

29
Pattern Tool
  • sequence → features

[Diagram: sequence + pattern DB → find → patterns → parse → features]
30
Pattern Tool
  • sequence → features

[Diagram: sequence + pattern DB → find → patterns → parse → features]
[Example patterns: Pfam PF00234 tryp_alpha_amyl 1. PROSITE PS00940 GAMMA_THIONIN 1. PROSITE PS00305 11S_SEED_STORAGE 1.]
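As an illustration of the "parse" step, the pattern hits shown on this slide can be split into (database, accession, name, count) tuples with a regular expression. The whitespace-separated layout is taken from the transcript as-is and may differ from the original file format. `pattern_features` and the `database:accession` feature naming are hypothetical.

```python
import re
from typing import List, Tuple

# One hit looks like: "<database> <accession> <name> <count>."
HIT = re.compile(r"(Pfam|PROSITE)\s+(\S+)\s+(\S+)\s+(\d+)\.")


def pattern_features(hits_text: str) -> List[Tuple[str, str, str, int]]:
    """Extract (database, accession, name, count) for every hit in the text."""
    return [(db, acc, name, int(n)) for db, acc, name, n in HIT.findall(hits_text)]


text = (
    "Pfam PF00234 tryp_alpha_amyl 1. PROSITE PS00940 GAMMA_THIONIN 1. "
    "PROSITE PS00305 11S_SEED_STORAGE 1."
)
hits = pattern_features(text)
# One feature token per matched pattern.
features = {f"{db}:{acc}" for db, acc, _, _ in hits}
```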
31
Pattern Tool
  • sequence → features
  • not included in current results

[Diagram: sequence + pattern DB → find → patterns → parse → features]

33
Training ML Algorithm
  • features, classes → Classifier
  • any ML algorithm may be used
  • default: naïve Bayes
  • consistently near-best accuracy
  • (SVM, ANN slightly better)
  • efficient (for high-throughput)
  • easy to interpret
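To make the default learner concrete, here is a minimal sketch of training a Bernoulli naïve Bayes model over sets of feature tokens, with add-alpha (Laplace) smoothing. It illustrates the general technique only; the slides do not say which naïve Bayes variant or smoothing Proteome Analyst uses, and `train_naive_bayes` is a hypothetical name.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def train_naive_bayes(examples: List[Tuple[Set[str], str]], alpha: float = 1.0):
    """Estimate log P(class) and P(feature present | class) with add-alpha smoothing."""
    class_counts = Counter(label for _, label in examples)
    feature_counts: Dict[str, Counter] = defaultdict(Counter)
    vocabulary: Set[str] = set()
    for features, label in examples:
        feature_counts[label].update(features)  # +1 for each feature present
        vocabulary |= features

    total = sum(class_counts.values())
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}
    # Smoothing keeps every probability strictly between 0 and 1.
    p_present = {
        c: {
            f: (feature_counts[c][f] + alpha) / (class_counts[c] + 2 * alpha)
            for f in vocabulary
        }
        for c in class_counts
    }
    return log_prior, p_present


examples = [
    ({"zinc", "hydrolase"}, "enzyme"),
    ({"hydrolase"}, "enzyme"),
    ({"membrane"}, "transport"),
]
log_prior, p_present = train_naive_bayes(examples)
```

Training is a single counting pass over the examples, which is consistent with the slide's efficiency claim, and the resulting per-class feature probabilities can be inspected directly.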

34
Training OUTPUT
  • Classifier

35
Analysis (Classification)
  • INPUT
  • sequences
  • PA Tools
  • sequences → features
  • Classifier
  • features → classes, explanation
  • OUTPUT
  • classes

37
Analysis INPUT
  • >Seq 1
    DTILNINFQCAYPLDMKVSLQAALQPIV
    SSLNVSVDG...
  • >Seq 2
    AVELSVESVLYVGAILEQGDTSRFNLVL
    RNCYATPTE...
  • >Seq 3
    HVEENGQSSESRFSVQMFMFAGHYDLVF
    LHCEIHLCD...
  • ...

[Slide label: protein sequences]

39
Analysis PA Tools
  • sequences → features
  • Homology Tools (BLAST)
  • sequence → homologues
  • homologues → annotations
  • annotations → features
  • Pattern Tools (Pfam, PROSITE, ...)
  • sequences → motifs
  • motifs → features

41
Analysis Classification
  • features → classes
  • naïve Bayes
  • returns probabilities of each class for each
    sequence
  • efficient (for high-throughput)
  • easy to interpret
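A sketch of how a Bernoulli naïve Bayes classifier can return a probability for each class: sum log-probabilities per class, then normalise. The two-class model below is made up for illustration, and `class_probabilities` is a hypothetical name rather than anything from Proteome Analyst.

```python
import math
from typing import Dict, Set


def class_probabilities(
    features: Set[str],
    log_prior: Dict[str, float],
    p_present: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Bernoulli naive Bayes posterior P(class | features), normalised to sum to 1."""
    log_scores = {}
    for label, prior in log_prior.items():
        score = prior
        for feature, p in p_present[label].items():
            # Present features contribute p, absent ones contribute 1 - p.
            score += math.log(p if feature in features else 1.0 - p)
        log_scores[label] = score
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_scores.values())
    unnormalised = {c: math.exp(s - peak) for c, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {c: v / z for c, v in unnormalised.items()}


# Hypothetical model: two classes, two features, equal priors.
log_prior = {"enzyme": math.log(0.5), "transport": math.log(0.5)}
p_present = {
    "enzyme": {"hydrolase": 0.8, "membrane": 0.2},
    "transport": {"hydrolase": 0.2, "membrane": 0.8},
}
probs = class_probabilities({"hydrolase"}, log_prior, p_present)
```

For a sequence whose only feature is `hydrolase`, the unnormalised scores are 0.5 × 0.8 × 0.8 = 0.32 for `enzyme` and 0.5 × 0.2 × 0.2 = 0.02 for `transport`, so `enzyme` receives 0.32 / 0.34 of the probability mass.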

42
Analysis Classification
  • features → classes, explanation
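The transcript does not show what the explanation looks like. One plausible form for a naïve Bayes model, offered here purely as an assumption, is to rank features by how strongly each one favours the predicted class over a rival class, measured as a log-likelihood ratio. The function `explain` and the toy model are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def explain(
    features: Set[str],
    p_present: Dict[str, Dict[str, float]],
    predicted: str,
    rival: str,
) -> List[Tuple[str, float]]:
    """Rank features by their log-odds contribution for `predicted` over `rival`."""
    contributions = []
    for feature, p_pred in p_present[predicted].items():
        p_rival = p_present[rival][feature]
        if feature in features:
            ratio = p_pred / p_rival
        else:
            # An absent feature is evidence too: compare the 'absent' probabilities.
            ratio = (1.0 - p_pred) / (1.0 - p_rival)
        contributions.append((feature, math.log(ratio)))
    # Largest positive contribution first.
    return sorted(contributions, key=lambda item: item[1], reverse=True)


# Hypothetical model; both classes are assumed to share the same feature set.
p_present = {
    "enzyme": {"hydrolase": 0.8, "membrane": 0.4},
    "transport": {"hydrolase": 0.2, "membrane": 0.8},
}
ranking = explain({"hydrolase"}, p_present, predicted="enzyme", rival="transport")
```

Because naïve Bayes scores are sums of per-feature terms, each term can be reported on its own, which is one reason the model is easy to interpret.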

47
Results General Function
  • GeneQuiz classification
  • 5-fold cross-validation accuracy on 14 classes
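For reference, 5-fold cross-validation accuracy can be computed as in the sketch below: split the examples into five folds, train on four, test on the fifth, and average the held-out accuracy. The slides do not say how folds were formed (for example, whether they were stratified by class), so the plain shuffled split here is an assumption. `k_fold_accuracy` and `fit_majority` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[object, str]


def k_fold_accuracy(
    examples: Sequence[Example],
    fit: Callable[[List[Example]], Callable[[object], str]],
    k: int = 5,
    seed: int = 0,
) -> float:
    """Mean held-out accuracy over k folds; `fit` returns a predict function."""
    order = list(range(len(examples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # requires len(examples) >= k
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        training = [examples[i] for i in order if i not in held]
        predict = fit(training)
        correct = sum(predict(examples[i][0]) == examples[i][1] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def fit_majority(training: List[Example]) -> Callable[[object], str]:
    """Toy learner: always predict the most common training label."""
    labels = [label for _, label in training]
    majority = max(set(labels), key=labels.count)
    return lambda _x: majority


data = [(i, "A") for i in range(8)] + [(i, "B") for i in range(2)]
accuracy = k_fold_accuracy(data, fit_majority, k=5)
```

Any learner that returns a predict function can be passed in place of `fit_majority`, which keeps the evaluation loop independent of the classifier.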

49
Results Specific Function
  • K+ Ion Channel Proteins
  • 5-fold cross-validation accuracy on 78 sequences, 4 classes

51
Results Localization
  • Sub-cellular localization prediction
  • 3146 sequences from 10 classes

53
Results
  • Sub-cellular localization prediction
  • 3146 sequences from 10 classes

54
Proteome Analyst
  • High-throughput
  • Transparent
  • Prediction of
  • Protein Function
  • Protein Localization
  • Custom Classification

55
Acknowledgements
  • Student developers
  • Cynthia Luk
  • Samer Nassar
  • Kevin McKee
  • Biologists
  • Warren Gallin
  • Kathy Magor
  • Data
  • Nair and Rost

56
Acknowledgements
  • Funding
  • PENCE - Protein Engineering Network of Centres of
    Excellence
  • NSERC - Natural Sciences and Engineering Research
    Council
  • Sun Microsystems
  • AICML - Alberta Ingenuity Centre for Machine
    Learning

57
Acknowledgements
  • Many -ome jokes
  • my wife, Jen

58
Contact
  • http://www.cs.ualberta.ca/~bioinfo/PA
  • poulin_at_cs.ualberta.ca