Optimal k-mer superstrings for peptide identification from tandem mass spectra


Title: Optimal k-mer superstrings for peptide identification from tandem mass spectra


1
Optimal k-mer superstrings for peptide
identification from tandem mass spectra
  • Nathan Edwards
  • Center for Bioinformatics and Computational
    Biology

2
Proteomics
  • Study of observed proteins
  • How much of each?
  • What protein is it?
  • Usually a combination of
      • Wet-lab biological sample manipulation
      • Mass spectrometry
      • Data analysis

3
Sample Preparation
4
(Single Stage) Mass Spectrometry
MS
5
Tandem Mass Spectrometry
MS/MS
6
Peptide Fragmentation
Peptide S-G-F-L-E-E-D-E-L-K
7
Peptide Fragmentation
[Figure: MS/MS spectrum; x-axis m/z (0 to 1000), y-axis Intensity (0 to 100)]
8
Peptide Fragmentation
b ions   residue   y ions
    88      S        1166
   145      G        1080
   292      F        1022
   405      L         875
   534      E         762
   663      E         633
   778      D         504
   907      E         389
  1020      L         260
  1166      K         147
[Figure: MS/MS spectrum; x-axis m/z (0 to 1000), y-axis Intensity (0 to 100)]
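The b- and y-ion masses listed for S-G-F-L-E-E-D-E-L-K can be approximated with a short calculation. The sketch below is illustrative only: the function name `fragment_ions`, the monoisotopic residue masses, and the singly charged ion model are assumptions that the slides do not state, and the results agree with the slide's integer values only to within about 1 Da.

```python
# Monoisotopic residue masses (Da); only the residues of this example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass added for a singly charged ion
WATER = 18.01056   # y ions keep the C-terminal OH plus an extra H


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] covers the first i+1 residues; y_ions[i] covers the last i+1.
    The full-length peptide is not included in either list.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("SGFLEEDELK")
```

A real search engine would also consider other ion types, charge states, and modifications; none of that is modeled here.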
9
Peptide Fragmentation
[Figure: the same b-ion / y-ion mass table for S-G-F-L-E-E-D-E-L-K, with the spectrum peaks annotated as b3 to b9 and y2 to y9; x-axis m/z (0 to 1000), y-axis Intensity (0 to 100)]
10
Peptide Identification
  • Given
      • The mass of the parent ion
      • The MS/MS spectrum
  • Output
      • The amino-acid sequence of the peptide

11
Sequence Database Search
  • Compares peptides from a protein sequence
    database with spectra
  • Filter peptide candidates by
      • Parent mass
      • Digest motif
  • Score each peptide against spectrum
      • Generate all possible peptide fragments
      • Match putative fragments with peaks
      • Score and rank
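The "match putative fragments with peaks, then score and rank" step can be illustrated with a deliberately simple shared-peak count. This is a sketch under assumptions, not the scoring used by any particular search engine: the function names, the 0.5 Da tolerance, the observed peak list, and the decoy candidate's fragment masses are all hypothetical.

```python
def shared_peak_count(theoretical, observed, tolerance=0.5):
    """Count theoretical fragment m/z values with an observed peak within tolerance."""
    return sum(
        any(abs(t - o) <= tolerance for o in observed)
        for t in theoretical
    )


def rank_candidates(candidates, observed, tolerance=0.5):
    """Score every candidate peptide and return (score, peptide) pairs, best first.

    candidates maps a peptide string to its list of theoretical fragment m/z values.
    """
    scored = [
        (shared_peak_count(fragments, observed, tolerance), peptide)
        for peptide, fragments in candidates.items()
    ]
    return sorted(scored, reverse=True)


# Theoretical fragments for SGFLEEDELK are the integer values from the slides;
# the second candidate and the observed peaks are made up for illustration.
candidates = {
    "SGFLEEDELK": [88, 145, 292, 405, 534, 663, 778, 907, 1020,
                   147, 260, 389, 504, 633, 762, 875, 1022, 1080],
    "HYPOTHETICAL": [100, 200, 300, 400],
}
observed = [292.1, 405.2, 534.3, 633.3, 762.4, 875.4]
ranking = rank_candidates(candidates, observed)
```

Production scoring functions typically also weigh peak intensity and the chance of random matches, which this count ignores.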

12
Peptide Candidates
  • Parent ion
      • Typically < 3000 Da
  • Tryptic peptides
      • Cut at K or R
  • Search engines
      • Don't handle > 4 well
  • Long peptides don't fragment well
  • Number of distinct 30-mers (N30) upper-bounds total
    peptide content
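As a rough illustration of the two ideas on this slide, the sketch below cuts a protein after every K or R and counts distinct k-mers in a set of sequences. The function names are hypothetical, and the digest is simplified: it ignores the usual exception for cleavage before proline and does not model missed cleavages.

```python
def tryptic_digest(protein):
    """Split a protein after every K or R (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing residues after the last cut site form the final peptide.
        peptides.append(protein[start:])
    return peptides


def count_distinct_kmers(sequences, k):
    """Number of distinct length-k substrings over all sequences (N_k)."""
    return len({s[i:i + k] for s in sequences for i in range(len(s) - k + 1)})


peptides = tryptic_digest("SGFLEEDELKAARGG")
n4 = count_distinct_kmers(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
```

With k = 30 the second function gives the N30 quantity the slide refers to; k = 4 is used here only to keep the example small.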

13
de Bruijn Sequences
  • de Bruijn sequences represent all words of length
    k from some alphabet A.
  • $A = \{0,1\}$, $k = 3$: $s = 0001110100$
  • $A = \{0,1\}$, $k = 4$: $s = 0000111101011001000$
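The two example strings can be checked mechanically: a (linear) de Bruijn sequence contains every length-k word over the alphabet exactly once. The helper name `is_de_bruijn` is an assumption made for this sketch.

```python
from itertools import product


def is_de_bruijn(s, alphabet, k):
    """True if every length-k word over alphabet occurs in s exactly once."""
    words = [s[i:i + k] for i in range(len(s) - k + 1)]
    expected = {"".join(p) for p in product(alphabet, repeat=k)}
    # Same number of windows as words, and the windows cover every word.
    return len(words) == len(expected) and set(words) == expected


ok_k3 = is_de_bruijn("0001110100", "01", 3)
ok_k4 = is_de_bruijn("0000111101011001000", "01", 4)
```

Both strings have length |A|^k + k - 1, which is the shortest a linear string covering all |A|^k words can be.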

14
de Bruijn Graph: $A = \{0,1\}$, $k = 4$
[Figure: de Bruijn graph whose 16 edges are each labeled 0 or 1]
15
de Bruijn Sequences & Graphs
  • de Bruijn graphs (k, A)
  • Edges represent length-k words from A
  • Each node has
      • in-degree |A|
      • out-degree |A|
  • Eulerian tour constructs de Bruijn sequence.
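A minimal sketch of the Eulerian-tour construction follows, assuming nodes are (k-1)-mers and each edge appends one symbol. It uses Hierholzer's algorithm, which is one standard way to find an Eulerian circuit; the slides do not say which method was used, and the function name is hypothetical. It requires k >= 2.

```python
from itertools import product


def de_bruijn_sequence(alphabet, k):
    """Build a linear de Bruijn sequence by walking an Eulerian circuit."""
    nodes = ["".join(p) for p in product(alphabet, repeat=k - 1)]
    # Each node u has one outgoing edge per symbol a, leading to u[1:] + a.
    out_edges = {u: [u[1:] + a for a in reversed(alphabet)] for u in nodes}
    start = alphabet[0] * (k - 1)
    stack, tour = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())   # follow an unused edge
        else:
            tour.append(stack.pop())           # dead end: emit node
    tour.reverse()
    # Spell the tour: first node in full, then one new symbol per edge.
    return tour[0] + "".join(v[-1] for v in tour[1:])


s3 = de_bruijn_sequence("01", 3)
s4 = de_bruijn_sequence("01", 4)
```

Because every node has equal in- and out-degree, the circuit uses each edge once, so each k-mer appears exactly once in the spelled string.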

16
Sequence Database Compression
  • Construct sequence database / superstring that is
      • Complete
          • All 30-mers are present
      • Correct
          • No other 30-mers are present
      • Compact
          • No 30-mer is present more than once

17
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
18
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
19
Sequence Databases & CSBH-graphs
  • Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
20
Sequence Databases & CSBH-graphs
  • Complete
      • All edges are on some path
  • Correct
      • Output path sequence only
  • Compact
      • No edge is used more than once
  • C3 Path Set uses all edges exactly once.

21
Sequence Databases & CSBH-graphs
  • Use each edge exactly once

ACDEFGEFGI, DEFACG
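The three C3 properties can be checked directly on the running example. The sketch below assumes k = 4, a value the slides do not state but for which the rewritten set ACDEFGEFGI, DEFACG is complete, correct, and compact with respect to ACDEFGI, ACDEFACG, DEFGEFGI. Function names are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every length-k substring across all sequences."""
    return Counter(s[i:i + k] for s in sequences for i in range(len(s) - k + 1))


def check_c3(original, compressed, k):
    """Report whether compressed is complete, correct, and compact w.r.t. original."""
    want = set(kmer_counts(original, k))
    got = kmer_counts(compressed, k)
    return {
        "complete": want <= set(got),                   # every original k-mer present
        "correct": set(got) <= want,                    # no new k-mers introduced
        "compact": all(n == 1 for n in got.values()),   # no k-mer repeated
    }


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
result = check_c3(original, ["ACDEFGEFGI", "DEFACG"], 4)
baseline = check_c3(original, original, 4)
```

The original set is trivially complete and correct against itself but not compact, since k-mers such as ACDE occur in more than one sequence.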
22
Size of C3 Path Set for k-mers
  • Each path costs
      • (k-1)-mer + path sequence + EOS
  • Sequence database with p paths
      • $N_k + p\,k$
  • Minimize sequence database size by minimizing
    number of paths
      • subject to C3 constraints
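The size formula follows from the per-path cost: each path spends k-1 characters on its first node, one character per edge, and one end-of-sequence marker, so p paths over N_k edges cost N_k + p k in total. The sketch below checks this on the example path set ACDEFGEFGI, DEFACG, assuming k = 4 (not stated on the slides) and one EOS character per sequence.

```python
def c3_database_size(n_k, p, k):
    """Total characters for p paths covering n_k distinct k-mers, one EOS each."""
    return n_k + p * k


paths = ["ACDEFGEFGI", "DEFACG"]
k = 4
# In a C3 path set every k-mer occurs once, so N_k is just the number of windows.
n_k = sum(len(s) - k + 1 for s in paths)
predicted = c3_database_size(n_k, len(paths), k)
actual = sum(len(s) + 1 for s in paths)   # sequence characters plus one EOS each
```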

23
Best-case scenario
  • If the CSBH-graph admits an Eulerian path
  • Sequence database size
      • $(k-1) + N_k + 1$
  • How many paths are required if the CSBH-graph is
    not Eulerian?

24
Non-Eulerian Components
  • Net degree
      • $b(v) = \mathrm{indeg}(v) - \mathrm{outdeg}(v)$
  • Total degree surplus
      • $B = \sum_{v \,:\, b(v) > 0} b(v)$
  • For each path
      • Start node's net degree: +1
      • End node's net degree: -1
      • Otherwise, no change in net degree
  • To reduce all nodes to net degree 0, must have at
    least B paths.
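Net degrees and the total surplus B are easy to compute from the k-mer set. The sketch below works on the uncompressed graph (nodes are (k-1)-mers, one edge per distinct k-mer) and again assumes k = 4 for the running example ACDEFGI, ACDEFACG, DEFGEFGI; the function names are hypothetical.

```python
from collections import defaultdict


def net_degrees(sequences, k):
    """Return b(v) = in-degree - out-degree for every (k-1)-mer node."""
    kmers = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    b = defaultdict(int)
    for w in kmers:
        b[w[:-1]] -= 1   # edge leaves its (k-1)-prefix
        b[w[1:]] += 1    # edge enters its (k-1)-suffix
    return dict(b)


def degree_surplus(b):
    """Total surplus B: the sum of b(v) over nodes with b(v) > 0."""
    return sum(x for x in b.values() if x > 0)


b = net_degrees(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
B = degree_surplus(b)
```

Here the nodes FGI and ACG have surplus in-edges and ACD and DEF have surplus out-edges, so B = 2, matching the two-path solution ACDEFGEFGI, DEFACG.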

25
Components w/ B(C) = 0
  • Balanced component must have Eulerian tour, so
    require exactly one path.
  • m balanced components

26
Paths Lower Bound
  • The C3 path set must contain at least B + m
    paths.
  • This lower bound is achievable!
  • Just add (B - 1) restart edges to non-Eulerian
    components
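One way to realize the B + m bound in code is sketched below. Instead of adding explicit restart edges, it routes every surplus through a single dummy node and cuts the resulting Eulerian circuit wherever it passes through that node; this is an equivalent formulation chosen here for brevity, not necessarily the construction used in the talk. The function name and the use of Hierholzer's algorithm are assumptions.

```python
from collections import defaultdict


def c3_path_set(sequences, k):
    """Return strings that together contain each distinct k-mer exactly once."""
    kmers = sorted({s[i:i + k] for s in sequences for i in range(len(s) - k + 1)})
    out_edges = defaultdict(list)   # node -> successors along unused edges
    net = defaultdict(int)          # b(v) = in-degree - out-degree
    for w in kmers:
        out_edges[w[:-1]].append(w[1:])
        net[w[:-1]] -= 1
        net[w[1:]] += 1
    X = None                        # dummy node standing in for restarts
    for v, b in list(net.items()):
        if b > 0:
            out_edges[v].extend([X] * b)       # paths may end at v
        elif b < 0:
            out_edges[X].extend([v] * (-b))    # paths may start at v

    def circuit(start):
        # Hierholzer's algorithm: consume unused edges, emit nodes on backtrack.
        stack, tour = [start], []
        while stack:
            u = stack[-1]
            if out_edges[u]:
                stack.append(out_edges[u].pop())
            else:
                tour.append(stack.pop())
        return tour[::-1]

    paths = []
    if out_edges[X]:
        current = []
        for node in circuit(X)[1:]:
            if node is X:           # each return to X closes one path
                paths.append(current)
                current = []
            else:
                current.append(node)
    for u in list(out_edges):       # balanced components: one circuit each
        if u is not X and out_edges[u]:
            paths.append(circuit(u))
    return [p[0] + "".join(n[-1] for n in p[1:]) for p in paths]


example_paths = c3_path_set(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
```

For the running example (k = 4 assumed) this yields two strings; which two depends on edge order, but any result uses each 4-mer exactly once.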

27
Achieving Path Lower Bound
28
k-mer superstrings
  • Solution is optimal, for C3 constraints
  • Polynomial time algorithm in length of original
    sequences
  • General superstring problem
      • Requires completeness only
      • NP-hard [Garey & Johnson '79]
      • Approximable within a factor of 2.5
      • MAX-SNP hard

29
AA Sequence Databases
30
Minimum Size C3 Sequence Database
31
Relative Search Time
[Figure: relative search time for the SP, UP, IPI-H, SP-VS, and UP-VS databases]
32
Constraint Relaxation
  • Why insist on compactness?
  • What about 29-mers?
  • Can we compress still further?
  • Complete, Correct (C2)
  • Use edges more than once, if helpful!
  • How could this possibly help?

33
C2 Superstring
  • Sequence set with p paths: $N_k + p\,k$
  • Costs k to use restart edges
  • Restart edges go from nodes v s.t. $b(v) > 0$ to nodes v
    s.t. $b(v) < 0$
  • Reuse edges instead, provided the path length
    is $< k$
  • Transportation problem!
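The costs that feed such a transportation problem can be sketched as follows: for each surplus node s and deficit node t, a restart costs k, while reusing a path of d existing edges from s to t costs d, so the cheaper option is min(k, d). This reading of the cost structure is an interpretation of the slide, and the function name, graph encoding, and toy graph are hypothetical. Solving the transportation problem itself is not shown.

```python
from collections import deque


def shortcut_costs(out_edges, sources, sinks, k):
    """Cost of linking each source to each sink: path length if < k, else k.

    out_edges maps a node to a list of successor nodes.
    """
    costs = {}
    for s in sources:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if dist[u] >= k - 1:
                continue   # anything farther would cost at least k anyway
            for v in out_edges.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for t in sinks:
            costs[(s, t)] = min(k, dist.get(t, k))
    return costs


# Toy chain a -> b -> c -> d, made up for illustration.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
toy_costs = shortcut_costs(toy_graph, ["a"], ["c", "d"], 3)
```

In the toy graph, reaching c takes two edges, which is cheaper than a restart at k = 3, while reaching d takes three edges, so the restart cost applies.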

34
C2 Superstring
[Figure: node sets $S = \{v : b(v) > 0\}$ and $T = \{v : b(v) < 0\}$; edge label: Cost k]
35
C2 Superstring
[Figure: node sets $S = \{v : b(v) > 0\}$ and $T = \{v : b(v) < 0\}$; edge labels: Cost 0, Cost 0, Cost k]
36
C2 Superstring
[Figure: node sets $S = \{v : b(v) > 0\}$ and $T = \{v : b(v) < 0\}$; shortcut paths labeled 4, 10, and 7]
37
C2 Superstring
[Figure: node sets $S = \{v : b(v) > 0\}$ and $T = \{v : b(v) < 0\}$; edge labels: Cost 0, Cost 0, Cap 1, Cost 0]
38
C2 Superstring
39
Extensions and Further Work
  • Better compression
      • Enumerate tryptic peptides only
      • Relax correctness constraint
  • Other uses of CSBH graphs
      • Compact representation of mer counts
      • Implicit set operations on mers
      • Structural graph properties

40
Thanks
  • Informatics Research @ ABI Celera
      • Ross Lippert, Clark Mobarry, Bjarni Halldorsson
  • UMIACS @ University of Maryland, CP
      • V.S. Subrahmanian, Fritz McCall, Doan Pham

41
Swiss-Prot
42
Swiss-Prot Variant Annotations
43
Swiss-Prot Variant Annotations
44
Swiss-Prot Sequence
45
Swiss-Prot
  • VarSplic enumerates all variants, conflicts,
    isoforms
  • Swiss-Prot sequence size
      • 60 Mb
  • VarSplic sequence size
      • 95 Mb
  • How many more peptide candidates?

46
Swiss-Prot Variant Annotations
Feature viewer
Variants
47
Swiss-Prot VarSplic Output
P13746-00-01-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-01-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-00-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-03-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-03-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-04-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF
P13746-01-04-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF
P13746-00-05-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-05-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-00-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-02-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-02-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF


48
Swiss-Prot VarSplic Output
P13746-00-01-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-01-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-00-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-00-03-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-03-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-04-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-04-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-05-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-05-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-01-00-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-02-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYSQAASSDSAQ
P13746-01-02-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYSQAASSDSAQ


49
Peptide Candidates
  • At most 1 additional peptides in 1.6 times as
    much sequence

50
Mascot Running Time
51
Total Search Time
[Figure: total search time for the UniProt, SwissProt, UniProt-VS, IPI-HUMAN, and SwissProt-VS databases]