Title: Optimal k-mer superstrings for peptide identification from tandem mass spectra
1Optimal k-mer superstrings for peptide
identification from tandem mass spectra
- Nathan Edwards
- Center for Bioinformatics and Computational
Biology
2Proteomics
- Study of observed proteins
- How much of each?
- What protein is it?
- Usually a combination of
- Wet-lab biological sample manipulation
- Mass spectrometry
- Data analysis
3Sample Preparation
4(Single Stage) Mass Spectrometry
MS
5Tandem Mass Spectrometry
MS/MS
6Peptide Fragmentation
Peptide S-G-F-L-E-E-D-E-L-K
7Peptide Fragmentation
[Figure: MS/MS spectrum, intensity (0-100) vs. m/z (250-1000)]
8Peptide Fragmentation
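Residue  b ions  y ions
S          88    1166
G         145    1080
F         292    1022
L         405     875
E         534     762
E         663     633
D         778     504
E         907     389
L        1020     260
K        1166     147
(b ions: mass of the prefix ending at the residue; y ions: mass of the suffix starting at the residue)
[Figure: MS/MS spectrum, intensity (0-100) vs. m/z (250-1000)]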
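The b and y ion ladder for S-G-F-L-E-E-D-E-L-K can be recomputed with a few lines of Python. This is a minimal sketch that assumes nominal (integer) residue masses, singly charged ions, and the hypothetical helper name `fragment_ladder`; none of these come from the slides. Under these assumptions the values agree with the slide except for y9, which comes out as 1079 rather than 1080, presumably a rounding difference.

```python
# Nominal (integer) residue masses in Da: an assumption made for this sketch.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
PROTON = 1   # nominal mass of the charge-carrying proton
WATER = 18   # nominal mass of water, carried by y ions and the intact peptide


def fragment_ladder(peptide):
    """Return (b_ions, y_ions): singly charged nominal m/z values.

    b_ions[i] is the prefix of length i+1, y_ions[i] the suffix of length i+1.
    """
    masses = [NOMINAL_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    total = 0
    for m in masses[:-1]:            # prefixes of length 1 .. n-1
        total += m
        b_ions.append(total + PROTON)
    total = 0
    for m in reversed(masses[1:]):   # suffixes of length 1 .. n-1
        total += m
        y_ions.append(total + WATER + PROTON)
    return b_ions, y_ions


def parent_mass(peptide):
    """Singly protonated mass of the intact peptide."""
    return sum(NOMINAL_MASS[aa] for aa in peptide) + WATER + PROTON


b, y = fragment_ladder("SGFLEEDELK")
print("b ions:", b)
print("y ions:", y)
print("parent:", parent_mass("SGFLEEDELK"))
```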
9Peptide Fragmentation
[Figure: the b and y ion ladder of the previous slide over the annotated spectrum; labeled peaks are b3 to b9 and y2 to y9; intensity (0-100) vs. m/z (250-1000)]
10Peptide Identification
- Given
- The mass of the parent ion
- The MS/MS spectrum
- Output
- The amino-acid sequence of the peptide
11Sequence Database Search
- Compares peptides from a protein sequence database with spectra
- Filter peptide candidates by
- Parent mass
- Digest motif
- Score each peptide against spectrum
- Generate all possible peptide fragments
- Match putative fragments with peaks
- Score and rank
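The search loop above can be sketched as follows. This is an illustration only, not how any particular search engine works: it assumes nominal integer masses, a cut after every K or R, a shared-peak-count score, and hypothetical function names and example proteins.

```python
# Nominal residue masses in Da (assumption; real engines use exact masses).
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
WATER, PROTON = 18, 1


def tryptic_peptides(protein):
    """Digest motif: cut after every K or R (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def parent_mass(peptide):
    return sum(NOMINAL_MASS[aa] for aa in peptide) + WATER + PROTON


def fragment_masses(peptide):
    """All putative singly charged b and y fragment masses."""
    res = [NOMINAL_MASS[aa] for aa in peptide]
    frags = set()
    for i in range(1, len(res)):
        frags.add(sum(res[:i]) + PROTON)            # b ion
        frags.add(sum(res[i:]) + WATER + PROTON)    # y ion
    return frags


def search(proteins, observed_parent, peaks, tol=1):
    """Filter candidates by parent mass, then score and rank them."""
    hits = []
    for protein in proteins:
        for pep in tryptic_peptides(protein):
            if abs(parent_mass(pep) - observed_parent) > tol:
                continue
            score = sum(
                1 for f in fragment_masses(pep)
                if any(abs(f - p) <= tol for p in peaks)
            )
            hits.append((score, pep))
    return sorted(hits, reverse=True)


# Hypothetical two-protein database and a partial spectrum of SGFLEEDELK.
proteins = ["MAGRSGFLEEDELKTTPR", "SGFLDEEELKAAR"]
peaks = [292, 405, 534, 663, 778, 504, 633, 762, 875]
hits = search(proteins, observed_parent=1166, peaks=peaks)
print(hits)
```

Both candidates pass the parent-mass filter because they have the same composition; only the fragment matches separate them.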
12Peptide Candidates
- Parent ion
- Typically < 3000 Da
- Tryptic Peptides
- Cut at K or R
- Search engines
- Don't handle charge states above +4 well
- Long peptides don't fragment well
- Number of distinct 30-mers ($N_{30}$) upper-bounds total peptide content
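Counting distinct k-mers is straightforward to sketch. The function name is hypothetical, and the toy input uses k = 4 on three short sequences that appear later in this deck; the real databases would use k = 30.

```python
def distinct_kmers(sequences, k):
    """Set of all distinct length-k substrings over a collection of sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
n_k = len(distinct_kmers(seqs, 4))
total_positions = sum(len(s) - 4 + 1 for s in seqs)
print(n_k, "distinct 4-mers out of", total_positions, "4-mer positions")
```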
13de Bruijn Sequences
- de Bruijn sequences represent all words of length k from some alphabet A.
- A = {0,1}, k = 3: s = 0001110100
- A = {0,1}, k = 4: s = 0000111101011001000
14de Bruijn Graph, A = {0,1}, k = 4
[Figure: de Bruijn graph for A = {0,1}, k = 4, with its 16 edges labeled 0 or 1]
15de Bruijn Sequences & Graphs
- de Bruijn graphs (k,A)
- Edges represent length k words from A
- Each node has
- in-degree |A|
- out-degree |A|
- Eulerian tour constructs de Bruijn sequence.
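A minimal sketch of the construction, assuming nodes are the (k-1)-mers and using an iterative form of Hierholzer's algorithm for the Eulerian tour. The function name is hypothetical, and the particular sequence produced depends on the order in which edges are visited, so it need not equal the examples on the earlier slide.

```python
from itertools import product


def de_bruijn_sequence(alphabet, k):
    """Linear string containing every length-k word over `alphabet` exactly once."""
    # Nodes are (k-1)-mers; each node has one unused outgoing edge per symbol.
    out = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    start = alphabet[0] * (k - 1)
    stack, tour = [(start, "")], []
    while stack:                          # iterative Hierholzer
        node, _ = stack[-1]
        if out[node]:
            sym = out[node].pop()         # follow an unused edge
            stack.append(((node + sym)[1:], sym))
        else:
            tour.append(stack.pop()[1])   # dead end: emit the edge label
    tour.reverse()
    return start + "".join(tour)


s3 = de_bruijn_sequence("01", 3)
s4 = de_bruijn_sequence("01", 4)
print(s3, len(s3))
print(s4, len(s4))
```

The result has length |A|^k + k - 1: one symbol per edge plus the k - 1 symbols of the start node.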
16Sequence Database Compression
- Construct sequence database / superstring that is
- Complete
- All 30-mers are present
- Correct
- No other 30-mers are present
- Compact
- No 30-mer is present more than once
17SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
18Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
19Sequence Databases & CSBH-graphs
- Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
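To make the path correspondence concrete, here is a small sketch of the uncompressed graph, with (k-1)-mers as nodes and k-mers as edges. It assumes k = 4 for the three example sequences, and the function names are hypothetical. The compressed graph, which would merge non-branching chains, is not built here.

```python
from collections import defaultdict


def sbh_graph(sequences, k):
    """Map each (k-1)-mer node to the set of nodes reachable by one k-mer edge."""
    graph = defaultdict(set)
    for s in sequences:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def is_path(graph, s, k):
    """True if every k-mer of `s` is an edge, i.e. `s` spells a path in the graph."""
    return all(
        s[i + 1:i + k] in graph.get(s[i:i + k - 1], ())
        for i in range(len(s) - k + 1)
    )


seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
g = sbh_graph(seqs, 4)
print(sum(len(targets) for targets in g.values()), "edges")
print([is_path(g, s, 4) for s in seqs])
```

Strings that were not in the input can also spell paths (for example ACDEFGEFGI), which is what makes a more compact path set possible.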
20Sequence Databases & CSBH-graphs
- Complete
- All edges are on some path
- Correct
- Output path sequence only
- Compact
- No edge is used more than once
- C3 Path Set uses all edges exactly once.
21Sequence Databases & CSBH-graphs
- Use each edge exactly once
ACDEFGEFGI, DEFACG
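The three C3 properties can be checked directly on k-mer counts. A minimal sketch, assuming k = 4 for this example and a hypothetical `check_c3` helper:

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every length-k substring over a collection of sequences."""
    return Counter(s[i:i + k] for s in sequences for i in range(len(s) - k + 1))


def check_c3(original, database, k):
    """Report whether `database` is complete, correct and compact w.r.t. `original`."""
    want = set(kmer_counts(original, k))
    have = kmer_counts(database, k)
    return {
        "complete": want <= set(have),                    # all k-mers present
        "correct": set(have) <= want,                     # no other k-mers present
        "compact": all(n == 1 for n in have.values()),    # none present twice
    }


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
path_set = ["ACDEFGEFGI", "DEFACG"]
print(check_c3(original, path_set, 4))
print(check_c3(original, original, 4))
print(sum(map(len, original)), "->", sum(map(len, path_set)), "characters")
```

The original sequences are complete and correct but not compact, since k-mers such as ACDE occur in more than one of them.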
22Size of C3 Path Set for k-mers
- Each path costs
- (k-1)-mer + path sequence + EOS
- Sequence database with p paths
- $N_k + p\,k$
- Minimize sequence database size by minimizing number of paths
- subject to C3 constraints
23Best case scenario
- if CSBH-graph admits an Eulerian path.
- Sequence database size
- $(k-1) + N_k + 1$
- How many paths are required if the CSBH-graph is
not Eulerian?
24Non-Eulerian Components
- Net degree
- b(v) = (# in-edges) - (# out-edges)
- Total degree surplus
- $B = \sum_{v:\, b(v) > 0} b(v)$
- For each path
- Start node's net degree: +1
- End node's net degree: -1
- Otherwise, net degree: no change
- To reduce all nodes to net degree 0, must have at
least B paths.
25Components w/ B(C) = 0
- Balanced component must have Eulerian tour, so require exactly one path.
- m balanced components
26# Paths Lower Bound
- The C3 path set must contain at least B + m paths.
- This lower bound is achievable!
- Just add (B(C) - 1) restart edges to each non-Eulerian component C
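The bound B + m can be computed from the k-mers alone. The sketch below is an illustration under assumptions: it works on the uncompressed graph (which has the same net degrees and components), finds weakly connected components with a small union-find, and uses a hypothetical function name.

```python
from collections import defaultdict


def path_lower_bound(sequences, k):
    """Return (B, m, N_k): total surplus, balanced components, distinct k-mers."""
    kmers = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    net = defaultdict(int)       # b(v) = in-edges - out-edges
    parent = {}                  # union-find over nodes

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        net[u] -= 1
        net[v] += 1
        parent[find(u)] = find(v)           # u and v share a component

    surplus = defaultdict(int)              # B(C) for each component C
    for v, b in net.items():
        surplus[find(v)] += max(b, 0)
    B = sum(surplus.values())
    m = sum(1 for s in surplus.values() if s == 0)
    return B, m, len(kmers)


k = 4
B, m, n_k = path_lower_bound(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k)
p = B + m
print("paths needed:", p)
print("database size N_k + p*k:", n_k + p * k)
```

For the running example this gives two paths, matching the path set ACDEFGEFGI, DEFACG shown earlier, and a size of 18 characters when each path is charged its end-of-sequence marker.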
27Achieving Path Lower Bound
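One way to realize the bound is sketched below: add B(C) - 1 restart edges to each unbalanced component, walk an Eulerian path, and cut it at the restart edges. This is a sketch under assumptions rather than the author's implementation. It runs on the uncompressed graph, pairs surplus and deficit nodes arbitrarily, assumes the `|` marker does not occur in the alphabet, and uses a hypothetical function name.

```python
from collections import defaultdict

RESTART = "|"   # label of an artificial restart edge (assumed not in the alphabet)


def c3_path_set(sequences, k):
    """Return a minimum-size set of strings using every k-mer exactly once."""
    kmers = sorted({s[i:i + k] for s in sequences for i in range(len(s) - k + 1)})
    out = defaultdict(list)      # node -> [(next node, edge label)]
    net = defaultdict(int)       # b(v) = in-edges - out-edges
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        out[u].append((v, kmer[-1]))
        net[u] -= 1
        net[v] += 1
        parent[find(u)] = find(v)

    components = defaultdict(list)
    for v in net:
        components[find(v)].append(v)

    paths = []
    for nodes in components.values():
        ends = [v for v in nodes for _ in range(max(net[v], 0))]
        starts = [v for v in nodes for _ in range(max(-net[v], 0))]
        # B(C) - 1 restart edges leave exactly one unmatched start and one end.
        for end, start in zip(ends[:-1], starts[1:]):
            out[end].append((start, RESTART))
        first = starts[0] if starts else nodes[0]
        stack, walk = [(first, None)], []
        while stack:                         # iterative Hierholzer
            node, _ = stack[-1]
            if out[node]:
                stack.append(out[node].pop())
            else:
                walk.append(stack.pop())
        walk.reverse()
        seq = first
        for node, label in walk[1:]:
            if label == RESTART:             # cut the Eulerian path here
                paths.append(seq)
                seq = node
            else:
                seq += label
        paths.append(seq)
    return paths


paths = c3_path_set(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
print(paths)
```

Every step is linear in the number of k-mers apart from the initial sort, which is only there to make the output deterministic.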
28k-mer superstrings
- Solution is optimal, for C3 constraints
- Polynomial time algorithm in length of original sequences
- General superstring problem
- Requires completeness only
- NP-hard (Garey & Johnson, 1979)
- Approximable within a factor of 2.5
- MAX-SNP hard
29AA Sequence Databases
30Minimum Size C3 Sequence Database
31Relative Search Time
[Figure: relative search time for the SP, UP, IPI-H, SP-VS and UP-VS databases]
32Constraint Relaxation
- Why insist on compactness?
- What about 29-mers?
- Can we compress still further?
- Complete, Correct (C2)
- Use edges more than once, if helpful!
- How could this possibly help?
33C2 Superstring
- Sequence set with p paths: $N_k + p\,k$
- Costs k to use restart edges
- Restart edges from nodes v s.t. b(v) > 0 to v s.t. b(v) < 0
- Reuse edges instead! (provided the path length is < k)
- Transportation problem!
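The pairing step can be illustrated on a toy instance. The sketch below is not the author's formulation: it assumes unit surpluses with |S| = |T|, measures shortcut length by breadth-first search, charges k whenever no shortcut shorter than k exists, and solves the pairing by brute force, which is only viable for tiny inputs. The graph, node names and function names are all hypothetical.

```python
from collections import deque
from itertools import permutations


def shortest_path_lengths(graph, source):
    """BFS distances (in edges) from `source` in a directed graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def cheapest_connections(graph, S, T, k):
    """Pair each node of S with a node of T at minimum total cost.

    Connecting s to t costs the shortcut path length if it is < k,
    otherwise k (the cost of a restart edge).
    """
    cost = {}
    for s in S:
        dist = shortest_path_lengths(graph, s)
        for t in T:
            d = dist.get(t, k)
            cost[s, t] = d if d < k else k
    best = min(
        permutations(T),
        key=lambda perm: sum(cost[s, t] for s, t in zip(S, perm)),
    )
    pairs = list(zip(S, best))
    return pairs, sum(cost[s, t] for s, t in pairs)


# Hypothetical graph: s1 and s2 have surplus in-edges, t1 and t2 surplus out-edges.
graph = {
    "s1": ["a", "t2"], "a": ["t1"],
    "s2": ["x"], "x": ["y"], "y": ["z"], "z": ["w"], "w": ["t2"],
}
pairs, total = cheapest_connections(graph, ["s1", "s2"], ["t1", "t2"], k=4)
print(pairs, total)
```

A real instance would be solved as a transportation or min-cost flow problem rather than by enumerating permutations.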
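34C2 Superstring
S = {v : b(v) > 0}
T = {v : b(v) < 0}
[Figure: network from S to T; arc label: Cost k]
35C2 Superstring
S = {v : b(v) > 0}
T = {v : b(v) < 0}
[Figure: network from S to T; arc labels: Cost 0, Cost 0, Cost k]
36C2 Superstring
S = {v : b(v) > 0}
T = {v : b(v) < 0}
[Figure: S and T joined by shortcut paths labeled 4, 10 and 7]
37C2 Superstring
S = {v : b(v) > 0}
T = {v : b(v) < 0}
[Figure: network from S to T; arc labels: Cost 0, Cost 0, Cap 1, Cost 0]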
38C2 Superstring
39Extensions and Further Work
- Better compression
- Enumerate tryptic peptides only
- Relax correctness constraint
- Other uses of CSBH graphs
- Compact representation of mer counts
- Implicit set operations on mers
- Structural graph properties
40Thanks
- Informatics Research @ ABI Celera
- Ross Lippert, Clark Mobarry, Bjarni Halldorsson
- UMIACS @ University of Maryland, CP
- V.S. Subrahmanian, Fritz McCall, Doan Pham
41Swiss-Prot
42Swiss-Prot Variant Annotations
43Swiss-Prot Variant Annotations
44Swiss-Prot Sequence
45Swiss-Prot
- VarSplic enumerates all variants, conflicts, isoforms
- Swiss-Prot sequence size
- 60 Mb
- VarSplic sequence size
- 95 Mb
- How many more peptide candidates?
46Swiss-Prot Variant Annotations
[Figure: Swiss-Prot feature viewer showing variants]
47Swiss-Prot VarSplic Output
P13746-00-01-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-01-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-00-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-03-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-03-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-04-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF
P13746-01-04-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF
P13746-00-05-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-05-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-00-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-02-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-02-00  MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
48Swiss-Prot VarSplic Output
P13746-00-01-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-01-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-00-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-00-03-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-03-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-04-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-04-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-05-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-05-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-01-00-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-02-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYSQAASSDSAQ
P13746-01-02-00  SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYSQAASSDSAQ
49Peptide Candidates
- At most 1% additional peptides in 1.6 times as much sequence
50Mascot Running Time
51Total Search Time
[Figure: total search time for the UniProt, SwissProt, UniProt-VS, IPI-HUMAN and SwissProt-VS databases]