Title: Novel Peptide Identification using ESTs and Sequence Database Compression
1. Novel Peptide Identification using ESTs and Sequence Database Compression
- Nathan Edwards
- Center for Bioinformatics and Computational Biology
- University of Maryland, College Park
2. What is missing from protein sequence databases?
- Known coding SNPs
- Novel coding mutations
- Alternative splicing isoforms
- Alternative translation start-sites
- Microexons
- Alternative translation frames
3. Why don't we see more novel peptides?
- Tandem mass spectrometry doesn't discriminate against novel peptides...
- ...but protein sequence databases do!
- Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
4. Novel Splice Isoform
5. Novel Splice Isoform
6. Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
7. Novel Mutation
8. Searching ESTs
- Proposed long ago
  - Yates, Eng, and McCormack, Anal Chem, '95.
- Now
  - Protein sequences are sufficient for protein identification
  - Computationally expensive/infeasible
  - Difficult to interpret
- Make EST searching feasible for routine searching to discover novel peptides.
9. Searching Expressed Sequence Tags (ESTs)
- Pros
  - No introns!
  - Primary splicing evidence for annotation pipelines
  - Evidence for dbSNP
  - Often derived from clinical cancer samples
- Cons
  - No frame
  - Large (8Gb)
  - Untrusted by annotation pipelines
  - Highly redundant
  - Nucleotide error rate ~1%
10. Other Search Strategies
- Genome Corrected ESTs
  - Large (2Gb)
  - Controls for nucleotide error rate
  - Polymorphism lost, potential errors introduced
- Genome Clustered ESTs
  - Small, Gene model
  - Convergence to well-understood isoforms
  - Controls nucleotide error rate
- Full-Length mRNAs
  - Incomplete gene coverage, most are already in IPI
11. Other Search Strategies
- Genome
  - Large (6Gb), lots of non-coding DNA
  - Find novel ORFs, no sampling bias
  - Miss spliced peptide sequences.
- Genscan Exons
  - Small, find novel ORFs.
  - Miss spliced peptide sequences.
- How should we interpret peptide identifications with no mRNA evidence?
12. Compressed EST Peptide Sequence Database
- For all ESTs mapped to a UniGene gene
  - Six-frame translation
  - Eliminate ORFs < 30 amino-acids
  - Eliminate amino-acid 30-mers observed once
  - Compress to C2 FASTA database
  - Complete, Correct for amino-acid 30-mers
- Gene-centric peptide sequence database
  - Size: < 3% of naïve enumeration, 20774 FASTA entries
  - Running time: ~1% of naïve enumeration search
  - E-values: ~2% of naïve enumeration search results
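The first three steps of this pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation. The function names `six_frame_orfs` and `repeated_kmers` are hypothetical. It assumes the standard genetic code, treats an "ORF" as any stop-free stretch of a translated frame, and counts a 30-mer as "observed once" when it is supported by only one EST.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frame_orfs(est, min_len=30):
    """Yield stop-free translated stretches with at least min_len residues.

    Assumption: an ORF is any stretch between stop codons, in any of the
    three frames on either strand. Unknown codons translate to 'X'.
    """
    est = est.upper()
    for strand in (est, est.translate(COMPLEMENT)[::-1]):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            protein = "".join(CODON.get(c, "X") for c in codons)
            for orf in protein.split("*"):
                if len(orf) >= min_len:
                    yield orf

def repeated_kmers(ests, k=30, min_count=2):
    """Return amino-acid k-mers supported by at least min_count ESTs."""
    support = Counter()
    for est in ests:
        seen = set()  # count each k-mer at most once per EST (an assumption)
        for orf in six_frame_orfs(est, min_len=k):
            seen.update(orf[i:i + k] for i in range(len(orf) - k + 1))
        support.update(seen)
    return {kmer for kmer, n in support.items() if n >= min_count}
```

With the defaults `k=30` and `min_count=2`, the returned set is the input a compression step would work from; the smaller values used for quick checks are only for illustration.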
13. Compressed EST Peptide Sequence Database
14. SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
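As a rough sketch of how such a graph can be built from the three example sequences: each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `sbh_graph` is hypothetical, and k = 4 is an assumption, since the slides do not state the k used in the example.

```python
from collections import defaultdict

def sbh_graph(sequences, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # The k-mer is the edge; its prefix and suffix are the end nodes.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# k = 4 is assumed, not given on the slide.
graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
edge_total = sum(len(targets) for targets in graph.values())
```

Under this assumption the example has ten distinct edges, and the nodes DEF and EFG are the only ones with two outgoing edges.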
15. Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
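One plausible way to compress the graph is to merge every chain of nodes that have exactly one incoming and one outgoing edge into a single labeled edge. The sketch below does that for the example sequences. The name `compressed_sbh_edges` is hypothetical, k = 4 is assumed, and treating nodes where a sequence starts or ends as non-trivial is also an assumption of this sketch.

```python
from collections import defaultdict

def compressed_sbh_edges(sequences, k):
    """Return the labels spelled by the edges of a compressed graph (k >= 2)."""
    succ, pred = defaultdict(set), defaultdict(set)
    endpoints = set()
    for seq in sequences:
        if len(seq) < k:
            continue
        endpoints.update((seq[:k - 1], seq[-(k - 1):]))
        for i in range(len(seq) - k + 1):
            u, v = seq[i:i + k - 1], seq[i + 1:i + k]
            succ[u].add(v)
            pred[v].add(u)

    def trivial(node):
        # A trivial node lies in the middle of a non-branching chain.
        return (node not in endpoints
                and len(succ[node]) == 1 and len(pred[node]) == 1)

    edges = []
    for u in sorted(set(succ) | set(pred)):
        if trivial(u):
            continue
        for v in sorted(succ[u]):
            label = u + v[-1]
            while trivial(v):          # extend through the chain
                v = next(iter(succ[v]))
                label += v[-1]
            edges.append(label)
    # Note: a cycle made only of trivial nodes is not reported by this sketch.
    return edges

edges = compressed_sbh_edges(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
```

For the example this yields five compressed edges in place of the ten k-mer edges of the uncompressed graph.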
16. Sequence Databases and CSBH-graphs
- Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
17. Sequence Databases and CSBH-graphs
- All k-mers represented by an edge have the same count
(Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2.)
18. CSBH-graphs
- Quickly determine which k-mers occur at least twice
(Figure: CSBH-graph with edge counts 2, 2, 1, 2.)
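The edge-count property can be checked on the running example. The sketch below counts every k-mer, confirms that all k-mers along a compressed edge share one count, and then keeps the edges whose count is at least two. The function name `edge_counts` is hypothetical, k = 4 is assumed, and the list of compressed edge labels was worked out by hand for this example rather than taken from the slides.

```python
from collections import Counter

def edge_counts(sequences, edges, k):
    """Return one k-mer count per compressed edge, checking they agree."""
    counts = Counter(seq[i:i + k] for seq in sequences
                     for i in range(len(seq) - k + 1))
    per_edge = {}
    for edge in edges:
        values = {counts[edge[i:i + k]] for i in range(len(edge) - k + 1)}
        if len(values) != 1:
            raise ValueError(f"k-mers on edge {edge} disagree: {sorted(values)}")
        per_edge[edge] = values.pop()
    return per_edge

sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
# Hand-derived compressed edge labels, assuming k = 4.
edges = ["ACDEF", "DEFACG", "DEFG", "EFGEFG", "EFGI"]
counts = edge_counts(sequences, edges, k=4)
repeated = sorted(edge for edge, n in counts.items() if n >= 2)
```

Because the count is a property of the whole edge, a threshold such as "at least twice" can be applied once per edge instead of once per k-mer.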
19. de Bruijn Sequences
- de Bruijn sequences represent all words of length k from some alphabet A.
- A = {0,1}, k = 3: s = 0001110100
- A = {0,1}, k = 4: s = 0000111101011001000
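Both example strings can be checked mechanically. The helper below, with the hypothetical name `is_de_bruijn`, tests the linear form of the property: every word of length k over the alphabet appears in the string exactly once.

```python
from itertools import product

def is_de_bruijn(s, alphabet, k):
    """True if s contains every length-k word over alphabet exactly once."""
    words = [s[i:i + k] for i in range(len(s) - k + 1)]
    expected = {"".join(p) for p in product(alphabet, repeat=k)}
    # Same number of windows as words, and no word missing, means no repeats.
    return len(words) == len(expected) and set(words) == expected
```

A string that passes has length |A|^k + k - 1, which is 10 for k = 3 and 19 for k = 4 over a binary alphabet.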
20. de Bruijn Graph: A = {0,1}, k = 4
(Figure: de Bruijn graph with each edge labeled 0 or 1.)
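A de Bruijn sequence can be read off such a graph by walking an Eulerian circuit and recording one symbol per edge. The sketch below uses the standard iterative Hierholzer walk; the function name `de_bruijn_sequence` is hypothetical, and it assumes k >= 2 and an alphabet given as a string. The sequence it produces need not match the one shown in the slides, since Eulerian circuits are not unique.

```python
from itertools import product

def de_bruijn_sequence(alphabet, k):
    """Spell a linear de Bruijn sequence from an Eulerian circuit (k >= 2)."""
    # Nodes are (k-1)-mers; each has one unused outgoing edge per symbol.
    unused = {"".join(p): list(alphabet)
              for p in product(alphabet, repeat=k - 1)}
    start = alphabet[0] * (k - 1)
    stack, circuit = [start], []
    while stack:
        node = stack[-1]
        if unused[node]:
            symbol = unused[node].pop()
            stack.append((node + symbol)[1:])  # follow the edge node -> next
        else:
            circuit.append(stack.pop())        # dead end: emit node
    circuit.reverse()
    # The last symbol of each visited node is the label of the edge taken.
    return start + "".join(node[-1] for node in circuit[1:])

s = de_bruijn_sequence("01", 4)
```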
21. Correct, Complete, Compact (C3) Enumeration
- Set of paths that use each edge exactly once
ACDEFGEFGI, DEFACG
22. Correct, Complete (C2) Enumeration
- Set of paths that use each edge at least once
ACDEFGEFGI, DEFACG
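The three properties can be tested directly on k-mer sets. In the sketch below, "complete" is read as "every original k-mer appears", "correct" as "no k-mer appears that was not in the original", and "compact" as "no k-mer appears more than once"; these readings and the name `classify_enumeration` are assumptions of this example. k = 4 is also assumed: the slides do not state k, but it is the value for which the example works out.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every k-mer occurrence across the given sequences."""
    return Counter(seq[i:i + k] for seq in sequences
                   for i in range(len(seq) - k + 1))

def classify_enumeration(original, enumeration, k):
    """Return (complete, correct, compact) for an enumeration."""
    want = kmer_counts(original, k)
    have = kmer_counts(enumeration, k)
    complete = set(want) <= set(have)     # nothing lost
    correct = set(have) <= set(want)      # nothing invented
    compact = all(n == 1 for n in have.values())  # nothing repeated
    return complete, correct, compact

original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
enumeration = ["ACDEFGEFGI", "DEFACG"]
result = classify_enumeration(original, enumeration, k=4)
```

With k = 4 the enumeration satisfies all three properties while shrinking the example from 23 to 16 characters; the original sequences, checked against themselves, are complete and correct but not compact.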
23. Patching the CSBH-graph
- Use artificial edges to fix unbalanced nodes
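To see which nodes are unbalanced in the running example, one can compute out-degree minus in-degree for every (k-1)-mer node. The sketch below does this and derives how many paths a C3 enumeration needs and how long it is. The function name `node_imbalance` is hypothetical and k = 4 is assumed.

```python
from collections import Counter

def node_imbalance(kmers):
    """Return out-degree minus in-degree for every unbalanced node."""
    balance = Counter()
    for kmer in set(kmers):
        balance[kmer[:-1]] += 1   # edge leaves its prefix node
        balance[kmer[1:]] -= 1    # edge enters its suffix node
    return {node: b for node, b in balance.items() if b != 0}

sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
k = 4  # assumed; the slides do not state k
kmers = {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}
imbalance = node_imbalance(kmers)

# Each unit of surplus out-degree forces one path to start at that node.
# (A connected component with no unbalanced node would add one more path.)
paths_needed = sum(b for b in imbalance.values() if b > 0)
# A path spelling m k-mers has m + k - 1 characters.
c3_size = len(kmers) + paths_needed * (k - 1)
```

Here two nodes have surplus out-degree and two have surplus in-degree, so two paths are needed and the enumeration has 10 + 2 * 3 = 16 characters.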
24. Patching the CSBH-graph
- Use matching-style formulations to choose artificial edges
- Optimal C2/C3 enumeration in polynomial time.
- Chinese Postman Problem
  - Edmonds and Johnson, '73
- l-tuple DNA sequencing
  - Pevzner, '89
- Shortest (Common) Superstring
  - MAX-SNP-hard, 2.5 approx algorithm
25. C3 Enumeration
(Figure labels: in-out; in-out; Cost k.)
26. C3 Enumeration
(Figure labels: in-out; in-out; Cost 0; Cost 0; Cost k.)
27. Reusing Edges
- ACDEHAC, ACDFHAC, ACDGHACD
28. Reusing Edges
- C3: ACDEHACDFHAC, ACDGHACD
29. Reusing Edges
30. C2 Enumeration
(Figure labels: in-out; in-out; Shortcut paths; values 4, 10, 7.)
31. Implementation
- CSBH-graph construction
  - Determine non-trivial nodes directly
  - Consecutive non-trivial nodes determine edges
- C3/C2 enumeration
  - C3: Trivial assignment of artificial edges
  - C2: Depth-first search, Goldberg's CS2 min cost flow code
  - Eulerian path algorithm
- Can be applied to entire EST database
  - Condor grid and PBS cluster for CSBH-graph construction
  - Large memory machine for C3/C2 enumeration
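The C3 case can be illustrated end to end on a small k-mer set. The sketch below pairs surplus-in nodes with surplus-out nodes in sorted order (an arbitrary pairing standing in for the "trivial assignment"), walks an Eulerian circuit with the standard Hierholzer method, and cuts the circuit at the artificial edges. It works on the uncompressed k-mer graph, not the CSBH-graph, and it does not attempt the C2 case or use Goldberg's CS2 code. The function name `c3_enumeration` is hypothetical and k = 4 is assumed in the usage.

```python
from collections import defaultdict

def c3_enumeration(kmers):
    """Return sequences that together contain each k-mer exactly once."""
    out_edges = defaultdict(list)  # node -> [(next node, symbol or None)]
    balance = defaultdict(int)     # out-degree minus in-degree
    for kmer in sorted(set(kmers)):
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append((v, kmer[-1]))
        balance[u] += 1
        balance[v] -= 1
    # Artificial edges (symbol None) from surplus-in to surplus-out nodes.
    sources = [n for n, b in sorted(balance.items()) for _ in range(b)]
    sinks = [n for n, b in sorted(balance.items()) for _ in range(-b)]
    for sink, source in zip(sinks, sources):
        out_edges[sink].append((source, None))

    paths = []
    for start in sorted(out_edges):
        if not out_edges[start]:
            continue  # all edges here were used by an earlier circuit
        stack, circuit = [(start, None)], []
        while stack:  # iterative Hierholzer walk
            node = stack[-1][0]
            if out_edges[node]:
                stack.append(out_edges[node].pop())
            else:
                circuit.append(stack.pop())
        steps = circuit[::-1][1:]  # (node entered, symbol of edge taken)
        cut = next((i for i, (_, sym) in enumerate(steps) if sym is None), None)
        if cut is None:
            steps = [(start, None)] + steps      # balanced component
        else:
            steps = steps[cut:] + steps[:cut]    # rotate to an artificial edge
        for node, symbol in steps:
            if symbol is None:
                paths.append(node)               # artificial edge: new sequence
            else:
                paths[-1] += symbol
    return paths

sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
kmers = {s[i:i + 4] for s in sequences for i in range(len(s) - 3)}  # k = 4 assumed
paths = c3_enumeration(kmers)
```

A different pairing of artificial edges can give a different but equally compact pair of sequences, which is why the choice of pairing only matters once edge reuse and costs enter, as in the C2 case.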
32. Conclusions
- Peptides identify more than just proteins
- Compressed peptide sequence databases make routine EST searching feasible
  - Currently available for download
- Can include other sources of peptide sequence at little additional cost.
- CSBH-graph, edge counts, and C2/C3 enumeration algorithms
  - Minimal FASTA representation of k-mer sets
33. Acknowledgements
- Chau-Wen Tseng, Xue Wu
  - UMCP Computer Science
- Catherine Fenselau, Crystal Harvey
  - UMCP Biochemistry
- Calibrant Biosystems
- PeptideAtlas, HUPO PPP, X!Tandem
- Funding: National Cancer Institute