Title: Novel Peptide Identification using ESTs and Genomic Sequence
1Novel Peptide Identification using ESTs and
Genomic Sequence
- Nathan Edwards
- Center for Bioinformatics and Computational Biology
- University of Maryland, College Park
2Sample Preparation for Peptide Identification
3Mass Spectrometer
- Time-Of-Flight (TOF)
- Quadrupole
- Ion-Trap
- MALDI
- Electrospray Ionization (ESI)
4Single Stage MS
[Figure: MS spectrum; axis: m/z]
5Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum with precursor selection; axis: m/z]
6Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection followed by collision induced dissociation (CID), giving an MS/MS spectrum; axis: m/z]
7Peptide Identification
- For each (likely) peptide sequence
- 1. Compute fragment masses
- 2. Compare with spectrum
- 3. Retain those that match well
- Peptide sequences from protein sequence databases
- Swiss-Prot, IPI, NCBI's nr, ...
- Automated, high-throughput peptide identification
in complex mixtures
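The three steps above can be sketched roughly as follows. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, a fixed mass tolerance, and a simple count of matched peaks as the score. The function names and the `tol` and `min_matches` parameters are hypothetical.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        prefix = sum(RESIDUE_MASS[aa] for aa in peptide[:i])
        suffix = sum(RESIDUE_MASS[aa] for aa in peptide[i:])
        b_ions.append(prefix + PROTON)
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: count fragment masses that lie within tol of some peak."""
    b_ions, y_ions = fragment_masses(peptide)
    return sum(
        1 for mass in b_ions + y_ions
        if any(abs(mass - peak) <= tol for peak in peaks)
    )


def retain_matches(candidates, peaks, min_matches, tol=0.5):
    """Step 3: keep candidates with enough matched fragments, best first."""
    scored = [(count_matches(p, peaks, tol), p) for p in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)
```

Real search engines use more elaborate scoring and statistics; the point here is only the shape of the loop over candidate peptide sequences.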
8What goes missing?
- Known coding SNPs
- Novel coding mutations
- Alternative splicing isoforms
- Alternative translation start-sites
- Microexons
- Alternative translation frames
9Why should we care?
- Alternative splicing is the norm!
- Only 20-25K human genes
- Each gene makes many proteins
- Proteins have clinical implications
- Biomarker discovery
- Evidence for SNPs and alternative splicing stops with transcription
- Genomic assays, ESTs, mRNA sequence.
- Little hard evidence for translation start site
10Novel Splice Isoform
11Novel Splice Isoform
12Novel Frame
13Novel Frame
14Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
15Novel Mutation
16Genomic Peptide Sequences
- Genomic DNA
- Exons and introns, 6 frames, large (3Gb → 6Gb)
- ESTs
- No introns, 6 frames, large (4Gb → 8Gb)
- Used by gene, protein, and alternative splicing annotation pipelines
- Highly redundant, nucleotide error rate ~1%
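One reading of the two sizes quoted for each source is that the second is the size of the six-frame translation: N nucleotides give about N/3 codons per frame, and six frames give about 2N amino-acid letters. The sketch below checks that arithmetic; the function name is hypothetical and edge effects at sequence ends are ignored.

```python
def six_frame_amino_acids(nucleotides):
    """Approximate number of amino-acid letters in a six-frame translation."""
    codons_per_frame = nucleotides // 3  # one residue per codon
    return 6 * codons_per_frame          # three forward and three reverse frames


genomic = six_frame_amino_acids(3_000_000_000)  # 3 Gb of genomic DNA
ests = six_frame_amino_acids(4_000_000_000)     # 4 Gb of EST sequence
```

Under this reading the translated genomic sequence is about 6 billion letters and the translated ESTs about 8 billion.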
17Compressed EST Database
- Six-frame translation of all ESTs
- Optionally, ESTs that map to a gene
- Eliminate ORFs < 30 amino acids
- Amino-acid 30-mers
- Observed in at least two ESTs
- Represent AA 30-mers in C3 FASTA database
- Complete, Correct, Compact
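The construction above could be prototyped along these lines. This is a sketch under stated assumptions: an ORF is taken to be any stop-to-stop stretch of a translated frame, codons containing ambiguous bases are translated as X, and a k-mer counts once per EST regardless of how many frames contain it. The function names and the `min_ests` parameter are hypothetical; the gene-mapping step is not shown.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(dna):
    """Yield the three forward and three reverse-complement translations."""
    dna = dna.upper()
    for strand in (dna, dna.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            yield "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def orfs(protein, min_len=30):
    """Split a translated frame at stop codons and drop short stretches."""
    return [segment for segment in protein.split("*") if len(segment) >= min_len]


def supported_kmers(ests, k=30, min_ests=2):
    """Return amino-acid k-mers observed in at least min_ests distinct ESTs."""
    support = Counter()
    for est in ests:
        seen = set()
        for frame in six_frame_translations(est):
            for orf in orfs(frame, min_len=k):
                seen.update(orf[i:i + k] for i in range(len(orf) - k + 1))
        support.update(seen)  # each EST contributes at most once per k-mer
    return {kmer for kmer, n in support.items() if n >= min_ests}
```

For a quick check on toy data, call `supported_kmers` with a small `k` such as 4; the default of 30 matches the 30-mers on the slide.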
18SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
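As a rough illustration of the graph on this slide, the sketch below builds a de Bruijn-style SBH-graph in which nodes are (k-1)-mers and each distinct k-mer is an edge. The choice k = 4 is an assumption made for the example, not something stated on the slide, and `sbh_graph` is a hypothetical name.

```python
from collections import defaultdict


def sbh_graph(sequences, k):
    """Map each (k-1)-mer node to the set of nodes its k-mer edges lead to."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge from prefix node to suffix node
    return dict(graph)


graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
edge_count = sum(len(targets) for targets in graph.values())
```

With k = 4 the three sequences share several k-mers, so the graph has fewer edges than the sequences have k-mer positions; the nodes DEF and EFG are the branch points.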
19Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
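One plausible way to compress the graph is to merge every unbranched chain of k-mer edges into a single edge labelled with the spelled-out sequence. The sketch below does this for the example sequences, again assuming k = 4; the function names are hypothetical, and isolated cycles with no branch point are not handled.

```python
def kmer_set(sequences, k):
    """All distinct k-mers in the sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def compressed_edges(kmers):
    """Merge unbranched chains of k-mer edges into single labelled edges."""
    succ, indeg = {}, {}
    for kmer in sorted(kmers):
        succ.setdefault(kmer[:-1], []).append(kmer[1:])
        indeg[kmer[1:]] = indeg.get(kmer[1:], 0) + 1
    nodes = set(succ) | set(indeg)

    def is_internal(node):
        # Exactly one way in and one way out: safe to merge through.
        return indeg.get(node, 0) == 1 and len(succ.get(node, [])) == 1

    edges = []
    for start in sorted(nodes):
        if is_internal(start):
            continue
        for nxt in succ.get(start, []):
            label = start + nxt[-1]
            while is_internal(nxt):
                nxt = succ[nxt][0]
                label += nxt[-1]
            edges.append(label)
    return edges


edges = compressed_edges(kmer_set(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4))
```

For this input the ten k-mer edges collapse to five compressed edges: ACDEF, DEFACG, DEFG, EFGEFG and EFGI.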
20Sequence Databases and CSBH-graphs
- Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
21Sequence Databases and CSBH-graphs
- All k-mers represented by an edge have the same count
[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
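The edge-count property can be checked on the example sequences. The sketch assumes k = 4, counts each k-mer once per sequence that contains it, and takes as given the five compressed edge labels obtained by merging unbranched chains for that k; all names are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Number of sequences in which each k-mer occurs."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def edge_kmer_counts(edge_label, counts, k):
    """Counts of the k-mers spelled out along one compressed edge."""
    return [counts[edge_label[i:i + k]] for i in range(len(edge_label) - k + 1)]


counts = kmer_counts(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
# Compressed edge labels for k = 4 (assumed, see lead-in).
per_edge = {
    edge: edge_kmer_counts(edge, counts, 4)
    for edge in ["ACDEF", "DEFACG", "DEFG", "EFGEFG", "EFGI"]
}
uniform = all(len(set(values)) == 1 for values in per_edge.values())
```

In this example every edge carries a single count, so one number per edge is enough: three edges have count 2 and two have count 1.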
22cSBH-graphs
- Quickly determine the k-mers that occur twice
[Figure: cSBH-graph with edge counts 2, 2, 1, 2]
23Compressed-SBH-graph
[Figure: compressed SBH-graph with edge counts 2, 2, 1, 2]
ACDEFGI
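To see why ACDEFGI is a suitable representation of the repeated k-mers, the sketch below selects the k-mers seen in at least two of the example sequences and checks that a candidate set of FASTA entries contains all of them (complete) and nothing else (correct). The choice k = 4 and the function names are assumptions made for illustration.

```python
from collections import Counter


def kmers_of(sequences, k):
    """All distinct k-mers in a collection of sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def repeated_kmers(sequences, k, min_count=2):
    """k-mers that occur in at least min_count of the sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers_of([seq], k))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def complete_and_correct(entries, target_kmers, k):
    """True if the entries contain exactly the target k-mers."""
    return kmers_of(entries, k) == set(target_kmers)


target = repeated_kmers(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
ok = complete_and_correct(["ACDEFGI"], target, 4)
compact_size = len("ACDEFGI")          # letters in the single entry
naive_size = sum(len(kmer) for kmer in target)  # one entry per k-mer
```

The single entry needs 7 letters where listing the four k-mers separately would need 16, which is the sense in which the representation is compact.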
24Compressed EST Database
- Gene-centric compressed EST peptide sequence database
- 20,774 sequence entries
- 8 Gb vs. 223 Mb
- 35-fold compression
- 22 hours becomes 15 minutes
- E-values improve by a similar factor!
- Makes routine EST searching feasible
- Search ESTs instead of IPI?
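The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes 1 Gb = 1000 Mb; the variable and function names are hypothetical.

```python
def fold_change(before, after):
    """Ratio of a quantity before and after compression."""
    return before / after


size_fold = fold_change(8_000, 223)    # database size in Mb: 8 Gb vs. 223 Mb
time_fold = fold_change(22 * 60, 15)   # search time in minutes: 22 h vs. 15 min
```

The size ratio comes out just under 36, consistent with the quoted 35-fold compression, and the search-time ratio is 88. Since an E-value is an expected number of chance matches and so grows with the size of the search space, an improvement of the same order as the size ratio is what one would expect.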
25Conclusions
- Peptides identify more than just proteins
- Compressed peptide sequence databases make routine EST searching feasible
- cSBH-graph edge counts, C2/C3 enumeration algorithms
- Minimal FASTA representation of k-mer sets
26Collaborators
- Chau-Wen Tseng, Xue Wu
- Computer Science
- Catherine Fenselau, Crystal Harvey
- Biochemistry
- Calibrant Biosystems
- Thanks to PeptideAtlas, X!Tandem