Title: Improving the Sensitivity of Peptide Identification
1Improving the Sensitivity of Peptide Identification
- Nathan Edwards
- Department of Biochemistry and Molecular & Cellular Biology
- Georgetown University Medical Center
- Xue Wu, Chau-Wen Tseng
- Department of Computer Science
- University of Maryland, College Park
2Lost peptide identifications
- Missing from the sequence database
- Search engine strengths, weaknesses, quirks
- Poor score or statistical significance
- Thorough search takes too long
3Lost peptide identifications
- Missing from the sequence database
- Build exhaustive peptide sequence databases
- Search engine strengths, weaknesses, quirks
- Use multiple search engines and combine results
- Poor score or statistical significance
- Use spectral-matching to identify weak spectra
- Use search-engine consensus to boost confidence
- Use machine-learning to distinguish true from false
- Thorough search takes too long
- Harness the power of heterogeneous computational grids
4Peptide Sequence Databases
- All peptides at most 30 amino-acids long from
- IPI and all IPI constituent protein sequences
- IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank
- SwissProt variants, conflicts, splices, and signal peptide truncations.
- GenBank and RefSeq mRNA sequence
- 3 frame translation
- GenBank EST and HTC sequences
- 6 frame translation and found in at least 2 sequences
- Grouped by UniGene cluster and compressed.
5Peptide Sequence Databases
- Formatted as a FASTA sequence database
- Easy integration with search engines.
- One entry per gene/cluster.
- Automated rebuild every few months.
Organism     Size (AA)   Size (Entries)
Human        209Mb       75,043
Mouse        151Mb       55,929
Rat          67Mb        43,211
Zebra-fish   90Mb        47,922
6Spectral Matching with HMMs
7Spectral Matching with HMMs
8Hidden Markov Model
[Figure: HMM with Delete, Insert, and Ion states; an (m/z, intensity) pair is emitted by each ion and insert state.]
9Boosting Identification Sensitivity
10Spectral Matching of Peptide Variants
DFLAGGIAAAISK
DFLAGGVAAAISK
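To see why a spectrum of one peptide carries information about its variant, the sketch below computes singly charged b- and y-ion m/z values for the two peptides above and lists the fragment ions they share. It is an illustration only: the function names and the 0.01 Da matching tolerance are assumptions, and only the residues needed for these two peptides are tabulated (standard monoisotopic residue masses).

```python
# Monoisotopic residue masses (Da) for the residues in the two peptides.
MONO = {
    "A": 71.03711, "D": 115.02694, "F": 147.06841, "G": 57.02146,
    "I": 113.08406, "K": 128.09496, "L": 113.08406, "S": 87.03203,
    "V": 99.06841,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b-ion m/z values b1..b(n-1)."""
    total, ions = PROTON, []
    for aa in peptide[:-1]:
        total += MONO[aa]
        ions.append(round(total, 4))
    return ions


def y_ions(peptide):
    """Singly charged y-ion m/z values y1..y(n-1)."""
    total, ions = WATER + PROTON, []
    for aa in reversed(peptide[1:]):
        total += MONO[aa]
        ions.append(round(total, 4))
    return ions


def shared_ions(p, q, tol=0.01):
    """Names of fragment ions whose m/z agree within tol (assumed tolerance)."""
    shared = []
    for name, ions in (("b", b_ions), ("y", y_ions)):
        for i, (m1, m2) in enumerate(zip(ions(p), ions(q)), start=1):
            if abs(m1 - m2) <= tol:
                shared.append(f"{name}{i}")
    return shared


print(shared_ions("DFLAGGIAAAISK", "DFLAGGVAAAISK"))
```

The I to V substitution at position 7 shifts b7 and above and y7 and above by about 14.016 Da (one CH2), while b1 to b6 and y1 to y6 are unchanged.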
11Spectral Matching Extrapolation
12Comparison of search engine results
- No single score is comprehensive
- Search engines disagree
- Many spectra lack confident peptide assignment
Searle et al. JPR 7(1), 2008
13Combining search engine results: harder than it looks!
- Consensus boosts confidence, but...
- How to assess statistical significance?
- Gain specificity, but lose sensitivity!
- Incorrect identifications are correlated too!
- How to handle weak identifications?
- Consensus vs disagreement vs abstention
- Threshold at some significance?
- We apply unsupervised machine-learning...
- Lots of related work unified in a single framework.
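The consensus, disagreement, and abstention cases can be made concrete with a simple voting sketch. This is not the unsupervised learner the slides describe; it is plain thresholded voting, with an assumed E-value cutoff and hypothetical names, shown only to illustrate how the three cases arise.

```python
from collections import Counter


def classify(results, max_evalue=0.05):
    """Label each spectrum by how the engines' confident answers relate.

    results: {engine: {spectrum_id: (peptide, evalue)}}  (assumed layout)
    Returns {spectrum_id: (label, peptide_or_None)}.
    """
    spectra = set().union(*(r.keys() for r in results.values()))
    labels = {}
    for sid in sorted(spectra):
        # Only identifications passing the threshold get a vote.
        votes = Counter(
            r[sid][0]
            for r in results.values()
            if sid in r and r[sid][1] <= max_evalue
        )
        if not votes:
            labels[sid] = ("abstention", None)
        elif len(votes) > 1:
            labels[sid] = ("disagreement", None)
        else:
            (peptide, n), = votes.items()
            labels[sid] = ("consensus" if n >= 2 else "single", peptide)
    return labels


example = {
    "Mascot": {"s1": ("PEPTIDEK", 0.001), "s2": ("AAAK", 0.2)},
    "X!Tandem": {"s1": ("PEPTIDEK", 0.01), "s3": ("GGGR", 0.01)},
    "OMSSA": {"s3": ("GGLR", 0.02), "s4": ("LLK", 0.01)},
}
for sid, label in classify(example).items():
    print(sid, label)
```

Note how the hard threshold discards `s2` entirely: requiring agreement among thresholded results gains specificity at the cost of sensitivity, which is the trade-off the bullets above point out.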
14Supervised Learning
15Unsupervised Learning
16PepArML Combining Results
17Unsupervised Learning
[Figure: false positive rate versus iteration for U-TMO, C-TMO, and H.]
18Searching for Consensus
- Search engine quirks can destroy consensus
- Initial methionine loss as tryptic peptide
- Charge state enumeration or guessing
- X!Tandem's refinement mode
- Pyro-Gln, Pyro-Glu modifications
- Difficulty tracking spectrum identifiers
- Precursor mass tolerance (Da vs ppm)
- Decoy searches must be identical!
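The Da versus ppm mismatch is easy to quantify. The sketch below (hypothetical helper names, illustrative numbers) converts between the two units and shows that the same observed mass can pass under one engine's tolerance and fail under another's.

```python
def ppm_to_da(mass, ppm):
    """Absolute tolerance in Da that corresponds to `ppm` at this mass."""
    return mass * ppm / 1e6


def da_to_ppm(mass, da):
    """Relative tolerance in ppm that corresponds to `da` at this mass."""
    return da / mass * 1e6


def within_tolerance(observed, theoretical, tol, unit="ppm"):
    """True if the observed mass matches the theoretical one within tol."""
    delta = abs(observed - theoretical)
    if unit == "ppm":
        return delta <= ppm_to_da(theoretical, tol)
    if unit == "Da":
        return delta <= tol
    raise ValueError(f"unknown tolerance unit: {unit!r}")


# A 0.02 Da error on a 1000 Da precursor is 20 ppm:
print(da_to_ppm(1000.0, 0.02))
# It fails a 10 ppm window but passes a 0.5 Da window.
print(within_tolerance(1000.02, 1000.0, 10, "ppm"))
print(within_tolerance(1000.02, 1000.0, 0.5, "Da"))
```

Because a fixed Da window is wider in ppm terms for light peptides and narrower for heavy ones, two engines configured "the same" in different units consider different candidate sets.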
19Configuring for Consensus
- Search engine configuration can be difficult
- Correct spectral format
- Search parameter files and command-line
- Pre-processed sequence databases.
- Tracking spectrum identifiers
- Extracting peptide identifications, especially modifications and protein identifiers
20Peptide Identification Meta-Search Parameters
- Instrument
- Precursor Tolerance
- Fragment Tolerance
- Max. Charge
- Sequence Database
- Target/Decoy
- Modification
- Fixed/Variable
- Amino-Acids
- Position
- Delta
- Proteolytic Agent
- Motif
- Peptide Candidates
- Termini Specificity
- Precursor Tolerance
- Missed cleavages
- Charge State Handling
- 13C Peaks
- Search Engines
- Mascot, X!Tandem
- OMSSA, MyriMatch
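One way to picture these parameters is as a single engine-neutral record that is later translated into each engine's own configuration. The sketch below is an assumption about how such a record might look, not the tool's actual schema: every class, field name, and default value is hypothetical, and the cleavage motif is written in X!Tandem-style notation purely as an example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Modification:
    name: str
    amino_acids: str        # residues the modification applies to
    delta: float            # monoisotopic mass shift in Da
    fixed: bool = False     # fixed (always present) vs variable
    position: str = "any"   # e.g. "any", "N-term", "C-term"


@dataclass
class MetaSearch:
    instrument: str
    sequence_database: str
    precursor_tolerance: float
    precursor_unit: str             # "Da" or "ppm"
    fragment_tolerance: float       # Da
    max_charge: int = 3
    missed_cleavages: int = 1
    cleavage_motif: str = "[KR]|{P}"   # trypsin, X!Tandem-style notation
    semi_specific_termini: bool = False
    consider_13c_peaks: bool = False
    decoy: bool = True
    modifications: List[Modification] = field(default_factory=list)
    engines: List[str] = field(
        default_factory=lambda: ["Mascot", "X!Tandem", "OMSSA", "MyriMatch"]
    )

    def jobs(self):
        """One (engine, target-or-decoy) pair per search that must be run."""
        kinds = ["target", "decoy"] if self.decoy else ["target"]
        return [(engine, kind) for engine in self.engines for kind in kinds]


search = MetaSearch(
    instrument="ESI-ion-trap",          # illustrative value
    sequence_database="human.fasta",    # hypothetical file name
    precursor_tolerance=2.0,
    precursor_unit="Da",
    fragment_tolerance=0.6,
    modifications=[
        Modification("Carbamidomethyl", "C", 57.02146, fixed=True),
        Modification("Oxidation", "M", 15.99491),
    ],
)
print(len(search.jobs()))
```

Keeping one record and deriving every engine's settings from it is what makes target and decoy searches identical by construction.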
21Peptide Identification Meta-Search
- Simple unified search interface for
- Mascot, X!Tandem
- OMSSA, MyriMatch
- Automatic decoy searches
- Automatic spectrum file "chunking"
- Automatic scheduling
- Serial, Multi-Processor,
- Cluster, Grid
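Spectrum file chunking can be sketched as below. The slides do not say which spectrum format or chunk size is used; this example assumes MGF input (spectra delimited by `BEGIN IONS` and `END IONS`) and hypothetical function names, and simply groups spectra into fixed-size batches that can be scheduled independently.

```python
def read_mgf_spectra(lines):
    """Yield each BEGIN IONS ... END IONS block as a list of lines."""
    block = None
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "BEGIN IONS":
            block = [line]
        elif block is not None:
            block.append(line)
            if line.strip() == "END IONS":
                yield block
                block = None


def chunk(spectra, size):
    """Group spectra into batches of at most `size` spectra."""
    batch = []
    for spectrum in spectra:
        batch.append(spectrum)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def plan_jobs(lines, size, engines):
    """One independent job per (chunk index, engine, target/decoy)."""
    n_chunks = sum(1 for _ in chunk(read_mgf_spectra(lines), size))
    return [
        (i, engine, kind)
        for i in range(n_chunks)
        for engine in engines
        for kind in ("target", "decoy")
    ]
```

Each job is small and self-contained, so the same plan can run serially, on multiple processors, or across a cluster or grid.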
22Peptide Identification Meta-Search
- Heterogeneous compute resources
- NSF TeraGrid 1000 CPUs
- Edwards Lab Scheduler 48 CPUs
- UMIACS 250 CPUs
- Secure communication
- Simple search request
23Conclusions
- Improve sensitivity of peptide identification
- Exhaustive peptide sequence databases
- Machine-learning for matching and combining
- Meta-search tools maximize consensus
- Grid-computing to achieve thorough search
24Acknowledgements
- Catherine Fenselau
- University of Maryland Biochemistry
- Funding: NIH/NCI, USDA/ARS