Title: Peptide Identification Statistics Pin the tail on the donkey
1Peptide Identification Statistics: Pin the tail on the donkey?
- US HUPO Bioinformatics for Proteomics
- Nathan Edwards March 12, 2005
2Peptide Identification
- Peptide fragmentation by CID is poorly understood
- MS/MS spectra represent incomplete information about amino-acid sequence
- I/L, K/Q, GG/N, ...
- Correct identifications don't come with a certificate!
3Peptide Identification
- High-throughput workflows demand we analyze all spectra, all the time.
- Spectra may not contain enough information to be interpreted correctly
- bad static on a cell phone
- Peptides may not match our assumptions
- it's all Greek to me
- "Don't know" is an acceptable answer!
4Peptide Identification
- We can't prove we are right
- so can we prove we aren't wrong?
5Peptide Identification
- We can't prove we are right
- so can we prove we aren't wrong?
NO!
6Peptide Identification
- We can't prove we are right
- so can we prove we aren't wrong?
NO!
- The best we can do is to show our answer is better than guessing!
7Better than guessing
- Better implies comparison
- Score or measure of degree of success
- Guessing implies randomness
- Probability and statistics
8Pin the tail on the donkey
9Probability Concepts
- Throwing darts
- One at a time
- Blindfolded
- Identically distributed?
- Uniform distribution?
- Mutually exclusive?
- Independent?
- Pr[ Dart hits x ] = 0.05
10Probability Concepts
- Throwing darts
- One at a time
- Blindfolded
- Three darts
- Pr[ Hitting 20 3 times ]
- = 0.05 × 0.05 × 0.05 = 0.000125
- Pr[ Hit 20 at least twice ]
- = 0.007125 + 0.000125 = 0.00725
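The three-dart numbers above follow from the binomial distribution. A minimal Python sketch, assuming each dart independently hits the 20 with probability 0.05 (the function names `binom_pmf` and `binom_tail` are illustrative, not from the talk):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k hits in n independent darts, each hitting with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    """Probability of at least k hits in n independent darts."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

p_twenty = 0.05  # assumption: 20 equally likely sectors, blindfolded thrower

three_of_three = binom_pmf(3, 3, p_twenty)  # 0.05 * 0.05 * 0.05
two_of_three = binom_pmf(2, 3, p_twenty)    # 3 * 0.05**2 * 0.95
at_least_two = binom_tail(2, 3, p_twenty)   # exactly two plus exactly three

print(f"{three_of_three:.6f} {two_of_three:.6f} {at_least_two:.6f}")
```

The same two functions cover the 100-dart case: `binom_pmf(3, 100, 0.05)` is the chance of exactly three 20s, and `binom_tail(2, 100, 0.05)` the chance of at least two.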
11Probability Concepts
12Probability Concepts
- Throwing darts
- One at a time
- Blindfolded
- Three darts
- Pr[ Hitting evens 3 times ]
- = Pr[ Hitting 1-10 3 times ]
- = 0.5 × 0.5 × 0.5 = 0.125
- Pr[ Evens at least twice ]
- = 0.5
13Probability Concepts
14Probability Concepts
- Throwing darts
- One at a time
- Blindfolded
- 100 darts
- Pr[ Hitting 20 exactly 3 times ]
- = 0.139575
- Pr[ Hit 20 at least twice ]
- = 0.9629188
15Probability Concepts
16Match Score
- Dartboard is peaks in a spectrum
- Each dart is a peptide fragment
- Pr[ Match s peaks ]
- = Binomial( p, n )
- ≈ Poisson( p n ), for small p and large n
- p is prob. of fragment / peak match,
- n is number of fragments
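To see how close the Poisson approximation is to the binomial, the two can be tabulated side by side. This is a sketch only: the values n = 100 and p = 0.05 are borrowed from the dartboard example rather than from any real spectrum, and the function names are illustrative.

```python
from math import comb, exp, factorial

def binom_pmf(s, n, p):
    """Pr[exactly s of n fragments match a peak], each matching independently with probability p."""
    return comb(n, s) * p**s * (1 - p)**(n - s)

def poisson_pmf(s, lam):
    """Poisson probability of exactly s matches with mean lam = p * n."""
    return exp(-lam) * lam**s / factorial(s)

n, p = 100, 0.05  # assumed values, taken from the 100-dart example

# Print s, the binomial probability, and the Poisson approximation.
for s in range(11):
    print(s, round(binom_pmf(s, n, p), 4), round(poisson_pmf(s, n * p), 4))
```

The approximation improves as p shrinks and n grows with p n held fixed, which is the "small p and large n" condition on the slide.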
17Match Score
- Theoretical distribution
- Used by OMSSA
- Proposed, in various forms, by many.
- Probability of fragment / peak match
- IID (independent, identically distributed)
- Based on match tolerance
- Can use fragments or peaks as darts!
18Match Score
- Theoretical distribution assumptions
- Each dart is independent
- Peaks are not related
- Each dart is identically distributed
- Chance of fragment / peak match is the same for
all peaks and fragments
19Tournament Size
[Figure: 100 darts, number of 20s; panels for tournaments of 100, 1000, 10000, and 100000 people]
20Tournament Size
[Figure: 100 darts, number of 20s; panels for tournaments of 100, 1000, 10000, and 100000 people]
21Number of Trials
- Tournament size = number of trials
- Number of peptides tried
- Related to sequence database size
- Probability that a random match score is ≥ s
- = 1 − Pr[ all match scores < s ]
- = 1 − Pr[ match score < s ]^Trials (*)
- Assumes IID!
- Expect value
- E = Trials × Pr[ match ≥ s ]
- Corresponds to Bonferroni bound on (*)
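The effect of the number of trials can be sketched numerically. The code below assumes IID trials, as the slide does, and uses a made-up single-trial probability of 1e-6; the function names are illustrative.

```python
def prob_any_random_match(p_single, trials):
    """Pr[at least one of `trials` IID random matches scores s or better],
    where p_single = Pr[one random match scores s or better]."""
    return 1.0 - (1.0 - p_single) ** trials

def expect_value(p_single, trials):
    """Expected number of random matches scoring s or better (the E-value)."""
    return trials * p_single

p_single = 1e-6  # hypothetical single-peptide probability, for illustration only

for trials in (100, 1000, 10000, 100000):
    print(trials, prob_any_random_match(p_single, trials), expect_value(p_single, trials))
```

The expect value is never smaller than the probability, and the two nearly coincide while E is well below 1. That is the sense in which E acts as a Bonferroni bound.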
22Better Dart Throwers
23Better Random Models
- Comparison with completely random model isn't really fair
- Match scores for real spectra with real peptides obey rules
- Even incorrect peptides match with non-random structure!
24Better Random Models
- Want to generate random fragment masses (darts) that behave more like the real thing
- Some fragments are more likely than others
- Some fragments depend on others
- Theoretical models can only incorporate this structure to a limited extent.
- Cannot model the properties of a particular peptide!
- Must capture behavior of fragments in general
25Better Random Models
- Generate random peptides
- Real looking fragment masses
- No theoretical model!
- Must use empirical distribution
- Usually require they have the correct precursor mass
- Score function can model anything we like!
26Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
27Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
28Better Random Models
- Truly random peptides don't look much like real peptides
- Just use peptides from the sequence database!
- Caveats
- Correct peptide (non-random) may be included
- Peptides are not independent
- Reverse sequence avoids only the first problem
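One way to picture the reversed-sequence idea is as an empirical null distribution: score the spectrum against decoy peptides, then ask how often a decoy does as well as the candidate. The sketch below is an illustration under stated assumptions, not any search engine's actual procedure. The `REV_` prefix, the toy sequences, and the toy scores are all invented.

```python
def reverse_database(proteins):
    """Return decoy proteins made by reversing each sequence (names get a REV_ prefix)."""
    return {"REV_" + name: seq[::-1] for name, seq in proteins.items()}

def empirical_p_value(observed_score, null_scores):
    """Fraction of null (decoy) scores at least as good as the observed score."""
    if not null_scores:
        raise ValueError("need at least one null score")
    exceed = sum(1 for s in null_scores if s >= observed_score)
    return exceed / len(null_scores)

# Toy inputs, invented for illustration only.
proteins = {"PROT1": "MKTAYIAKQR", "PROT2": "GSHMLEDPVR"}
decoys = reverse_database(proteins)

# Scores a hypothetical search engine gave one spectrum against decoy peptides.
null_scores = [3, 5, 2, 8, 4, 6, 3, 7, 5, 4]

print(decoys["REV_PROT1"], empirical_p_value(7, null_scores))
```

An estimate of zero only means that no sampled decoy reached the observed score. It cannot resolve probabilities smaller than one over the number of decoys scored.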
29Extrapolating from the Empirical Distribution
Fenyo & Beavis, Anal. Chem., 2003
30Extrapolating from the Empirical Distribution
- Often, the empirical shape is consistent with a
theoretical model
Fenyo & Beavis, Anal. Chem., 2003
Geer et al., J. Proteome Research, 2004
31Peptide Prophet
- From the Institute for Systems Biology
- Keller et al., Anal. Chem. 2002
- Re-analysis of SEQUEST results
- Spectra are trials (NOT peptides!)
- Assumes that many of the spectra are not
correctly identified
32Peptide Prophet
Keller et al., Anal. Chem. 2002
Distribution of spectral scores in the results
33Peptide Prophet
- Assumes a bimodal distribution of scores, with a particular shape
- Ignores database size
- but it is included implicitly
- Like empirical distribution for peptide sampling, can be applied to any score function
- Can be applied to any search engine's results
34Peptide Prophet
- Caveats
- Are spectra scores sampled from the same distribution?
- Are there enough correct identifications for second peak?
- Are spectra independent observations?
- Are distributions appropriately shaped?
- Huge improvement over raw SEQUEST results
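The bimodal-score idea can be illustrated with Bayes' rule on a two-component mixture. This is a sketch under loud assumptions, not the published Peptide Prophet model: both components are taken to be normal distributions, and their parameters and the fraction of correct identifications are fixed by hand instead of being fitted to the data.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def posterior_correct(score, frac_correct, correct, incorrect):
    """Pr[identification is correct | score] for a two-component mixture.
    `correct` and `incorrect` are (mean, sd) pairs; all values are hypothetical."""
    num = frac_correct * normal_pdf(score, *correct)
    den = num + (1.0 - frac_correct) * normal_pdf(score, *incorrect)
    return num / den

# Hypothetical mixture: 20% of spectra correctly identified.
frac_correct = 0.2
correct = (4.0, 1.0)    # assumed score distribution of correct identifications
incorrect = (0.0, 1.0)  # assumed score distribution of incorrect identifications

for score in (0.0, 2.0, 4.0):
    print(score, round(posterior_correct(score, frac_correct, correct, incorrect), 4))
```

If the second peak is small or the assumed shapes are wrong, the posterior is unreliable, which is the point of the caveats above.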
35Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
36Peptides to Proteins
37Peptides to Proteins
- A peptide sequence may occur in many different protein sequences
- Variants, paralogues, protein families
- Separation, digestion and ionization are not well understood
- Proteins in sequence database are extremely non-random, and very dependent
38Peptides to Proteins
39Peptides to Proteins
- Mascot
- Protein score is sum of peptide scores
- Assumes peptide identifications are independent!
- SEQUEST
- Keeps only one of the proteins for each peptide?
40Peptides to Proteins
- Protein Prophet
- Nesvizhskii et al., Anal. Chem. 2003
- Models probability that a protein is correct based on
- Probability that its peptides are correct
- Models probability that a peptide is correct based on
- Probability that its proteins are correct
- Proteins with one high-probability peptide are not eliminated
- but are down-weighted
- Assumes identification probabilities from the same protein are independent (like Mascot)
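Under the independence assumption in the last bullet, peptide probabilities combine into a protein probability as one minus the chance that every peptide identification is wrong. The sketch below shows only that combination step. It leaves out the handling of peptides shared between proteins and any down-weighting of single-peptide proteins, and the function name is illustrative.

```python
from math import prod

def protein_probability(peptide_probs):
    """Pr[protein is present], assuming its peptide identifications are independent:
    one minus the probability that all of them are incorrect."""
    return 1.0 - prod(1.0 - p for p in peptide_probs)

# Hypothetical peptide probabilities for three proteins.
print(protein_probability([0.9]))            # one strong peptide
print(protein_probability([0.5, 0.5]))       # two coin-flip peptides
print(protein_probability([0.5, 0.5, 0.5]))  # three coin-flip peptides
```

If the peptide identifications are not independent, for example several spectra of the same peptide, this formula overstates the protein probability.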
41Peptides to Proteins
- Best available method, to date, is Protein Prophet.
- The problem will only get worse, as we search variants and isoform sequences
- Proteins do not have a single sequence!
- Peptide identification is not protein identification!
42Publication Guidelines
43Publication Guidelines
- Computational parameters
- Spectral processing
- Sequence database
- Search program
- Statistical analysis
- Number of peptides per protein
- Each peptide sequence counts once!
- Multiple forms of the same peptide count once!
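The counting rule in the last two bullets can be sketched as counting distinct peptide sequences per protein, ignoring charge state and modification. The record layout `(protein, sequence, modifications, charge)` and the example identifications are hypothetical.

```python
from collections import defaultdict

def distinct_peptides_per_protein(identifications):
    """Count each peptide sequence once per protein, whatever its charge or modifications."""
    seen = defaultdict(set)
    for protein, sequence, _modifications, _charge in identifications:
        seen[protein].add(sequence.upper())
    return {protein: len(sequences) for protein, sequences in seen.items()}

# Hypothetical identifications: (protein, sequence, modifications, charge).
identifications = [
    ("PROT1", "LVNELTEFAK", "", 2),
    ("PROT1", "LVNELTEFAK", "", 3),         # same peptide, different charge
    ("PROT1", "YLYEIAR", "", 2),
    ("PROT2", "AMTNLR", "Oxidation (M)", 2),
    ("PROT2", "AMTNLR", "", 2),             # same peptide, unmodified form
]

print(distinct_peptides_per_protein(identifications))
```

Here PROT2 has two identifications but counts as a single-peptide protein, so it would need the explicit justification the guidelines ask for.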
44Publication Guidelines
- Single-peptide proteins must be explicitly justified by
- Peptide sequence
- N and C terminal amino-acids
- Precursor mass and charge
- Peptide Scores
- Multiple forms of the peptide counted once!
- Biological conclusions based on single-peptide proteins must show the spectrum
45Publication Guidelines
- More stringent requirements for PMF data analysis
- Similar to that for tandem mass spectra
- Management of protein redundancy
- Peptides identified from a different species?
- Spectra submission encouraged
46Summary
- Could guessing be as effective as a search?
- More guesses improves the best guess
- Better guessers help us be more discriminating
- Independent observations only count if they are independent!
- Peptides to proteins is not as simple as it seems
- Publication guidelines reflect sound statistical principles.