Title: Peptide Identification Statistics: Pin the tail on the donkey?
1Peptide Identification Statistics: Pin the tail on the donkey?
- US HUPO Bioinformatics for Proteomics
- Nathan Edwards March 12, 2006
2Peptide Identification
- Peptide fragmentation by CID is poorly understood
- MS/MS spectra represent incomplete information about amino-acid sequence
- I/L, K/Q, GG/N
- Correct identifications don't come with a certificate!
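The ambiguities listed above follow from residue masses. A minimal sketch, using standard monoisotopic residue masses; the names `RESIDUE_MASS` and `mass_difference` are illustrative and not from the talk:

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 places).
RESIDUE_MASS = {
    "G": 57.02146, "N": 114.04293,
    "I": 113.08406, "L": 113.08406,
    "K": 128.09496, "Q": 128.05858,
}

def mass_difference(a, b):
    """Absolute mass difference between two residue strings, in daltons."""
    def total(seq):
        return sum(RESIDUE_MASS[r] for r in seq)
    return abs(total(a) - total(b))

# I/L are identical in mass; K/Q and GG/N differ by far less than 1 Da.
for a, b in [("I", "L"), ("K", "Q"), ("GG", "N")]:
    print(f"{a} vs {b}: {mass_difference(a, b):.5f} Da")
```

Whether K and Q can be told apart depends on the mass accuracy of the instrument; I and L cannot be separated by mass at all.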
3Peptide Identification
- High-throughput workflows demand we analyze all spectra, all the time.
- Spectra may not contain enough information to be interpreted correctly
- bad static on a cell phone
- Peptides may not match our assumptions
- it's all Greek to me
- "Don't know" is an acceptable answer!
6Peptide Identification
- We can't prove we are right
- so can we prove we aren't wrong? NO!
- The best we can do is to show our answer is better than guessing!
7Better than guessing
- Better implies comparison
- Score or measure of degree of success
- Guessing implies randomness
- Probability and statistics
8Pin the tail on the donkey
9Probability Concepts
- Throwing darts
- One at a time
- Blindfolded
- Identically distributed?
- Uniform distribution?
- Mutually exclusive?
- Independent?
- Pr[ Dart hits x ] = 0.05
10Probability Concepts
- Throwing darts
- One at a time
- Blindfolded
- Three darts
- Pr[ Hitting 20 3 times ]
- = 0.05 × 0.05 × 0.05 = 0.000125
- Pr[ Hit 20 at least twice ]
- = 0.007125 + 0.000125 = 0.00725
11Probability Concepts
12Probability Concepts
- Throwing darts
- One at a time
- Blindfolded
- Three darts
- Pr[ Hitting evens 3 times ]
- = Pr[ Hitting 1-10 3 times ]
- = 0.5 × 0.5 × 0.5 = 0.125
- Pr[ Evens at least twice ]
- = 0.5
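The three-dart probabilities can be checked by listing every outcome. A minimal sketch, assuming each dart lands in one of 20 equally likely sectors (no misses, no doubles or trebles); the helper name `probability` is illustrative:

```python
from fractions import Fraction
from itertools import product

SECTORS = range(1, 21)  # sectors numbered 1..20, assumed equally likely

def probability(event, darts=3):
    """Exact probability of `event` over all equally likely throw sequences."""
    outcomes = list(product(SECTORS, repeat=darts))
    hits = sum(1 for throw in outcomes if event(throw))
    return Fraction(hits, len(outcomes))

twenty_at_least_twice = probability(lambda t: t.count(20) >= 2)
evens_at_least_twice = probability(lambda t: sum(x % 2 == 0 for x in t) >= 2)

print(float(twenty_at_least_twice))  # 0.00725
print(float(evens_at_least_twice))   # 0.5
```

Enumeration only works for a handful of darts (20^3 = 8000 outcomes here); for 100 darts a closed-form distribution is needed.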
13Probability Concepts
14Probability Concepts
- Throwing darts
- One at a time
- Blindfolded
- 100 darts
- Pr[ Hitting 20 exactly 3 times ]
- = 0.139575
- Pr[ Hit 20 at least twice ]
- = 0.9629188
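The 100-dart numbers are binomial probabilities with p = 0.05. A minimal sketch that reproduces them, assuming independent, identically distributed throws; the function names are illustrative:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Pr[exactly k hits in n independent throws, each hitting with prob. p]."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_tail(k, n, p):
    """Pr[at least k hits in n throws]."""
    return 1.0 - sum(binomial_pmf(i, n, p) for i in range(k))

print(binomial_pmf(3, 100, 0.05))   # close to the slide's 0.139575
print(binomial_tail(2, 100, 0.05))  # close to the slide's 0.9629188
print(binomial_tail(2, 3, 0.05))    # three-dart case, close to 0.00725
```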
15Probability Concepts
16Match Score
- Dartboard represents the mass range of the spectrum
- Peaks of a spectrum are slices
- Width of slice corresponds to mass tolerance
- Darts represent
- random masses
- masses of fragments of a random peptide
- masses of peptides of a random protein
- masses of biomarkers from a random class
- How many darts do we get to throw?
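Under this analogy, the chance that one random mass hits some peak is the fraction of the board covered by slices. A rough sketch, assuming uniformly random masses and non-overlapping peak windows of width 2 × tolerance; the function name and the example numbers are hypothetical:

```python
def random_match_probability(num_peaks, tolerance, mass_low, mass_high):
    """Fraction of the mass range covered by peak windows.

    Assumes each peak contributes a window of width 2 * tolerance, windows
    do not overlap, and a random mass is uniform on [mass_low, mass_high].
    """
    covered = num_peaks * 2.0 * tolerance
    return min(1.0, covered / (mass_high - mass_low))

# Hypothetical spectrum: 50 peaks, +/- 0.5 Da tolerance, 100-1100 Da range.
p = random_match_probability(num_peaks=50, tolerance=0.5,
                             mass_low=100.0, mass_high=1100.0)
print(p)  # 0.05
```

With these made-up numbers the spectrum behaves like the "20" slice of the dartboard: each dart has a 1-in-20 chance of a match.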
17Match Score
- What is the probability that we match at least 5 peaks?
- Same as the probability of hitting 20 at least 5 times.
[Figure labels: 270, 330, 550, 580, 755, 870]
18Match Score
- Pr[ Match ≥ s peaks ]
- = Binomial( p, n )
- ≈ Poisson( p × n ), for small p and large n
- p is prob. of random mass / peak match
- n is number of darts (fragments in our answer)
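The binomial tail and its Poisson approximation can be compared directly. A minimal sketch; the values of p, n, and s are illustrative only:

```python
from math import comb, exp, factorial

def binomial_tail(s, n, p):
    """Pr[at least s matches] for n darts, each matching with probability p."""
    return 1.0 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s))

def poisson_tail(s, mean):
    """Pr[at least s matches] under a Poisson distribution with the given mean."""
    return 1.0 - sum(exp(-mean) * mean**k / factorial(k) for k in range(s))

p, n, s = 0.05, 100, 5  # illustrative: 100 fragments, 5% chance each, 5 matches
exact = binomial_tail(s, n, p)
approx = poisson_tail(s, n * p)
print(exact, approx)
```

The two agree more closely as p shrinks and n grows with p × n held fixed.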
19Match Score
- Theoretical distribution
- Used by OMSSA
- Proposed, in various forms, by many.
- Probability of random mass / peak match
- IID (independent, identically distributed)
- Based on match tolerance
20Match Score
- Theoretical distribution assumptions
- Each dart is independent
- Peaks are not related
- Each dart is identically distributed
- Chance of random mass / peak match is the same
for all peaks
21Tournament Size
[Figure: number of 20s from 100 darts, for tournaments of 100, 1000, 10000, and 100000 people]
23Number of Trials
- Tournament size = number of trials
- Number of peptides tried
- Related to sequence database size
- Probability that at least one random match score is ≥ s
- = 1 - Pr[ all match scores < s ]
- = 1 - Pr[ match score < s ]^Trials (*)
- Assumes IID!
- Expect value
- E = Trials × Pr[ match ≥ s ]
- Corresponds to Bonferroni bound on (*)
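The effect of the number of trials can be sketched numerically. This assumes IID trials, as the slide does; the single-trial probability `p_single` is a made-up value:

```python
def prob_any_match(p_single, trials):
    """Pr[at least one of `trials` IID random matches scores >= s]."""
    return 1.0 - (1.0 - p_single) ** trials

def expect_value(p_single, trials):
    """Expected number of random matches scoring >= s (Bonferroni bound)."""
    return trials * p_single

p_single = 1e-6  # hypothetical Pr[ match >= s ] for a single random peptide
for trials in (100, 1000, 10000, 100000):
    print(trials, prob_any_match(p_single, trials), expect_value(p_single, trials))
```

The expect value is never smaller than the exact probability, and the two are nearly equal while the expect value is well below 1.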
24Better Dart Throwers
25Better Random Models
- Comparison with a completely random model isn't really fair
- Match scores for real spectra with real peptides obey rules
- Even incorrect peptides match with non-random structure!
26Better Random Models
- Want to generate random fragment masses (darts) that behave more like the real thing
- Some fragments are more likely than others
- Some fragments depend on others
- Theoretical models can only incorporate this structure to a limited extent.
27Better Random Models
- Generate random peptides
- Real looking fragment masses
- No theoretical model!
- Must use empirical distribution
- Usually require they have the correct precursor mass
- Score function can model anything we like!
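With no theoretical model, significance comes from counting. A minimal sketch of an empirical p-value, assuming we already have scores for random peptides against the same spectrum; the add-one correction is a common convention chosen here, not something stated in the talk, and the scores are made up:

```python
def empirical_p_value(observed_score, random_scores):
    """Fraction of random-peptide scores at least as good as the observed one.

    The +1 in numerator and denominator keeps the estimate above zero,
    since a finite sample cannot support a p-value of exactly 0.
    """
    exceed = sum(1 for s in random_scores if s >= observed_score)
    return (exceed + 1) / (len(random_scores) + 1)

# Hypothetical scores of random peptides with the correct precursor mass.
random_scores = [1.0, 2.0, 3.0, 12.0]
print(empirical_p_value(10.0, random_scores))  # 0.4
```

The smallest p-value this can report is 1 / (N + 1), so small p-values need many random peptides or extrapolation.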
28Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
29Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
30Better Random Models
- Truly random peptides don't look much like real peptides
- Just use peptides from the sequence database!
- Caveats
- Correct peptide (non-random) may be included
- Peptides are not independent
- Reverse sequence avoids only the first problem
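Reversed sequences are simple to build. A minimal sketch; the `REV_` prefix, the toy protein, and the simplified tryptic digest (cleave after K or R, no proline rule, no missed cleavages) are all assumptions made for illustration:

```python
def reversed_decoys(proteins):
    """Map each protein name to a reversed-sequence decoy entry."""
    return {"REV_" + name: seq[::-1] for name, seq in proteins.items()}

def tryptic_peptides(sequence):
    """Simplified digest: cleave after every K or R."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

proteins = {"P1": "MKTAYIAKQR"}  # toy sequence, not a real protein
decoys = reversed_decoys(proteins)
print(tryptic_peptides(proteins["P1"]))     # ['MK', 'TAYIAK', 'QR']
print(tryptic_peptides(decoys["REV_P1"]))   # ['R', 'QK', 'AIYATK', 'M']
```

Decoy peptides cannot be the correct answer, but they inherit composition and repeats from the real database, so they are still not independent of each other.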
31Extrapolating from the Empirical Distribution
Fenyo & Beavis, Anal. Chem., 2003
32Extrapolating from the Empirical Distribution
- Often, the empirical shape is consistent with a
theoretical model
Fenyo & Beavis, Anal. Chem., 2003
Geer et al., J. Proteome Research, 2004
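One way to extrapolate is to fit a simple shape to the empirical tail and read off scores beyond the sample. A rough sketch, assuming the log of the survival fraction is linear in the score; this log-linear form is an assumption made here for illustration, and the synthetic scores are exact exponential quantiles rather than real data:

```python
import math

def fit_log_survival(scores, min_count=50):
    """Least-squares fit of ln(survival fraction) = intercept + slope * score."""
    ordered = sorted(scores)
    n = len(ordered)
    xs, ys = [], []
    for i, score in enumerate(ordered):
        remaining = n - i              # scores at or above this one (ties aside)
        if remaining < min_count:      # skip the noisy far tail
            break
        xs.append(score)
        ys.append(math.log(remaining / n))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def extrapolated_expect(score, intercept, slope, trials):
    """Expected number of random scores >= score, from the fitted line."""
    return trials * math.exp(intercept + slope * score)

# Synthetic "random match" scores: quantiles of an exponential distribution.
n = 2000
scores = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
intercept, slope = fit_log_survival(scores)
print(slope)
print(extrapolated_expect(12.0, intercept, slope, trials=n))
```

A score of 12 lies beyond every synthetic score here, so the estimate comes entirely from the fitted line; it is only as good as the assumed shape.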
33Peptide Prophet
- From the Institute for Systems Biology
- Keller et al., Anal. Chem. 2002
- Re-analysis of SEQUEST results
- Spectra are trials (NOT peptides!)
- Assumes that many of the spectra are not
correctly identified
34Peptide Prophet
Keller et al., Anal. Chem. 2002
Distribution of spectral scores in the results
35Peptide Prophet
- Assumes a bimodal distribution of scores, with a particular shape
- Ignores database size
- but it is included implicitly
- Like empirical distribution for peptide sampling, can be applied to any score function
- Can be applied to any search engine's results
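The bimodal idea can be sketched as a two-component mixture fitted by EM. This is a stand-in, not the published method: two Gaussians are assumed here only for simplicity, the slides do not state the shapes Peptide Prophet uses, and the scores are synthetic:

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_correct(score, hi_weight, lo, hi):
    """Pr[identification is correct | score] under the fitted mixture."""
    hi_part = hi_weight * normal_pdf(score, *hi)
    lo_part = (1.0 - hi_weight) * normal_pdf(score, *lo)
    total = hi_part + lo_part
    return hi_part / total if total > 0.0 else 0.5

def weighted_mean_sd(values, weights):
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component

def fit_mixture(scores, iterations=100):
    """EM for a two-Gaussian mixture: low = incorrect, high = correct."""
    lo = (min(scores), max((max(scores) - min(scores)) / 4.0, 1e-3))
    hi = (max(scores), lo[1])
    hi_weight = 0.5
    for _ in range(iterations):
        # E-step: responsibility of the "correct" component for each score.
        resp = [posterior_correct(s, hi_weight, lo, hi) for s in scores]
        # M-step: re-estimate both components and the mixing weight.
        hi = weighted_mean_sd(scores, resp)
        lo = weighted_mean_sd(scores, [1.0 - r for r in resp])
        hi_weight = sum(resp) / len(scores)
    return hi_weight, lo, hi

# Synthetic scores: 40 incorrect identifications near 0, 10 correct near 5.
scores = [-1.0, -0.5, 0.0, 0.5, 1.0] * 8 + [4.0, 4.5, 5.0, 5.5, 6.0] * 2
hi_weight, lo, hi = fit_mixture(scores)
print(hi_weight, posterior_correct(5.0, hi_weight, lo, hi))
```

The trials here are spectra: the mixture is fitted across all spectra in a run, and each spectrum's top score is then converted to a probability of being correct.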
36Peptide Prophet
- Caveats
- Are spectra scores sampled from the same distribution?
- Are there enough correct identifications for the second peak?
- Are spectra independent observations?
- Are distributions appropriately shaped?
- Huge improvement over raw SEQUEST results
37Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
38Peptides to Proteins
39Peptides to Proteins
- A peptide sequence may occur in many different protein sequences
- Variants, paralogues, protein families
- Separation, digestion and ionization are not well understood
- Proteins in sequence database are extremely non-random, and very dependent
40Peptides to Proteins
41Peptides to Proteins
- Mascot
- Protein score is sum of peptide scores
- Assumes peptide identifications are independent!
- SEQUEST
- Keeps only one of the proteins for each peptide?
42Peptides to Proteins
- Protein Prophet
- Nesvizhskii et al., Anal. Chem. 2003
- Models probability that a protein is correct based on
- Probability that its peptides are correct
- Models probability that a peptide is correct based on
- Probability that its proteins are correct
- Proteins with one high-probability peptide are not eliminated
- but are down-weighted
- Assumes identification probabilities from the same protein are independent (like Mascot)
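Under the independence assumption, a protein is wrong only if every one of its peptide identifications is wrong. A minimal sketch of that combination step alone; it leaves out the down-weighting of single-peptide proteins and the feedback from protein to peptide probabilities, and the input probabilities are made up:

```python
def protein_probability(peptide_probabilities):
    """Pr[protein is present], assuming independent peptide identifications.

    The protein is absent only if every peptide identification is wrong:
    1 - product of (1 - p_i).
    """
    all_wrong = 1.0
    for p in peptide_probabilities:
        all_wrong *= 1.0 - p
    return 1.0 - all_wrong

print(protein_probability([0.9, 0.8]))  # two distinct peptides
print(protein_probability([0.9]))       # a single-peptide protein
```

If the peptide identifications are not really independent, this overstates the protein probability.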
43Peptides to Proteins
- Best available method, to date, is Protein Prophet.
- The problem will only get worse, as we search variants and isoform sequences
- Proteins do not have a single sequence!
- Peptide identification is not protein identification!
44Publication Guidelines
45Publication Guidelines
- Computational parameters
- Spectral processing
- Sequence database
- Search program
- Statistical analysis
- Number of peptides per protein
- Each peptide sequence counts once!
- Multiple forms of the same peptide count once!
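The counting rule can be sketched as follows. The tuple layout (protein, sequence, charge, modifications) and the example identifications are hypothetical:

```python
from collections import defaultdict

def peptides_per_protein(identifications):
    """Count distinct peptide sequences per protein.

    Different charge states or modified forms of the same sequence
    are collapsed, so each peptide sequence counts once.
    """
    distinct = defaultdict(set)
    for protein, sequence, _charge, _modifications in identifications:
        distinct[protein].add(sequence)
    return {protein: len(seqs) for protein, seqs in distinct.items()}

identifications = [
    ("P1", "PEPTIDEK", 2, ""),
    ("P1", "PEPTIDEK", 3, "oxidation"),  # same sequence, different form
    ("P1", "ELVISK", 2, ""),
    ("P2", "LIVESK", 2, ""),
]
print(peptides_per_protein(identifications))  # {'P1': 2, 'P2': 1}
```

Here P2 is a single-peptide protein and would need the explicit justification the guidelines call for.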
46Publication Guidelines
- Single-peptide proteins must be explicitly justified by
- Peptide sequence
- N and C terminal amino-acids
- Precursor mass and charge
- Peptide Scores
- Multiple forms of the peptide counted once!
- Biological conclusions based on single-peptide
proteins must show the spectrum
47Publication Guidelines
- More stringent requirements for PMF data analysis
- Similar to that for tandem mass spectra
- Management of protein redundancy
- Peptides identified from a different species?
- Spectra submission encouraged
48Summary
- Could guessing be as effective as a search?
- More guesses improve the best guess
- Better guessers help us be more discriminating
- Independent observations only count if they are independent!
- Peptides to proteins is not as simple as it seems
- Publication guidelines reflect sound statistical
principles.