1
Peptide Identification Statistics
Pin the tail on the donkey?

US HUPO: Bioinformatics for Proteomics
Nathan Edwards, March 12, 2006

2
Peptide Identification

Peptide fragmentation by CID is poorly understood
MS/MS spectra represent incomplete information
about amino-acid sequence
I/L, K/Q, GG/N,
Correct identifications don't come with a certificate!

3
Peptide Identification

High-throughput workflows demand we analyze all
spectra, all the time.
Spectra may not contain enough information to be
interpreted correctly
bad static on a cell phone
Peptides may not match our assumptions
it's all Greek to me
"Don't know" is an acceptable answer!

4
Peptide Identification

We can't prove we are right
so can we prove we aren't wrong?

5
Peptide Identification

We can't prove we are right
so can we prove we aren't wrong?

NO!
6
Peptide Identification

We can't prove we are right
so can we prove we aren't wrong?

NO!

The best we can do is to show our answer is better than guessing!
7
Better than guessing

Better implies comparison
Score or measure of degree of success
Guessing implies randomness
Probability and statistics

8
Pin the tail on the donkey
9
Probability Concepts

Throwing darts
One at a time
Blindfolded
Identically distributed?
Uniform distribution?
Mutually exclusive?
Independent?
Pr[ Dart hits x ] = 0.05

10
Probability Concepts

Throwing darts
One at a time
Blindfolded
Three darts
Pr[ Hitting 20 3 times ]
= 0.05 × 0.05 × 0.05 = 0.000125
Pr[ Hit 20 at least twice ]
= 0.007125 + 0.000125 = 0.00725
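
The three-dart numbers can be checked by brute force. The sketch below assumes a board of 20 equally likely sectors and independent throws (the uniform, blindfolded model from the previous slide); the helper name `prob_by_enumeration` is illustrative only.

```python
from itertools import product

# Assumption: 20 equally likely sectors, so Pr[dart hits x] = 0.05.
SECTORS = range(1, 21)

def prob_by_enumeration(n_darts, event):
    """Exact probability of `event` over all equally likely outcomes."""
    outcomes = list(product(SECTORS, repeat=n_darts))
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return favourable / len(outcomes)

# Three darts: all three land on 20, or at least two land on 20.
all_three = prob_by_enumeration(3, lambda o: o.count(20) == 3)
at_least_two = prob_by_enumeration(3, lambda o: o.count(20) >= 2)
```

Enumeration is only practical for a handful of darts (20^n outcomes), which is why a closed-form distribution is needed once the number of darts grows.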

11
Probability Concepts
12
Probability Concepts

Throwing darts
One at a time
Blindfolded
Three darts
Pr[ Hitting evens 3 times ]
= Pr[ Hitting 1-10 3 times ]
= 0.5 × 0.5 × 0.5 = 0.125
Pr[ Evens at least twice ]
= 0.5

13
Probability Concepts
14
Probability Concepts

Throwing darts
One at a time
Blindfolded
100 darts
Pr[ Hitting 20 exactly 3 times ]
= 0.139575
Pr[ Hit 20 at least twice ]
= 0.9629188
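
With 100 darts the same quantities come from the binomial distribution. A minimal sketch, again assuming independent throws with Pr[hit 20] = 0.05 on every throw; the function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr[exactly k hits in n independent throws], each hitting with probability p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def binom_tail(k, n, p):
    """Pr[at least k hits in n independent throws]."""
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(k))

exactly_three_of_100 = binom_pmf(3, 100, 0.05)
at_least_two_of_100 = binom_tail(2, 100, 0.05)
```

The same two functions reproduce the three-dart case, for example `binom_tail(2, 3, 0.05)`.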

15
Probability Concepts
16
Match Score

Dartboard represents the mass range of the
spectrum
Peaks of a spectrum are slices
Width of slice corresponds to mass tolerance
Darts represent
random masses
masses of fragments of a random peptide
masses of peptides of a random protein
masses of biomarkers from a random class
How many darts do we get to throw?

17
Match Score

What is the probability that we match at least 5 peaks?
Same as the probability of hitting 20 at least 5 times.

[Figure: spectrum with labeled peaks at 270, 330, 550, 580, 755, 870]
18
Match Score

Pr[ Match ≥ s peaks ]
= upper tail of Binomial( p , n )
≈ upper tail of Poisson( p n ), for small p and large n
p is the probability of a random mass / peak match,
n is the number of darts (fragments in our answer)
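
To see how close the two models are, the sketch below computes the probability of at least s random matches under both. The values of p, n and s are made-up illustrations, not taken from the talk.

```python
from math import comb, exp, factorial

def binomial_tail(s, n, p):
    """Pr[at least s matches] when each of n darts matches with probability p."""
    below = sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(s))
    return 1.0 - below

def poisson_tail(s, mean):
    """Pr[at least s matches] for a Poisson count with the given mean (p * n)."""
    below = sum(exp(-mean) * mean**i / factorial(i) for i in range(s))
    return 1.0 - below

# Hypothetical example: small p, large n, at least 5 matched peaks.
p, n, s = 0.00005, 10000, 5
exact = binomial_tail(s, n, p)
approx = poisson_tail(s, p * n)
```

For moderate n the two tails can differ noticeably (with p = 0.01 and n = 50 they differ by roughly 15 percent at s = 5), so the Poisson form is best read as a convenience for small p and large n.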

19
Match Score

Theoretical distribution
Used by OMSSA
Proposed, in various forms, by many.
Probability of random mass / peak match
IID (independent, identically distributed)
Based on match tolerance
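
One simple way to turn the match tolerance into a per-dart probability is to ask what fraction of the mass range lies within tolerance of some peak. This is only a sketch under the uniform random-mass assumption; it is not claimed to be how OMSSA or any other engine computes p, and the function name and arguments are hypothetical.

```python
def random_match_probability(peaks, tolerance, mass_lo, mass_hi):
    """Fraction of [mass_lo, mass_hi] lying within `tolerance` of any peak.

    Assumes a random mass is uniform over the range, so this fraction is
    the chance that one dart matches some peak.
    """
    windows = sorted(
        (max(mass_lo, m - tolerance), min(mass_hi, m + tolerance)) for m in peaks
    )
    covered = 0.0
    cur_lo = cur_hi = None
    for lo, hi in windows:
        if hi <= lo:
            continue  # peak window falls entirely outside the mass range
        if cur_hi is None or lo > cur_hi:
            if cur_hi is not None:
                covered += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            cur_hi = max(cur_hi, hi)  # merge overlapping windows
    if cur_hi is not None:
        covered += cur_hi - cur_lo
    return covered / (mass_hi - mass_lo)

# Illustrative values only: six peaks, +/- 0.5 tolerance, mass range 200-1000.
p = random_match_probability([270, 330, 550, 580, 755, 870], 0.5, 200.0, 1000.0)
```

Merging overlapping windows keeps closely spaced peaks from being counted twice, which matters as the tolerance widens.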

20
Match Score

Theoretical distribution assumptions
Each dart is independent
Peaks are not related
Each dart is identically distributed
Chance of random mass / peak match is the same
for all peaks

21
Tournament Size

[Figure: 100 darts, number of 20s hit; panels for tournaments of 100, 1000, 10000, and 100000 people]
22
Tournament Size
23
Number of Trials

Tournament size = number of trials
Number of peptides tried
Related to sequence database size
Probability that a random match score is ≥ s
= 1 - Pr[ all match scores < s ]
= 1 - Pr[ match score < s ]^Trials   (*)
Assumes IID!
Expect value
E = Trials × Pr[ match ≥ s ]
Corresponds to Bonferroni bound on (*)
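
The relationship between the two quantities on this slide can be sketched as follows, assuming IID trials. The per-peptide probability and the number of trials in the example are invented for illustration.

```python
from math import expm1, log1p

def prob_any_random_match(p_single, trials):
    """1 - (1 - p)^Trials: chance that at least one of the IID trials scores >= s."""
    return -expm1(trials * log1p(-p_single))

def expect_value(p_single, trials):
    """Trials * p: expected number of random matches scoring >= s."""
    return trials * p_single

# Hypothetical numbers: per-peptide p-value 1e-6, 100000 peptides tried.
p_any = prob_any_random_match(1e-6, 100_000)
e_val = expect_value(1e-6, 100_000)
```

The expect value is always at least as large as the probability, and the two nearly coincide when E is small, which is the sense in which E acts as a Bonferroni-style upper bound.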

24
Better Dart Throwers
25
Better Random Models

Comparison with a completely random model isn't really fair
Match scores for real spectra with real peptides
obey rules
Even incorrect peptides match with non-random
structure!

26
Better Random Models

Want to generate random fragment masses (darts)
that behave more like the real thing
Some fragments are more likely than others
Some fragments depend on others
Theoretical models can only incorporate this
structure to a limited extent.

27
Better Random Models

Generate random peptides
Real looking fragment masses
No theoretical model!
Must use empirical distribution
Usually require they have the correct precursor
mass
Score function can model anything we like!
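
A minimal sketch of the empirical approach: score a set of random peptides, then report the fraction that score at least as well as the candidate. Shuffling the candidate's residues is used here as one easy way to keep the precursor mass fixed; that choice is an assumption of this sketch, not necessarily what any particular tool does, and the scoring itself is left to the caller.

```python
import random

def shuffled_peptides(peptide, count, rng):
    """Random peptides with the same residues, hence the same precursor mass."""
    residues = list(peptide)
    decoys = []
    for _ in range(count):
        rng.shuffle(residues)
        decoys.append("".join(residues))
    return decoys

def empirical_p_value(observed_score, random_scores):
    """Fraction of random-peptide scores that are >= the observed score."""
    if not random_scores:
        raise ValueError("need at least one random score")
    at_least = sum(1 for s in random_scores if s >= observed_score)
    return at_least / len(random_scores)

# Example: 50 shuffled versions of a made-up peptide.
decoys = shuffled_peptides("PEPTIDEK", 50, random.Random(1))
```

The estimate cannot resolve probabilities smaller than 1 / (number of random peptides), so very confident matches simply come out as 0.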

28
Better Random Models
Fenyo and Beavis, Anal. Chem., 2003
29
Better Random Models
Fenyo and Beavis, Anal. Chem., 2003
30
Better Random Models

Truly random peptides don't look much like real peptides
Just use peptides from the sequence database!
Caveats
Correct peptide (non-random) may be included
Peptides are not independent
Reverse sequence avoids only the first problem
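
Building a reversed-sequence database is straightforward. In this sketch the database is a plain dictionary from protein name to sequence, and the `REV_` prefix is a hypothetical naming convention chosen for the example.

```python
def reversed_database(proteins, prefix="REV_"):
    """Return decoy entries whose sequences are the originals read backwards."""
    return {prefix + name: seq[::-1] for name, seq in proteins.items()}

# Toy two-protein database with made-up sequences.
targets = {"protA": "MKTAYIAK", "protB": "GSHMLEDP"}
decoys = reversed_database(targets)
```

Reversal keeps amino-acid composition and protein length, so the decoys look realistic, but the reversed peptides are still drawn from one shared set of proteins and so are no more independent than the originals.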

31
Extrapolating from the Empirical Distribution
Fenyo and Beavis, Anal. Chem., 2003
32
Extrapolating from the Empirical Distribution

Often, the empirical shape is consistent with a
theoretical model

Fenyo and Beavis, Anal. Chem., 2003
Geer et al., J. Proteome Research, 2004
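
One way to extrapolate is to fit a simple curve to the empirical survival function and read off probabilities beyond the observed scores. The sketch below assumes, purely for illustration, that log10 of the survival fraction is linear in the score; whether that shape is appropriate depends on the score function and should be checked against the data.

```python
from math import log10

def fit_log_survival(scores):
    """Least-squares line through (score, log10 fraction of scores >= score)."""
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    xs = ordered
    ys = [log10(rank / n) for rank in range(1, n + 1)]  # rank-based survival
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def extrapolated_p_value(score, slope, intercept):
    """Survival fraction predicted by the fitted line, capped at 1."""
    return min(1.0, 10.0 ** (intercept + slope * score))
```

In practice one would likely fit only the high-scoring side of the distribution, where the extrapolation is actually used, and handle tied scores more carefully than the rank-based survival above does.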
33
Peptide Prophet

From the Institute for Systems Biology
Keller et al., Anal. Chem. 2002
Re-analysis of SEQUEST results
Spectra are trials (NOT peptides!)
Assumes that many of the spectra are not
correctly identified

34
Peptide Prophet
Keller et al., Anal. Chem. 2002
Distribution of spectral scores in the results
35
Peptide Prophet

Assumes a bimodal distribution of scores, with a
particular shape
Ignores database size
but it is included implicitly
Like empirical distribution for peptide sampling,
can be applied to any score function
Can be applied to any search engine's results

36
Peptide Prophet

Caveats
Are spectra scores sampled from the same
distribution?
Are there enough correct identifications for a second peak?
Are spectra independent observations?
Are distributions appropriately shaped?
Huge improvement over raw SEQUEST results

37
Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
38
Peptides to Proteins
39
Peptides to Proteins

A peptide sequence may occur in many different
protein sequences
Variants, paralogues, protein families
Separation, digestion, and ionization are not well understood
Proteins in sequence database are extremely
non-random, and very dependent

40
Peptides to Proteins
41
Peptides to Proteins

Mascot
Protein score is sum of peptide scores
Assumes peptide identifications are independent!
SEQUEST
Keeps only one of the proteins for each peptide?

42
Peptides to Proteins

Protein Prophet
Nesvizhskii et al., Anal. Chem. 2003
Models probability that a protein is correct
based on
Probability that its peptides are correct
Models probability that a peptide is correct
based on
Probability that its proteins are correct
Proteins with one high-probability peptide are
not eliminated
but are down-weighted
Assumes identification probabilities from the
same protein are independent (like Mascot)
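
Under the independence assumption, a protein is wrong only if every one of its peptide identifications is wrong. The sketch below shows that combination step alone; the optional weights stand in for the share of a peptide assigned to this protein when the peptide occurs in several proteins, and the iterative re-estimation of peptide probabilities is not shown.

```python
def protein_probability(peptide_probs, weights=None):
    """1 - product of (1 - w_i * p_i), assuming independent peptide identifications."""
    if weights is None:
        weights = [1.0] * len(peptide_probs)
    if len(weights) != len(peptide_probs):
        raise ValueError("weights and peptide_probs must have the same length")
    prob_all_wrong = 1.0
    for p, w in zip(peptide_probs, weights):
        prob_all_wrong *= 1.0 - w * p
    return 1.0 - prob_all_wrong

# Two peptides at 0.9 and 0.8 versus a single peptide at 0.9.
two_peptides = protein_probability([0.9, 0.8])
one_peptide = protein_probability([0.9])
```

If peptide identifications from the same protein are positively correlated, this product overstates the evidence, which is the same caveat raised for summed peptide scores.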

43
Peptides to Proteins