Peptide Identification Statistics: Pin the tail on the donkey

Provided by: nathanjoh
1
Peptide Identification Statistics: Pin the tail on
the donkey?
  • US HUPO: Bioinformatics for Proteomics
  • Nathan Edwards, March 12, 2006

2
Peptide Identification
  • Peptide fragmentation by CID is poorly understood
  • MS/MS spectra represent incomplete information
    about amino-acid sequence
  • I/L, K/Q, GG/N, ...
  • Correct identifications don't come with a
    certificate!

3
Peptide Identification
  • High-throughput workflows demand we analyze all
    spectra, all the time.
  • Spectra may not contain enough information to be
    interpreted correctly
  • bad static on a cell phone
  • Peptides may not match our assumptions
  • it's all Greek to me
  • "Don't know" is an acceptable answer!

4
Peptide Identification
  • We can't prove we are right
  • so can we prove we aren't wrong?

5
Peptide Identification
  • We can't prove we are right
  • so can we prove we aren't wrong?

NO!
6
Peptide Identification
  • We can't prove we are right
  • so can we prove we aren't wrong?
  • The best we can do is to show our answer is
    better than guessing!

NO!
7
Better than guessing
  • Better implies comparison
  • Score or measure of degree of success
  • Guessing implies randomness
  • Probability and statistics

8
Pin the tail on the donkey
9
Probability Concepts
  • Throwing darts
  • One at a time
  • Blindfolded
  • Identically distributed?
  • Uniform distribution?
  • Mutually exclusive?
  • Independent?
  • Pr[ Dart hits x ] = 0.05

10
Probability Concepts
  • Throwing darts
  • One at a time
  • Blindfolded
  • Three darts
  • Pr[ Hitting 20 3 times ] = 0.05 × 0.05 × 0.05
  • Pr[ Hit 20 at least twice ] = 0.007125 + 0.000125
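
The three-dart numbers above follow from the binomial formula. A minimal sketch, assuming independent throws that each hit 20 with probability 0.05; the helper name `binom_pmf` is illustrative and not from the slides.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k hits in n independent throws, each hitting with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p20 = 0.05  # one sector out of 20, assuming a uniform blindfolded throw

all_three = binom_pmf(3, 3, p20)                    # 0.05 * 0.05 * 0.05
at_least_twice = binom_pmf(2, 3, p20) + all_three   # exactly twice, plus all three

print(f"Pr[20 three times]    = {all_three:.6f}")
print(f"Pr[20 at least twice] = {at_least_twice:.6f}")
```

The "exactly twice" term is where the 0.007125 comes from: three ways to choose the miss, times 0.05 × 0.05 × 0.95.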

11
Probability Concepts
12
Probability Concepts
  • Throwing darts
  • One at a time
  • Blindfolded
  • Three darts
  • Pr[ Hitting evens 3 times ] = Pr[ Hitting 1-10 3 times ]
  • = 0.5 × 0.5 × 0.5
  • Pr[ Evens at least twice ] = 0.5
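
These values can also be checked by brute force rather than by formula. The sketch below enumerates every ordered three-dart outcome, assuming all 20 sectors are equally likely and throws are independent; the helper `prob` is a name introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

sectors = range(1, 21)                       # dartboard sectors 1..20, assumed equally likely
outcomes = list(product(sectors, repeat=3))  # all 8000 ordered three-dart results

def prob(event) -> Fraction:
    """Exact probability of an event as (favourable outcomes) / (all outcomes)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

all_even = prob(lambda o: all(x % 2 == 0 for x in o))
all_low = prob(lambda o: all(x <= 10 for x in o))
evens_at_least_twice = prob(lambda o: sum(x % 2 == 0 for x in o) >= 2)

print(all_even, all_low, evens_at_least_twice)
```

Enumeration only works for tiny cases like this one, but it makes no use of the binomial formula, so it is a useful independent check.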

13
Probability Concepts
14
Probability Concepts
  • Throwing darts
  • One at a time
  • Blindfolded
  • 100 darts
  • Pr[ Hitting 20 exactly 3 times ] = 0.139575
  • Pr[ Hit 20 at least twice ] = 0.9629188
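
With 100 darts the same quantities are easiest to get from the binomial formula and its complement. A short sketch, again assuming independent throws with a 0.05 chance of hitting 20:

```python
from math import comb

n, p = 100, 0.05  # 100 darts, 1-in-20 chance of hitting 20 on each

# Exactly three 20s out of 100 throws.
exactly_three = comb(n, 3) * p**3 * (1 - p) ** (n - 3)

# At least two 20s = 1 - Pr[no 20s] - Pr[exactly one 20].
at_least_twice = 1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)

print(f"{exactly_three:.6f} {at_least_twice:.7f}")
```

Note how an event that was rare with three darts becomes almost certain with 100: the number of throws matters as much as the per-throw probability.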

15
Probability Concepts
16
Match Score
  • Dartboard represents the mass range of the
    spectrum
  • Peaks of a spectrum are slices
  • Width of slice corresponds to mass tolerance
  • Darts represent
  • random masses
  • masses of fragments of a random peptide
  • masses of peptides of a random protein
  • masses of biomarkers from a random class
  • How many darts do we get to throw?

17
Match Score
  • What is the probability that we match at least 5
    peaks?
  • Same as the probability of hitting 20 at least 5
    times.

[Figure labels: 270, 330, 550, 580, 755, 870]
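
The dartboard picture can be turned into numbers. In the sketch below the per-dart match probability is taken as the fraction of the mass range covered by peaks, each widened by the mass tolerance. The mass range, tolerance, and number of darts are made-up values, and both function names are hypothetical.

```python
from math import comb

def random_match_prob(n_peaks: int, tolerance: float, mass_lo: float, mass_hi: float) -> float:
    """Fraction of the mass range covered by peaks of width 2 * tolerance (assumes no overlap)."""
    return min(1.0, n_peaks * 2 * tolerance / (mass_hi - mass_lo))

def prob_at_least(s: int, n: int, p: float) -> float:
    """Binomial tail: probability that at least s of n independent darts hit a peak."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s, n + 1))

# Assumed example: 6 peaks, +/- 0.5 Da tolerance, spectrum spanning 200-1000 Da.
p = random_match_prob(n_peaks=6, tolerance=0.5, mass_lo=200.0, mass_hi=1000.0)

# Assumed example: a candidate answer contributes 20 fragment masses (darts).
p_value = prob_at_least(5, 20, p)
print(f"p = {p:.4f}, Pr[at least 5 matches] = {p_value:.2e}")
```

Widening the tolerance or adding peaks makes the "slices" cover more of the board, which raises the chance of matching by luck.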
18
Match Score
  • Pr[ Match s peaks ] = Binomial( p, n )
  • ≈ Poisson( p n ), for small p and large n
  • p is the probability of a random mass / peak match,
  • n is the number of darts (fragments in our answer)
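
To see how close the Poisson approximation is, the two distributions can be tabulated side by side. This is a sketch using the earlier dart numbers (n = 100, p = 0.05) as an assumed example; the function names are illustrative.

```python
from math import comb, exp, factorial

def binomial_pmf(s: int, n: int, p: float) -> float:
    """Probability of exactly s random matches out of n darts."""
    return comb(n, s) * p**s * (1 - p) ** (n - s)

def poisson_pmf(s: int, lam: float) -> float:
    """Poisson probability of exactly s events with mean lam = p * n."""
    return exp(-lam) * lam**s / factorial(s)

n, p = 100, 0.05
for s in range(8):
    print(f"s={s}  binomial={binomial_pmf(s, n, p):.5f}  poisson={poisson_pmf(s, p * n):.5f}")
```

The agreement improves as p shrinks and n grows with p n held fixed, which is the regime the slide refers to.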

19
Match Score
  • Theoretical distribution
  • Used by OMSSA
  • Proposed, in various forms, by many.
  • Probability of random mass / peak match
  • IID (independent, identically distributed)
  • Based on match tolerance

20
Match Score
  • Theoretical distribution assumptions
  • Each dart is independent
  • Peaks are not related
  • Each dart is identically distributed
  • Chance of random mass / peak match is the same
    for all peaks

21
Tournament Size
[Figure: 100 darts, number of 20s; panels for 100, 1000, 10000, and 100000 people]
23
Number of Trials
  • Tournament size = number of trials
  • Number of peptides tried
  • Related to sequence database size
  • Probability that a random match score is ≥ s
  • = 1 - Pr[ all match scores < s ]
  • = 1 - Pr[ match score < s ]^Trials (*)
  • Assumes IID!
  • Expect value
  • E = Trials × Pr[ match ≥ s ]
  • Corresponds to Bonferroni bound on (*)
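
A small sketch of the effect of the number of trials, assuming IID trials as the slide does. The single-trial probability of 1e-6 is an arbitrary illustrative value, and the function names are not from the slides.

```python
def prob_any_match(p_single: float, trials: int) -> float:
    """1 - Pr[match score < s]^Trials: chance that at least one random trial reaches score s."""
    return 1.0 - (1.0 - p_single) ** trials

def expect_value(p_single: float, trials: int) -> float:
    """Expected number of random trials reaching score s (the Bonferroni upper bound)."""
    return trials * p_single

p_single = 1e-6  # assumed Pr[match >= s] for one random peptide
for trials in (100, 10_000, 1_000_000):
    print(f"trials={trials:>9}  Pr[any]={prob_any_match(p_single, trials):.6f}  "
          f"E={expect_value(p_single, trials):.6f}")
```

For small values the two numbers nearly coincide; as the database grows, the expect value keeps rising past 1 while the probability saturates, which is why the expect value is only a bound.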

24
Better Dart Throwers
25
Better Random Models
  • Comparison with completely random model isn't
    really fair
  • Match scores for real spectra with real peptides
    obey rules
  • Even incorrect peptides match with non-random
    structure!

26
Better Random Models
  • Want to generate random fragment masses (darts)
    that behave more like the real thing
  • Some fragments are more likely than others
  • Some fragments depend on others
  • Theoretical models can only incorporate this
    structure to a limited extent.

27
Better Random Models
  • Generate random peptides
  • Real looking fragment masses
  • No theoretical model!
  • Must use empirical distribution
  • Usually require they have the correct precursor
    mass
  • Score function can model anything we like!
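
One way to realize this empirical approach is to score shuffled versions of the candidate peptide (shuffling keeps the precursor mass unchanged) and report the fraction that score at least as well. The reduced residue table, the prefix-mass score, and every name below are assumptions made for illustration, not a method given in the slides.

```python
import random

# Monoisotopic residue masses (Da) for a small, assumed toy alphabet.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
                "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496}

def prefix_masses(peptide: str) -> list[float]:
    """Cumulative residue masses of every proper prefix (a stand-in for fragment masses)."""
    total, masses = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses

def score(peptide: str, peaks: list[float], tol: float = 0.5) -> int:
    """Toy score: number of prefix masses within tol of some peak."""
    return sum(any(abs(m - pk) <= tol for pk in peaks) for m in prefix_masses(peptide))

def empirical_p_value(peptide: str, peaks: list[float], n_random: int = 2000, seed: int = 0):
    """Fraction of shuffled peptides scoring at least as well as the candidate."""
    rng = random.Random(seed)
    observed = score(peptide, peaks)
    residues = list(peptide)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(residues)  # same residues, so same precursor mass
        if score("".join(residues), peaks) >= observed:
            hits += 1
    return observed, hits / n_random

peptide = "GASPVTLK"
peaks = prefix_masses(peptide)  # idealized spectrum containing every prefix mass
observed, p_value = empirical_p_value(peptide, peaks)
print(observed, p_value)
```

Because the distribution is built by sampling, the score function can be swapped for anything else without changing the rest of the procedure.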

28
Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
29
Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
30
Better Random Models
  • Truly random peptides don't look much like real
    peptides
  • Just use peptides from the sequence database!
  • Caveats
  • Correct peptide (non-random) may be included
  • Peptides are not independent
  • Reverse sequence avoids only the first problem
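
A reversed-sequence database is simple to build. The sketch below uses a hypothetical `REV_` naming prefix and made-up sequences; it illustrates the idea only and is not tied to any particular search engine.

```python
def reversed_decoys(proteins: dict[str, str], prefix: str = "REV_") -> dict[str, str]:
    """Return a decoy entry for each protein with its sequence reversed."""
    return {prefix + name: seq[::-1] for name, seq in proteins.items()}

# Made-up example sequences.
targets = {"P1": "MKTAYIAK", "P2": "MKTAYLAK"}
decoys = reversed_decoys(targets)
print(decoys)
```

The two example targets differ by one residue, and so do their decoys: reversal keeps the correct peptide out of the random sample, but it preserves the dependence between related sequences.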

31
Extrapolating from the Empirical Distribution
Fenyo & Beavis, Anal. Chem., 2003
32
Extrapolating from the Empirical Distribution
  • Often, the empirical shape is consistent with a
    theoretical model

Fenyo & Beavis, Anal. Chem., 2003
Geer et al., J. Proteome Research, 2004
33
Peptide Prophet
  • From the Institute for Systems Biology
  • Keller et al., Anal. Chem. 2002
  • Re-analysis of SEQUEST results
  • Spectra are trials (NOT peptides!)
  • Assumes that many of the spectra are not
    correctly identified

34
Peptide Prophet
Keller et al., Anal. Chem. 2002
Distribution of spectral scores in the results
35
Peptide Prophet
  • Assumes a bimodal distribution of scores, with a
    particular shape
  • Ignores database size
  • but it is included implicitly
  • Like empirical distribution for peptide sampling,
    can be applied to any score function
  • Can be applied to any search engine's results
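
The idea of a bimodal score distribution can be sketched as a two-component mixture, where Bayes' rule gives the probability that a score came from the "correct" component. The two Gaussian shapes and all parameter values below are assumptions chosen for illustration; they are not the shapes or fitted values used by Peptide Prophet.

```python
from math import exp, pi, sqrt

def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def prob_correct(score: float, prior_correct: float,
                 correct=(3.0, 1.0), incorrect=(0.0, 1.0)) -> float:
    """Posterior probability that a score belongs to the 'correct' component.

    correct / incorrect are assumed (mean, sd) pairs; prior_correct is the assumed
    fraction of spectra that are correctly identified.
    """
    pos = prior_correct * normal_pdf(score, *correct)
    neg = (1 - prior_correct) * normal_pdf(score, *incorrect)
    return pos / (pos + neg)

for s in (0.0, 1.5, 3.0, 4.5):
    print(f"score={s:.1f}  Pr[correct]={prob_correct(s, prior_correct=0.2):.3f}")
```

In practice the component shapes and the mixing proportion would be estimated from the observed scores, which is how the database size enters implicitly.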

36
Peptide Prophet
  • Caveats
  • Are spectra scores sampled from the same
    distribution?
  • Are there enough correct identifications for the
    second peak?
  • Are spectra independent observations?
  • Are distributions appropriately shaped?
  • Huge improvement over raw SEQUEST results

37
Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
38
Peptides to Proteins
39
Peptides to Proteins
  • A peptide sequence may occur in many different
    protein sequences
  • Variants, paralogues, protein families
  • Separation, digestion and ionization are not well
    understood
  • Proteins in sequence database are extremely
    non-random, and very dependent

40
Peptides to Proteins
41
Peptides to Proteins
  • Mascot
  • Protein score is sum of peptide scores
  • Assumes peptide identifications are independent!
  • SEQUEST
  • Keeps only one of the proteins for each peptide?

42
Peptides to Proteins
  • Protein Prophet
  • Nesvizhskii et al., Anal. Chem. 2003
  • Models probability that a protein is correct
    based on
  • Probability that its peptides are correct
  • Models probability that a peptide is correct
    based on
  • Probability that its proteins are correct
  • Proteins with one high-probability peptide are
    not eliminated
  • but are down-weighted
  • Assumes identification probabilities from the
    same protein are independent (like Mascot)
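
The independence assumption can be made concrete. If peptide identifications from one protein are treated as independent, the probability that at least one of them is correct is one minus the product of their error probabilities. This is a sketch of that assumption only, not the full model, and the function name is illustrative.

```python
from math import prod

def protein_probability(peptide_probs: list[float]) -> float:
    """Probability that at least one peptide identification is correct, assuming independence."""
    return 1.0 - prod(1.0 - p for p in peptide_probs)

print(protein_probability([0.9, 0.8]))        # two strong peptides
print(protein_probability([0.3, 0.3, 0.3]))   # three weak peptides
print(protein_probability([0.95]))            # a single-peptide protein
```

If the peptide identifications are in fact dependent, this product overstates the evidence, which is the caveat the slide raises.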

43
Peptides to Proteins
  • Best available method, to date, is Protein
    Prophet.
  • The problem will only get worse, as we search
    variants and isoform sequences
  • Proteins do not have a single sequence!
  • Peptide identification is not protein
    identification!

44
Publication Guidelines
45
Publication Guidelines
  • Computational parameters
  • Spectral processing
  • Sequence database
  • Search program
  • Statistical analysis
  • Number of peptides per protein
  • Each peptide sequence counts once!
  • Multiple forms of the same peptide count once!
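
The counting rule can be sketched as collapsing all forms of a peptide to its bare sequence before counting. The input format here, a peptide string with bracketed modification tags plus a charge state, is a hypothetical notation chosen for this example.

```python
import re

def distinct_peptides(identifications: list[tuple[str, int]]) -> set[str]:
    """Collapse charge states and modified forms to unique bare peptide sequences."""
    # Strip bracketed modification tags such as "[+16]" (assumed notation).
    return {re.sub(r"\[[^\]]*\]", "", peptide).upper() for peptide, _charge in identifications}

hits = [
    ("PEPTIDEK", 2),
    ("PEPTIDEK", 3),        # same sequence, different charge state
    ("PEPM[+16]TIDEK", 2),  # modified form
    ("PEPMTIDEK", 2),       # unmodified form of the same sequence
]
peptides = distinct_peptides(hits)
print(len(peptides), sorted(peptides))
```

Here four identifications reduce to two peptides for the purpose of the peptides-per-protein count.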

46
Publication Guidelines
  • Single-peptide proteins must be explicitly
    justified by
  • Peptide sequence
  • N and C terminal amino-acids
  • Precursor mass and charge
  • Peptide Scores
  • Multiple forms of the peptide counted once!
  • Biological conclusions based on single-peptide
    proteins must show the spectrum

47
Publication Guidelines
  • More stringent requirements for PMF data
    analysis
  • Similar to that for tandem mass spectra
  • Management of protein redundancy
  • Peptides identified from a different species?
  • Spectra submission encouraged

48
Summary
  • Could guessing be as effective as a search?
  • More guesses improve the best guess
  • Better guessers help us be more discriminating
  • Independent observations only count if they are
    independent!
  • Going from peptides to proteins is not as simple as it seems
  • Publication guidelines reflect sound statistical
    principles.