1
Chapter 2. Protein identification from MS/MS data
2
  • The most beautiful theory is one that comes from, and is
    applied to, real life.

3
From Genome to Drugs
Post-genome: gene/protein function analysis; protein identification
and analysis by mass spec and microarrays
ADMET: ADMET prediction
  • Drug Discovery
  • Computational chemistry
  • Protein 3D structure prediction
  • Receptor-ligand analysis

4
Flow of Biological Information
DNA
mRNA
Proteins
Modified Proteins
Genome
Transcriptome
Proteome

Proteome: the entire protein complement expressed by a genome, or by
a cell or tissue type
7
Challenge of Proteomics
Much more complex than genomics: proteins are expressed at different
levels, at different times, and in different forms (20 or more).
  • Yeast: 6,127 genes, 2,000 of unknown function
  • C. elegans (worm): 19,000 genes
  • Human genome: 30,000-35,000 genes, 20,000(?) of unknown function
8
Goals of Proteomics
  - Identification of each protein
  - Levels of protein expression (these do not always correlate with
    mRNA levels)
  - Post-translational modifications (PTMs): types and sites
  - Protein/protein interactions: molecular machines
  - Protein localization (organellar proteomics, cell map)
  - Protein function in normal, stressed, and diseased states
9
Challenges of Analyzing Proteins
Size: from 50 to 100,000 a.a.; 5,000-1,000,000 Da
(1 Dalton ≈ 1.66 × 10^-27 kg)
Diversity: 30,000 different sequences in human cells
Relative abundance: broad dynamic range, 10-1,000,000 copies per cell
Different forms: post-translational modifications (93 forms of tau
proteins!)
10
Proteomics Science Market
  • There are two key proteomics technologies for
    identifying proteins in cells:
  • Microarrays (cheap), at the mRNA level
  • Mass spectrometers, at the protein level
  • Importance of mass spec technology:
  • Not all mRNAs are translated
  • Different levels of expression
  • PTMs (post-translational modifications)
  • Mass spec market size: 2 billion per year, half of it for
    protein purposes. Main manufacturers: Micromass,
    ABI, MDS Sciex, Bruker, Thermo Finnigan.

11
Chemical Composition of Living Matter
27 of 92 natural elements are essential.
Elements in biomolecules (organic matter): H, C, N, O, P, S. These
elements represent approximately 92% of dry weight.
Elements occurring as ions: Na+, K+, Mg2+, Cl-, Ca2+
Trace elements: B, F, Al, Si, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Se, Mo,
Sn, I. Fe is involved in the transport of oxygen. Cu and Zn are
important at the active site of certain enzymes.
12
Organic Matter
Derivatives of carbon (61% of dry weight), high chemical versatility
Single bonds: C-H, C-C, C-O, C-N, C-S
Double bonds: C=C, C=O, C=N
Triple bonds: C≡C, C≡N (rare in biomolecules)
Often organized in "building blocks":
  • amino acids: polypeptides (proteins)
  • monosaccharides: starch, glycogen
  • nucleic acids: DNA, RNA
13
Mass (Weights) of atoms and molecules
Element  Nominal mass  Exact mass  Abundance (%)  Average mass
C        12            12.00000    98.9           12.011
         13            13.00335     1.1
H         1             1.00783    99.98           1.00794
          2             2.0140      0.02
O        16            15.99491    99.8           15.9994
         18            17.9992      0.2
N        14            14.00307    99.63          14.0067
         15            15.00011     0.37
S        32            31.97207    95.0           32.064
         33            32.97147     0.76
         34            33.96786     4.22
P        31            30.9738    100
14
Amino Acids (20)
MW is given before losing water.
Exact mass of amino acid residues in proteins
Note: Leu (L) = Ile (I) = 113.08406
15
A table of amino acid masses
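The table itself is not part of this transcript. As a stand-in, the sketch below lists the standard monoisotopic residue masses (rounded to five decimals) and shows two things the later slides rely on: a peptide's mass is the sum of its residues plus one water, and a few residues are identical or nearly identical in mass. The names `RESIDUE_MASS`, `peptide_mass`, and `near_isobaric_pairs` are illustrative and not taken from the slides.

```python
from itertools import combinations

# Standard monoisotopic residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O, added back once per peptide


def peptide_mass(sequence):
    """Monoisotopic mass of a neutral peptide: residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def near_isobaric_pairs(tolerance):
    """Residue pairs whose masses differ by at most `tolerance` Da."""
    return [(a, b) for a, b in combinations(sorted(RESIDUE_MASS), 2)
            if abs(RESIDUE_MASS[a] - RESIDUE_MASS[b]) <= tolerance]


# Pairs that an instrument with 0.05 Da accuracy cannot tell apart.
print(near_isobaric_pairs(0.05))
```

Leu/Ile are exactly isobaric, while Lys/Gln differ by about 0.036 Da, so whether they can be separated depends on instrument accuracy.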
16
To identify the amino acid sequence of a protein from a tissue, the
tissue is first reduced to a fraction, typically an organelle. This
organelle may contain 500 to 2,000 different proteins, with many
copies of each. The fraction is then run on a 1D or 2D gel to
separate the proteins further. The 2D gel separates the proteins by
mass and charge. Typically, each spot in the 2D gel contains many
copies of one protein. Each spot is excised from the gel and digested
with trypsin. Trypsin cleaves the protein after every occurrence of
arginine (R) or lysine (K), except when the next residue is proline.
The resulting protein pieces are called peptides.
17
Typical 2D Gel
18
Cleavage with trypsin (tryptic digestion)
Trypsin cleaves only at the peptide bond between R(n-1) and R(n),
where R(n-1) is Lys or Arg and R(n) ≠ Pro.
...-Xxx-Xxx-R(n-1)-R(n)-R(n+1)-Xxx-Xxx-...
Example (trypsin cleaves after Arg and after Lys):
Arg - Gly - Phe - Lys - Ile - Ala - Glu - Trp - Met
MW (average mass): 1136
Treatment with trypsin gives 3 fragments:
  1. Arg (MW 156 + 18 = 174)
  2. Gly-Phe-Lys (MW 57 + 147 + 128 + 18 = 350)
  3. Ile-Ala-Glu-Trp-Met (MW 113 + 71 + 129 + 186 + 131 + 18 = 648)
Note: each internal peptide ends with Lys (K) or Arg (R).
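The cleavage rule and the fragment masses in this example can be checked with a few lines of Python. This is a minimal sketch using nominal (integer) residue masses for only the residues in the example; `tryptic_digest` and `nominal_mass` are illustrative names, and real digestion also produces missed cleavages that are ignored here.

```python
import re

# Nominal residue masses for the residues in the worked example.
NOMINAL = {"R": 156, "G": 57, "F": 147, "K": 128, "I": 113,
           "A": 71, "E": 129, "W": 186, "M": 131}


def tryptic_digest(sequence):
    """Cut after every K or R unless the next residue is P."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if p]  # drop the empty tail after a final K/R


def nominal_mass(peptide):
    """Sum of residue masses plus 18 for the terminal water."""
    return sum(NOMINAL[aa] for aa in peptide) + 18


peptides = tryptic_digest("RGFKIAEWM")
print(peptides, [nominal_mass(p) for p in peptides])
```

Splitting on a zero-width pattern requires Python 3.7 or later.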
19
Tools for Protein Identification
Mass Spectrometry
Two ionization techniques:
  • Matrix-Assisted Laser Desorption/Ionization (MALDI)
  • Electrospray Ionization (ESI)
Both are used with many types of mass analyzers: TOF, quadrupole (Q),
Q-TOF, FT-ICR MS, Q-ITOF, etc.
Enzyme digestion, to get smaller fragments: trypsin (99%), LysC,
others
20
The Mass Spectrometer
Ionization: MALDI, ESI
Mass analyzers: time-of-flight, quadrupole, ion trap
21
Mass spectrometer
sample → ionization → ions → mass analyzer → detector → mass spectrum
(intensity vs. m/z)
22
MS fingerprint for proteins
23
MS fingerprint for protein
Protein: MPSESSYKVHRPAKSGGS
Trypsin digestion gives the peptides: MPSESSYK, VHR, PAK, SGGS
24
A real spectrum for hemoglobin
25
Search a database for match
MPSESSYKVHRPAKSGGS and every other protein in the database undergo
in silico digestion, which gives their theoretical spectra.
The theoretical spectra are then compared with the real spectrum.
26
Mascot interface
27
Too many matches
  • There are many peptides in the database with the
    same mass.
  • Many peaks are missing from the MS spectrum.
  • There is plenty of noise in the MS spectrum.
  • For each MS spectrum, there could be many proteins in the
    database that match it.

28
Tandem MS
29
Tandem MS
ions → mass analyzer → fragmentation → mass analyzer → detector
30
Tandem MS
Protein: MPSESSYKVHRPAKSGGS
Tryptic digestion gives the peptides: MPSESSYK, VHR, PAK, SGGS
Each selected peptide then goes through fragmentation and a second MS
stage.
31
How does a peptide fragment?
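b ions (N-terminal fragments):
$m(b_1) = 1 + m(A_1)$
$m(b_2) = 1 + m(A_1) + m(A_2)$
$m(b_3) = 1 + m(A_1) + m(A_2) + m(A_3)$
y ions (C-terminal fragments):
$m(y_1) = 19 + m(A_4)$
$m(y_2) = 19 + m(A_4) + m(A_3)$
$m(y_3) = 19 + m(A_4) + m(A_3) + m(A_2)$
Note: the C-terminal fragment has an extra H2O, 18 daltons. The a-ion
mass is the b-ion mass - 28; the c-ion mass is the b-ion mass + 17.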
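The b and y ladders can be generated mechanically from the residue masses. The sketch below uses nominal masses and singly charged ions, with the same +1 and +19 offsets as the slide; `fragment_ions` is an illustrative name, and the four-residue peptide G-A-S-P is an arbitrary example rather than one from the slides.

```python
def fragment_ions(residue_masses):
    """Return (b_ions, y_ions) as lists of nominal singly charged masses.

    b_i = 1 + mass of the first i residues (proton added)
    y_i = 19 + mass of the last i residues (water plus proton added)
    """
    b_ions, prefix = [], 0
    for mass in residue_masses[:-1]:
        prefix += mass
        b_ions.append(1 + prefix)

    y_ions, suffix = [], 0
    for mass in reversed(residue_masses[1:]):
        suffix += mass
        y_ions.append(19 + suffix)
    return b_ions, y_ions


# G-A-S-P with nominal residue masses 57, 71, 87, 97
b_ions, y_ions = fragment_ions([57, 71, 87, 97])
a_ions = [b - 28 for b in b_ions]   # a ion = b ion - 28
c_ions = [b + 17 for b in b_ions]   # c ion = b ion + 17
print(b_ions, y_ions)
```

Each b ion and its complementary y ion add up to the total residue mass plus 20, which is a handy sanity check on a fragment list.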
32
Peptide sequencing
  • A protein/peptide sequence consists of 20
    different types of amino acids.
  • Most amino acids have distinct masses.
  • Different peptide sequences (400-2000 Daltons)
    will produce different MS/MS spectra.
  • Peptide sequencing steps:
  • Preprocess
  • Infer the a.a. sequence de novo or by database search
  • Deal with PTMs

33
Protein identification after peptide sequencing
  • Search a database to find the protein that
    contains the most peptides.
  • From the peptide sequences (or partial
    sequences), design primers to PCR the region of
    the gene that encodes the protein, then
    sequence the gene.
  • Assemble the peptide sequences together.
  • This is not very realistic because of the loss of ions.

34
The MS/MS spectrum
N-term
C-term
35
More ions
  • Each N-term ion might lose an ammonia:
  • a-NH3, b-NH3, c-NH3
  • Each C-term ion might lose a water:
  • x-H2O, y-H2O, z-H2O
  • An ion is sometimes doubly or triply charged:
  • for a doubly charged ion, the m/z value is roughly halved.
  • One peptide can be fragmented twice:
  • internal cleavages, immonium ions.

36
A real MS/MS spectrum with good quality
LGSSEVEQVQLVVDGVK
37
Preprocessing
38
Raw data
39
Preprocess
40
Noise subtraction
  • Estimate a baseline and subtract the baseline
    value from each peak.
  • constant baseline
  • linear baseline
  • quadratic baseline
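A minimal sketch of the first two baseline models, in plain Python. It assumes the baseline is estimated from all points of the spectrum (the median for the constant model, a least-squares line for the linear model); the slides do not say how the baseline is estimated, so this choice and the function names are illustrative only. Negative values after subtraction are clipped to zero.

```python
from statistics import median


def subtract_constant(intensity):
    """Constant baseline: use the median intensity as the noise level."""
    base = median(intensity)
    return [max(0, value - base) for value in intensity]


def subtract_linear(mz, intensity):
    """Linear baseline: fit intensity = slope * mz + intercept by least squares."""
    n = len(mz)
    mean_x = sum(mz) / n
    mean_y = sum(intensity) / n
    sxx = sum((x - mean_x) ** 2 for x in mz)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mz, intensity))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [max(0.0, y - (slope * x + intercept)) for x, y in zip(mz, intensity)]


print(subtract_constant([2, 2, 10, 2, 2]))  # only the real peak survives
```

A fit through all points is pulled upward by strong peaks, so practical tools usually fit the baseline to local minima or within sliding windows instead.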

41
Isotopic ions
  • Carbon has two stable isotopes. 98.9% of
    carbon atoms in nature are C-12, whose mass is
    12 daltons. 1.1% of carbon is C-13, whose mass is
    13 daltons.
  • Because an ion may have 0, 1, 2, ..., k C-13 atoms, its
    mass can be m, m+1, m+2, ..., m+k.
  • The peaks at m+i (i ≥ 1) are called isotopic ion peaks.
  • Are isotopic peaks very low?

42
Isotopic ions
  • Each amino acid has several carbons, and each ion
    has several amino acids. If there are 50
    carbons, the probability of having at least one
    C-13 is
  • $1 - 0.989^{50} \approx 0.425$
  • In this example, the sum of the intensities of the
    isotopic ion peaks is comparable to that of the
    mono-isotopic peak.
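The 50-carbon example can be reproduced with a binomial model. The sketch below considers only carbon isotopes, as the slide does, and ignores the heavy isotopes of H, N, O, and S; `c13_distribution` is an illustrative name.

```python
from math import comb


def c13_distribution(n_carbons, max_k, p13=0.011):
    """Probability of exactly k C-13 atoms among n carbons, for k = 0..max_k."""
    return [comb(n_carbons, k) * p13 ** k * (1 - p13) ** (n_carbons - k)
            for k in range(max_k + 1)]


dist = c13_distribution(50, 3)
p_mono = dist[0]              # no C-13: the mono-isotopic peak
p_isotopic = 1 - p_mono       # one or more C-13: all isotopic peaks together
print(round(p_mono, 3), round(p_isotopic, 3))
```

The entries of `dist` give the expected relative heights of the peaks at m, m+1, m+2, and m+3.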

43
Multiply charged ions
  • An ion with mass m can have more than one charge
    (proton). If z protons are attached, the m/z
    value is (m + z)/z.
  • How do we determine the charge state of a peak?
  • Using isotopic ion peaks.

44
Multiply charged ions
  • The isotopic ions have masses m, m+1, m+2, ...
  • Therefore, the m/z values are
  • m/z, (m+1)/z, (m+2)/z, ...
  • Adjacent isotopic peaks are 1/z apart, so the m/z
    difference of the isotopic ions tells the charge state.
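Because adjacent isotopic peaks are about 1/z apart, the charge can be read off the peak spacing. A minimal sketch, assuming the input is a clean list of m/z values belonging to one isotopic cluster (picking out that cluster from a noisy spectrum is the harder part and is not shown); `charge_from_isotopes` is an illustrative name.

```python
def charge_from_isotopes(mz_peaks):
    """Estimate the charge z from the mean spacing of isotopic peaks (about 1/z)."""
    peaks = sorted(mz_peaks)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return round(1.0 / mean_gap)


print(charge_from_isotopes([500.0, 500.5, 501.0]))  # spacing 0.5
```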

45
Deconvolution
  • Deconvolution is the procedure to convert
    multiply charged peaks to singly charged, and sum
    up all the isotopic ion peaks.
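A minimal sketch of deconvolution for a single isotopic cluster whose charge is already known. It assumes a proton mass of 1.00728 Da (the slides use the nominal value 1) and takes the lowest m/z in the cluster as the mono-isotopic peak; both function names are illustrative.

```python
PROTON = 1.00728  # Da; the slides approximate this as 1


def to_singly_charged(mz, z):
    """Convert an m/z value observed at charge z to the m/z it would have at z = 1."""
    neutral_mass = mz * z - z * PROTON
    return neutral_mass + PROTON


def deconvolute(cluster, z):
    """Collapse one isotopic cluster [(mz, intensity), ...] into a single peak.

    The peak is placed at the singly charged mono-isotopic m/z and carries
    the summed intensity of all isotopic peaks.
    """
    mono_mz = min(mz for mz, _ in cluster)
    total_intensity = sum(intensity for _, intensity in cluster)
    return to_singly_charged(mono_mz, z), total_intensity


print(deconvolute([(500.5, 10), (501.0, 6), (501.5, 2)], z=2))
```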

46
Other ion types
  • http://www.matrixscience.com/help/fragmentation_help.html

47
Database search methods
  • Mascot
  • SEQUEST

48
Database search methods
  • For each possible peptide in the database,
  • do an in silico fragmentation to compute the
    theoretical spectrum;
  • compare the theoretical spectrum with the real spectrum and
    compute a matching score (this is the key step).
  • Output the peptide with the best matching score.
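The loop above can be sketched in a few lines. The scoring function here is a plain shared-peak count over b and y ions with nominal masses, chosen only to make the example small; it is an assumption for illustration and not the scoring used by Mascot or SEQUEST. All names are illustrative.

```python
# Nominal residue masses of the 20 standard amino acids.
NOMINAL = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
           "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
           "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}


def theoretical_spectrum(peptide):
    """In silico fragmentation: nominal singly charged b and y ions."""
    masses = [NOMINAL[aa] for aa in peptide]
    total = sum(masses)
    peaks, prefix = [], 0
    for mass in masses[:-1]:
        prefix += mass
        peaks.append(1 + prefix)            # b ion
        peaks.append(19 + total - prefix)   # complementary y ion
    return sorted(peaks)


def shared_peak_score(theoretical, observed, tol=0.5):
    """Number of theoretical peaks with an observed peak within tol."""
    return sum(1 for t in theoretical
               if any(abs(t - o) <= tol for o in observed))


def best_match(candidates, observed, tol=0.5):
    """Candidate peptide whose theoretical spectrum matches the most peaks."""
    return max(candidates,
               key=lambda p: shared_peak_score(theoretical_spectrum(p), observed, tol))


observed = [100, 150, 175, 237, 312]   # four real peaks plus one noise peak
print(best_match(["PAK", "VHR", "SGGS"], observed))
```

In practice the candidates are first filtered by precursor mass, and the score is weighted by intensity and ion type.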

49
Mascot
50
Weakness of database method
  • Does not work for unknown proteins
  • Trouble with PTMs

51
De novo sequencing
52
Spectrum Graph Approach
  • Each node of the graph represents a peak in the
    spectrum.
  • Two nodes are joined by an edge iff the two corresponding
    peaks differ by the mass of an amino
    acid.
  • The path that connects the two ends corresponds
    to a feasible solution.

53
Weakness of Spectrum Graph
  • Missing ions can leave no solution.
  • Remedy: add new edges that connect two peaks whose
    distance equals the total mass of two (or more)
    amino acids.
  • Noise causes too many feasible paths.
  • Remedy: add weights to the nodes and edges, and find the
    longest path.
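A toy version of the spectrum graph, to make the two weaknesses concrete. It assumes every peak is a singly charged b ion with a nominal mass, adds the two end nodes (1 for the empty prefix and the b mass of the whole peptide), and enumerates all paths; none of the remedies listed above (multi-residue edges, weights) is implemented. Function and variable names are illustrative.

```python
NOMINAL = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
           "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
           "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}


def spectrum_graph_paths(peaks, full_b_mass):
    """All residue strings spelled by paths from node 1 to node full_b_mass."""
    nodes = sorted(set(peaks) | {1, full_b_mass})
    by_mass = {}
    for aa, mass in NOMINAL.items():
        by_mass.setdefault(mass, []).append(aa)

    def extend(node, prefix):
        if node == full_b_mass:
            yield prefix
            return
        for nxt in nodes:
            # An edge exists iff the mass gap equals some amino acid mass.
            for aa in by_mass.get(nxt - node, []):
                yield from extend(nxt, prefix + aa)

    return sorted(extend(1, ""))


# Peptide T-S-P: b1 = 102, b2 = 189, full b mass = 286; 150 is a noise peak.
print(spectrum_graph_paths([102, 189, 150], 286))
# With b1 missing there is no path from end to end.
print(spectrum_graph_paths([189], 286))
```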

54
References
  • Dancík, V., Addona, T., Clauser, K., Vath, J.,
    and Pevzner, P. 1999. De novo protein sequencing
    via tandem mass-spectrometry. J. Computational
    Biology 6, 327-341.
  • Chen, T., Kao, M-Y., Tepel, M., Rush J., and
    Church, G. 2001. A dynamic programming approach
    to de novo peptide sequencing via tandem mass
    spectrometry. Journal of Computational Biology
    8(3), 325-337.
  • Lutefisk
  • Taylor, J.A., and Johnson, R.S. 1997. Rapid
    Commun. Mass Spectrom. 11, 1067-1075.
  • Taylor, J.A., and Johnson, R.S. 2001. Anal. Chem.
    73, 2594 - 2604.

55
The Sandwich Algorithm
B. Ma, K. Zhang, C. Liang. An efficient algorithm
for peptide de novo sequencing from MS/MS
spectrum. CPM 2003.
56
Ions
  • If $P = a_1 a_2 \dots a_k$, the mass of P is denoted as
  • $\|P\| = \sum_{j=1}^{k} \|a_j\|$
  • But the actual mass would be $\|P\| + 18$ because
    of the extra H2O at the C-terminal end.
  • When cut, a b ion's actual mass is its mass plus 1
    (proton), and a y ion's actual mass is its mass
    plus 19 (18 plus 1 proton).
  • Hence, $y_{k-i} + b_i = 20 + \|P\|$ for each i.

57
If only consider y-ions
  • Let the sequence be $a_1 a_2 \dots a_n$; then the y-ions
    are
  • $19 + m(a_n)$, $19 + m(a_n) + m(a_{n-1})$, ...
  • The matching score with which $a_1 a_2 \dots a_n$ matches the
    experimental spectrum is defined to be
  • the sum of the intensities of all peaks
    around the y-ion positions.
  • De novo sequencing:
  • Find the sequence that maximizes the matching
    score.

58
Matching score
(Figure: spectrum peaks annotated with the residues R, V, L, L, A, N,
Q, F, G, Y, E, G, L.)
59
Dynamic programming solution
  • Define DP[x] to be the maximum matching score
    achieved by a suffix whose y-ion mass is x.

60
Dynamic programming solution
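  • If x = 0, DP[x] = 0.
  • If x > 0, $DP[x] = h(x) + \max_a DP[x - m(a)]$, where h(x) is the
    summed intensity of the peaks around x.

1. Let m be the peptide mass.
2. For x from 0 to m with step size Δ, compute
   $DP[x] = h(x) + \max_a DP[x - m(a)]$.
3. Find the sequence of amino acids that leads to DP[m] by
   backtracking, and output the sequence.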
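A small sketch of this recurrence with nominal integer masses. It differs from the slide in one labeled respect: the table is indexed by y-ion mass, so the empty suffix sits at x = 19 rather than x = 0, and the target is 19 plus the total residue mass. Peaks are matched only at exact integer positions by default. The function name and the three-residue test peptide are illustrative.

```python
NOMINAL = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
           "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
           "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}


def de_novo_y(peaks, full_y_mass, tol=0):
    """Best-scoring sequence using y ions only.

    peaks: dict mapping integer m/z to intensity.
    full_y_mass: 19 + total residue mass of the peptide.
    Returns (sequence, score), or (None, -inf) if no sequence fits the mass.
    """
    def h(x):
        return sum(i for mz, i in peaks.items() if abs(mz - x) <= tol)

    neg_inf = float("-inf")
    dp = {19: 0.0}   # empty suffix: water plus proton, no peak counted
    back = {}        # residue added last to reach each y-ion mass
    for x in range(20, full_y_mass + 1):
        best, best_aa = neg_inf, None
        for aa, mass in NOMINAL.items():
            prev = dp.get(x - mass, neg_inf)
            if prev > best:
                best, best_aa = prev, aa
        if best_aa is not None:
            dp[x] = h(x) + best
            back[x] = best_aa

    if full_y_mass not in dp:
        return None, neg_inf
    # The suffix grows toward the N-terminus, so backtracking from the full
    # mass reads the residues in N-to-C order.
    sequence, x = [], full_y_mass
    while x != 19:
        sequence.append(back[x])
        x -= NOMINAL[back[x]]
    return "".join(sequence), dp[full_y_mass]


# Peptide T-S-P: y1 = 19 + 97 = 116, y2 = 116 + 87 = 203, full y mass = 304.
print(de_novo_y({116: 10.0, 203: 8.0}, 304))
```

Residues with equal nominal mass (L/I, K/Q) cannot be distinguished by this score; the sketch simply keeps the first one it meets.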
61
If consider both b and y ions
  • Let DP[x, y] be the matching score achieved by a
    length-x prefix and a length-y suffix. Then
  • If xy,
  • If yx,

62
Formalize
  • Let $A = a_1 \dots a_k$. If A is a b-ion, we denote its mass
    with
  • $\|A\|_b = 1 + \sum_{i=1}^{k} \|a_i\|$.
  • If A is a y-ion, we denote its mass with
  • $\|A\|_y = 19 + \sum_{i=1}^{k} \|a_i\|$.
  • If x is the mass of a b-ion, the related a-ion and
    c-ion are at x - 28 and x + 17, losing water gives x - 18, and
    losing ammonia gives x - 17. Denote this set as
  • $B(x) = \{x - 28,\ x - 18,\ x - 17,\ x,\ x + 17\}$
  • Similarly, if x is the mass of a y-ion, its related
    ion set is
  • $Y(x) = \{x - 18,\ x - 17,\ x,\ x + 26\}$
  • If P is a peptide of length n, its theoretical peaks
    are
  • $S(P) = \bigcup_{i=1}^{n-1} \left[ B(b_i) \cup Y(y_i) \right]$
  • For a set S and a spectrum $\mathcal{M}$, let the set of matched
    peaks be
  • $S' = \{ (x_i, h_i) \in \mathcal{M} : \text{there is } y \in S \text{ s.t. } |y - x_i| \le \delta \}$
  • Given S, $h(S) = \sum_{(x, h) \in S'} h$
63
De novo Sequencing Problem
  • Input: spectrum $\mathcal{M} = \{(x_i, h_i)\}_{i=1}^{n}$, mass
    $M = \|P\| + 20$, error bound $\delta$.
  • Construct P such that $|\,\|P\| + 20 - M\,| \le \delta$ and $h(S(P))$
    is maximized.

64
Chummy Pairs
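  • Let $P = a_1 a_2 \dots a_n$.
  • For a prefix $A = a_1 \dots a_k$, the set of ions caused by A is
  • $S_N(A) = \bigcup_{i=1}^{k} \left[ B(\|a_1 \dots a_i\|_b) \cup Y(M - \|a_1 \dots a_i\|_b) \right]$
  • For a suffix $A' = a_{n-k+1} \dots a_n$, the set of ions caused by A' is
  • $S_C(A') = \bigcup_{i=1}^{k} \left[ Y(\|a_{n-i+1} \dots a_n\|_y) \cup B(M - \|a_{n-i+1} \dots a_n\|_y) \right]$
  • If $P = A a A'$, then $S(P) = S_N(A) \cup S_C(A')$.
  • Definition. $(A, A')$ is called a chummy pair if
  • $\|A\|_b + \|A'\|_y < M$
  • and either of the following holds, where $A_1 \dots A_{-1}$ is A
    without its last letter and $A'_2 \dots A'$ is A' without its first
    letter:
  • $\|A_1 \dots A_{-1}\|_b < \|A'\|_y \le \|A\|_b$ (7)
  • $\|A'_2 \dots A'\|_y \le \|A\|_b < \|A'\|_y$ (8)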

65
The Lemmas
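  • Lemma 1. Let $(A, A')$ be a chummy pair and
  • $f(u, v, w) = h\big( [B(u) \cup Y(M - u)] \setminus [B(v) \cup Y(M - v) \cup Y(w) \cup B(M - w)] \big)$. Then
  • (i) if $(Aa, A')$ is a chummy pair, then
  • $h(S_N(Aa) \cup S_C(A')) = h(S_N(A) \cup S_C(A')) + f(\|Aa\|_b, \|A\|_b, \|A'\|_y)$
  • (ii) if $(A, aA')$ is a chummy pair, then
  • $h(S_N(A) \cup S_C(aA')) = h(S_N(A) \cup S_C(A')) + f(M - \|aA'\|_y, M - \|A'\|_y, M - \|A\|_b)$
  • Proof.
  • To prove (i), by definition,
  • $S_N(Aa) \cup S_C(A') = S_N(A) \cup S_C(A') \cup B(\|Aa\|_b) \cup Y(M - \|Aa\|_b)$. (13)
  • By the chummy pair definition, we have
  • $B(\|Aa\|_b) \cap [S_N(A_1 \dots A_{-1}) \cup S_C(A'_2 \dots A')] = \emptyset$
  • $Y(M - \|Aa\|_b) \cap [S_N(A_1 \dots A_{-1}) \cup S_C(A'_2 \dots A')] = \emptyset$
  • From the above two formulas, we have
  • $[S_N(A) \cup S_C(A')] \cap [B(\|Aa\|_b) \cup Y(M - \|Aa\|_b)] \subseteq B(\|A\|_b) \cup Y(M - \|A\|_b) \cup Y(\|A'\|_y) \cup B(M - \|A'\|_y)$ (14)
  • (13) and (14) prove (i). The proof of (ii) is
    similar.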

66
  • Lemma 2. Let $(A, A')$ be a chummy pair and
    $\|A\|_b + \|A'\|_y + \|a\| < M$.
  • (i) If $\|A\|_b < \|A'\|_y$, then $(Aa, A')$ is a chummy pair and
    $(A, aA')$ is not.
  • (ii) If $\|A'\|_y \le \|A\|_b$, then $(A, aA')$ is a
    chummy pair and $(Aa, A')$ is not.
  • Proof. Draw pictures using the definition.
  • Lemma 3. Let $(A, A')$ be a chummy pair. Then
    exactly one of $(A_1 \dots A_{-1}, A')$ and $(A, A'_2 \dots A')$ is
    a chummy pair.
  • Proof. Draw pictures using the definition, (7) and (8).
  • Lemma 4. Let P be the optimal solution. Then
    there is a chummy pair $(A, A')$ and an amino acid a
    so that $P = A a A'$.
  • Proof. Let $P = a_1 a_2 \dots a_m$. By Lemma 2, we find $(A, A')$
    by
  • 1. Let A and A' be the empty string.
  • 2. For i = 1 to m - 1 do
  • if $\|A\|_b < \|A'\|_y$ then $A = A\,a_{|A|+1}$
  • else $A' = a_{m-|A'|}\,A'$.
  • Now by Lemma 4 and $S(P) = S_N(A) \cup S_C(A')$ for
    $P = A a A'$, to find an optimal solution we look for a
    chummy pair such that
  • there is a letter a with $|\,\|A\|_b + \|A'\|_y + \|a\| - M\,| \le \delta$, and
  • $h(S_N(A) \cup S_C(A'))$ is maximized.
  • Let DP(x, y) be the maximum value of $h(S_N(A) \cup S_C(A'))$
    over all chummy pairs such that $\|A\|_b = x$ and
    $\|A'\|_y = y$.

67
Algorithm Sandwich
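  • Input: peak list $\mathcal{M}$, mass value M, error bound $\delta$,
    calibration $\Delta$.
  • Output: peptide P such that $h(S(P))$ is maximum and
    $|\,\|P\| + 20 - M\,| \le \delta$.
  • Initialize $DP[i, j] = -\infty$; $DP[1, 19] = 0$.
  • for x = 1 to M/2 + maxa step $\Delta$ do
  •   for y = x - maxa to min(x + maxa, M - x)
      step $\Delta$ do
  •     for a in $\Sigma$ do
  •       if x < y then
  •         $DP[x + a, y] = \max\{DP[x + a, y],\ DP[x, y] + f(x + a, x, y)\}$
  •       else
  •         $DP[x, y + a] = \max\{DP[x, y + a],\ DP[x, y] + f(M - y - a, M - y, M - x)\}$
  • Compute the best DP[x, y] over all x, y, a satisfying
    $|x + y + a - M| \le \delta$.
  • Compute the best A, A' and a by backtracking, and
    output AaA'.
  • Note: the max in the above DP is needed, as x + a may
    equal x' + a'.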

68
PEAKS vs Lutefisk with 3rd party data
Red means correct. PEAKS is 20 times faster.
69
Side topic: biomarkers from mass spec data
  • SELDI
  • Surface-Enhanced Laser Desorption/Ionization. It
    combines chromatography with TOF-MS.
  • Advantages of SELDI technology:
  • Uses small amounts of sample (biopsies, microdissected tissue).
  • Quickly obtains protein maps from multiple
    samples under the same conditions.
  • Ideal for discovering biomarkers quickly.

70
ProteinChip Arrays
71
SELDI Process
Copied from http://www.bmskorea.co.kr/new01_21-1.htm
72
Protein mapping
73
Biomarker Discovery
  • Markers can be easily found by comparing protein
    maps.
  • SELDI is faster and more reproducible than 2D
    PAGE.
  • Has been used to discover protein
    biomarkers of diseases such as ovarian,
    breast, prostate, and bladder cancers.

(Figure: protein maps of normal and cancer samples. Modified from the
Ciphergen web site.)
74
Inferring biomarkers
  • Inference using SVM
  • Decision list
  • Other learning tools

75
Assignment 2
  • Prove Lemma 1 case (ii). (note the change, Lemma
    2 is changed to 1)
  • Give a complete proof for Lemma 3.
  • Can you improve the sandwich algorithm to take
    internal fragmentation ions into consideration?

76
Term Project Options
  • Investigate the de novo sequencing problem when
    there are internal fragmentations and
    post-translational modifications (PTMs) in the
    peptide.
  • Biomarker inference algorithms (need to use real
    data here, download from web).
  • When two peptides are in the same spectrum,
    design a good algorithm to separate and sequence
    them.
  • Study good ways to deal with PTMs in database
    search, avoiding exponential growth in number of
    PTMs.

77
Open Questions
  • Polynomial time de novo sequencing algorithm with
    internal fragmentations and PTMs.
  • Effective de novo sequencing algorithm for low
    grade (ion trap) data.
  • Less important but cute: given a (peptide) mass M,
    find a segment in a protein database with mass M
    in time shorter than O(n / log n) (Mark Cieliebak's
    question). Preprocessing is allowed, but should not
    use too much memory.

78
Acknowledgement
  • Thanks to Tim Guo, Gilles Lajoie and Bin Ma ---
    some materials are adapted from their notes and
    papers.