Title: Chapter 2. Protein identification from MS/MS data
1 Chapter 2. Protein identification from MS/MS data
2- The most beautiful theory is the one that is from
and applied to real life.
3From Genome to Drugs
Post Genome Gene/Protein function
analysis Protein identification and analysis by
Mass Spec and Microarrays
ADMET ADMET prediction
- Drug Discovery
- Computational chemistry
- Protein 3D structure prediction
- Receptor-ligand analysis
4Flow of Biological Information
DNA
mRNA
Proteins
Modified Proteins
Genome
Transcriptome
Proteome
Proteome entire protein complement
expressed by a genome or by a cell or tissue
type
7Challenge of Proteomics
Much more complex than genomics: proteins are expressed at different levels, at different times, and in different forms (20 or more).
- Yeast: 6,127 genes, 2,000 of unknown function
- C. elegans (worm): 19,000 genes
- Human genome: 30,000-35,000 genes, 20,000 (?) of unknown function
8Goals of Proteomics
- Identification of each protein
- Levels of protein expression, which do not always correlate with mRNA levels
- Post-translational modifications (PTMs): types and sites
- Protein/protein interactions: molecular machines
- Protein localization (organellar proteomics, cell map)
- Protein function in normal, stressed, and diseased states
9Challenges of Analyzing Proteins
- Size: from 50 to 100,000 a.a., 5,000-1,000,000 Da (1 Dalton = 1.66 x 10^-27 kg)
- Diversity: in human cells, 30,000 different sequences
- Relative abundance: broad dynamic range, 10-1,000,000 copies per cell
- Different forms: post-translational modifications, e.g. 93 forms of tau proteins
10Proteomics Science Market
- There are two key proteomics technologies for identifying proteins in cells
- Microarrays (cheap), at the mRNA level
- Mass spectrometers, at the protein level
- Importance of mass spec technology
- Not all mRNAs are translated
- Different levels of expression
- PTMs (post-translational modifications)
- Mass spec market size: 2 billion / year, half of it for protein purposes. Main manufacturers: Micromass, ABI, MDS Sciex, Bruker, Thermo Finnigan.
11Chemical Composition of Living Matter
27 of the 92 natural elements are essential.
- Elements in biomolecules (organic matter): H, C, N, O, P, S. These elements represent approximately 92% of dry weight.
- Elements occurring as ions: Na+, K+, Mg2+, Cl-, Ca2+
- Trace elements: B, F, Al, Si, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Se, Mo, Sn, I
- Fe is involved in the transport of oxygen. Cu and Zn are important at the active site of certain enzymes.
12Organic Matter
Derivatives of carbon (61% of dry weight), high chemical versatility
- Single bonds: C-H, C-C, C-O, C-N, C-S
- Double bonds: C=C, C=O, C=N
- Triple bonds: C≡C, C≡N (rare in biomolecules)
Often organized in "building blocks":
- amino acids → polypeptides (proteins)
- monosaccharides → starch, glycogen
- nucleic acids: DNA, RNA
13Mass (Weights) of atoms and molecules
Element  Nominal mass  Exact mass  Abundance (%)  Average mass
C        12            12.00000    98.9           12.011
         13            13.00335    1.1
H        1             1.00783     99.98          1.00794
         2             2.0140      0.02
O        16            15.99491    99.8           15.9994
         18            17.9992     0.20
N        14            14.00307    99.63          14.0067
         15            15.00011    0.37
S        32            31.97207    95.0           32.064
         33            32.97147    0.76
         34            33.96786    4.22
P        31            30.9738     100
14Amino Acids (20)
MW before losing water
Exact Mass of Amino Acid Residues in Proteins
Note: Leu (L) = Ile (I) = 113.08410
15A table of amino acid masses
16To identify the amino acid sequence of a protein from a tissue, the tissue is first reduced to a fraction, typically an organelle. This fraction may contain 500 to 2000 different proteins, with many copies of each protein. The fraction is then run on a 1D or 2D gel to separate the proteins further. A 2D gel separates the proteins by mass and charge. Typically, each spot in the 2D gel contains many copies of one protein. Each spot is excised from the gel and digested with trypsin. Trypsin cleaves the protein after every arginine (R) and lysine (K), except when the next residue is proline (P). The resulting protein pieces are called peptides.
17Typical 2D Gel
18Cleavage with trypsin (tryptic digestion)
Trypsin cleaves only at the peptide bond between R(n-1) and R(n), where R(n-1) = Lys or Arg and R(n) ≠ Pro:
... -Xxx-Xxx-R(n-1)-R(n)-R(n+1)-Xxx- ...
Example:
Arg - Gly - Phe - Lys - Ile - Ala - Glu - Trp - Met
MW (average mass): 1136
Treatment with trypsin (cleavage after Arg and after Lys) gives 3 fragments:
1. Arg (MW 174)
2. Gly-Phe-Lys (MW 57 + 147 + 128 + 18 = 350)
3. Ile-Ala-Glu-Trp-Met (MW 113 + 71 + 129 + 186 + 131 + 18 = 648)
Note: each internal peptide will end with Lys (K) or Arg (R).
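The digestion rule and the mass arithmetic of the example can be checked with a short script. This is only a sketch: the function names `tryptic_digest` and `peptide_mass` are made up for illustration, and the integer residue masses are the rounded average values used on the slide.

```python
# Rounded average residue masses (Da) for the residues in the slide example.
RESIDUE_MASS = {"R": 156, "G": 57, "F": 147, "K": 128, "I": 113,
                "A": 71, "E": 129, "W": 186, "M": 131}
WATER = 18  # one H2O is added back to every free peptide


def tryptic_digest(sequence):
    """Cleave after every K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


fragments = tryptic_digest("RGFKIAEWM")
masses = [peptide_mass(p) for p in fragments]
print(fragments, masses)
```

For the nine-residue example this yields the three fragments listed above with masses 174, 350 and 648.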
19Tools for Protein Identification
Mass Spectrometry
Two ionization techniques
Matrix Assisted Laser Desorption/Ionization
(MALDI)
Electrospray Ionization (ESI)
Both with many types of mass analyzers TOF,
Quadrupole (Q), Q-TOF, FT ICR MS, Q-ITOF, etc.
Enzyme Digestion To get smaller fragments
Trypsin (99), LysC, others
20The Mass Spectrometer
MALDI ESI
Time-of-flight Quadrupole Ion trap
21Mass spectrometer
(Diagram: sample → ionization → ions → mass analyzer → detector → mass spectrum, plotted as intensity versus m/z.)
22MS fingerprint for proteins
23MS fingerprint for protein
protein
MPSESSYKVHRPAKSGGS
trypsin digestion
MPSESSYK
VHR
PAK
SGGS
peptides
24A real spectrum for hemoglobin
25Search a database for match
MPSESSYKVHRPAKSGGS
another protein
in silico digestion
in silico digestion
Theoretical Spectra
Real Spectrum
26Mascot interface
27Too many matches
- There are many peptides in the database with the same mass.
- There are many missed peaks in the mass spectrum.
- There is a lot of noise in the mass spectrum.
- For each mass spectrum, there could be many proteins in the database that match it.
28Tandem MS
29Tandem MS
detector
mass analyzer
mass analyzer
fragmentation
ions
30Tandem MS
MPSESSYKVHRPAKSGGS
Tryptic digestion
MPSESSYK
VHR
PAK
SGGS
fragmentation MS
fragmentation MS
31How does a peptide fragment?
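$m(b_1) = 1 + m(A_1)$, $m(b_2) = 1 + m(A_1) + m(A_2)$, $m(b_3) = 1 + m(A_1) + m(A_2) + m(A_3)$
$m(y_1) = 19 + m(A_4)$, $m(y_2) = 19 + m(A_4) + m(A_3)$, $m(y_3) = 19 + m(A_4) + m(A_3) + m(A_2)$
The b ions contain the N terminal; the y ions contain the C terminal.
Note: the C terminal has an extra H2O, 18 daltons. The a-ion mass is the b-ion mass - 28, and the c-ion mass is the b-ion mass + 17.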
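As a small illustration of these formulas, the following sketch lists the singly charged b- and y-ion masses of a peptide. It assumes nominal (integer) residue masses; the helper names `b_ions` and `y_ions` are hypothetical.

```python
# Nominal residue masses (Da) of the 20 amino acids.
NOMINAL_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101,
                "C": 103, "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128,
                "K": 128, "E": 129, "M": 131, "H": 137, "F": 147, "R": 156,
                "Y": 163, "W": 186}


def b_ions(peptide):
    """m(b_i) = 1 + mass of the first i residues, for i = 1..n-1."""
    total, ions = 1, []
    for residue in peptide[:-1]:
        total += NOMINAL_MASS[residue]
        ions.append(total)
    return ions


def y_ions(peptide):
    """m(y_i) = 19 + mass of the last i residues, for i = 1..n-1."""
    total, ions = 19, []
    for residue in reversed(peptide[1:]):
        total += NOMINAL_MASS[residue]
        ions.append(total)
    return ions


print(b_ions("PAK"))  # b1, b2
print(y_ions("PAK"))  # y1, y2
```

For the tryptic peptide PAK, each complementary pair (b1 with y2, b2 with y1) adds up to the residue mass of the peptide plus 20.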
32Peptide sequencing
- A protein/peptide sequence consists of 20 different types of amino acids.
- Most amino acids have distinct masses.
- Different peptide (400-2000 Daltons) sequences will produce different MS/MS spectra.
- Peptide sequencing
- Preprocess
- Infer a.a. sequence de novo or by database
- Deal with PTM
33Protein identification after peptide sequencing
- Search a database to find the protein that contains the most peptides.
- From the peptide sequences (or partial sequences), design primers to PCR the region of the gene that encodes the protein, then sequence the DNA of the gene.
- Assemble the peptide sequences together
- not very realistic because of the loss of ions.
34The MS/MS spectrum
N-term
C-term
35More ions
- Each N-term ion might lose an ammonia
- a-NH3, b-NH3, c-NH3
- Each C-term ion might lose a water
- x-H2O, y-H2O, z-H2O
- An ion is sometimes doubly or triply charged
- for a doubly charged ion, the m/z value is roughly halved.
- One peptide can be fragmented twice
- internal cleavages, immonium ions.
36A real MS/MS spectrum with good quality
LGSSEVEQVQLVVDGVK
37Preprocessing
38Raw data
39Preprocess
40Noise subtraction
- Estimate a baseline and subtract the baseline value from each peak.
- constant baseline
- linear baseline
- quadratic baseline
41Isotopic ions
- There are two stable isotopes of carbon. 98.9% of carbon atoms in nature are C-12, whose mass is 12 daltons. 1.1% are C-13, whose mass is 13 daltons.
- Because an ion may have 0, 1, 2, ..., k C-13 atoms, its mass can be m, m+1, m+2, ..., m+k.
- The peaks at m+i (i ≥ 1) are called isotopic ion peaks.
- Are isotopic peaks very low?
42Isotopic ions
- Each amino acid has several carbons, and each ion has several amino acids. If there are 50 carbons, the probability of having at least one C-13 is
- $1 - (0.989)^{50} \approx 0.425$
- In this example, the sum of the intensities of the isotopic ion peaks is nearly as high as the monoisotopic peak.
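The 50-carbon estimate can be reproduced numerically. The sketch below treats each carbon as independently C-13 with probability 0.011 (a binomial model that ignores the isotopes of the other elements); `c13_distribution` is a hypothetical helper name.

```python
from math import comb

P_C13 = 0.011  # natural abundance of C-13


def c13_distribution(n_carbons, max_k):
    """Probability of exactly k C-13 atoms among n_carbons, for k = 0..max_k."""
    return [comb(n_carbons, k) * P_C13 ** k * (1 - P_C13) ** (n_carbons - k)
            for k in range(max_k + 1)]


dist = c13_distribution(50, 3)
p_mono = dist[0]          # no C-13 at all: the monoisotopic peak
p_isotopic = 1 - p_mono   # at least one C-13: all isotopic peaks together
print(round(p_mono, 3), round(p_isotopic, 3))
```

Under this model the monoisotopic peak carries a little more than half of the total intensity, and the isotopic peaks share the rest.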
43Multiply charged ions
- An ion with mass m can have more than one charge (proton). If z protons are attached, the m/z value is (m + z)/z.
- How do we determine the charge state of a peak?
- Using isotopic ion peaks.
44Multiply charged ions
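- The isotopic ions have masses m, m+1, m+2, ...
- Therefore, with z protons attached, the m/z values are
- (m + z)/z, (m + 1 + z)/z, (m + 2 + z)/z, ...
- Adjacent isotopic peaks are 1/z apart, so the m/z difference of the isotopic ions tells the charge state.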
45Deconvolution
- Deconvolution is the procedure to convert
multiply charged peaks to singly charged, and sum
up all the isotopic ion peaks.
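A minimal sketch of this procedure for a single isotopic cluster follows. It assumes the cluster has already been isolated and sorted by m/z, and that the proton mass is taken as 1; the function names are hypothetical.

```python
def charge_from_spacing(mz_a, mz_b):
    """Adjacent isotopic peaks are 1/z apart in m/z, so z = 1 / spacing."""
    return round(1.0 / abs(mz_b - mz_a))


def to_singly_charged(mz, z):
    """Convert (m + z)/z to the singly charged value m + 1."""
    return mz * z - z + 1


def deconvolute(cluster):
    """cluster: list of (m/z, intensity) for one ion, sorted by m/z.

    Returns the singly charged monoisotopic m/z and the summed intensity.
    """
    z = charge_from_spacing(cluster[0][0], cluster[1][0])
    total_intensity = sum(intensity for _, intensity in cluster)
    return to_singly_charged(cluster[0][0], z), total_intensity


# Doubly charged ion with neutral mass 1000: peaks at 501.0, 501.5, 502.0.
cluster = [(501.0, 60.0), (501.5, 30.0), (502.0, 10.0)]
print(deconvolute(cluster))
```

Real spectra need an extra step that groups peaks into clusters and tolerates measurement error in the spacing; that step is left out here.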
46Other ion types
- http://www.matrixscience.com/help/fragmentation_help.html
47Database search methods
48Database search methods
- For each possible peptide in the database,
- do an in silico fragmentation to compute the theoretical spectrum
- compare the theoretical spectrum with the real spectrum and compute a matching score (this is key).
- Output the peptide with the best matching score.
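The loop above can be sketched as follows. The scoring function used here (sum of the intensities of the experimental peaks that fall near a theoretical b- or y-ion) is only one simple choice and is not the score used by Mascot or any other specific tool; nominal masses and all function names are assumptions for illustration.

```python
NOMINAL_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101,
                "C": 103, "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128,
                "K": 128, "E": 129, "M": 131, "H": 137, "F": 147, "R": 156,
                "Y": 163, "W": 186}


def theoretical_spectrum(peptide):
    """In silico fragmentation: singly charged b- and y-ion masses."""
    peaks, prefix, suffix = set(), 1, 19
    n = len(peptide)
    for i in range(1, n):
        prefix += NOMINAL_MASS[peptide[i - 1]]  # b_i
        suffix += NOMINAL_MASS[peptide[n - i]]  # y_i
        peaks.update((prefix, suffix))
    return peaks


def matching_score(peptide, spectrum, tol=0.5):
    """Sum of intensities of experimental peaks near a theoretical peak."""
    theo = theoretical_spectrum(peptide)
    return sum(h for mz, h in spectrum
               if any(abs(mz - t) <= tol for t in theo))


def best_match(candidates, spectrum, tol=0.5):
    return max(candidates, key=lambda p: matching_score(p, spectrum, tol))


candidates = ["MPSESSYK", "VHR", "PAK", "SGGS"]
# Made-up spectrum: four fragment peaks of PAK plus one noise peak at 120.
spectrum = [(98, 10.0), (120, 5.0), (147, 30.0), (169, 20.0), (218, 25.0)]
print(best_match(candidates, spectrum))
```

A practical search would first restrict the candidates to peptides whose total mass agrees with the measured precursor mass, which this sketch omits.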
49Mascot
50Weakness of database method
- Does not work for unknown proteins
- Trouble with PTMs
51De novo sequencing
52Spectrum Graph Approach
- Each node of the graph represents a peak in the spectrum.
- Two nodes have an edge iff the distance between the two corresponding peaks equals the mass of an amino acid.
- A path that connects the two ends corresponds to a feasible solution.
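A toy version of the spectrum graph is sketched below. It assumes that all peaks have already been converted to b-ion (prefix) masses, that the two end nodes (mass 1 for the empty prefix and the full prefix mass) have been added, and that masses are nominal integers; the names are hypothetical.

```python
# Nominal residue mass -> residue. L also stands for I, and K for Q,
# because those pairs share the same nominal mass.
RESIDUE_BY_MASS = {57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T",
                   103: "C", 113: "L", 114: "N", 115: "D", 128: "K",
                   129: "E", 131: "M", 137: "H", 147: "F", 156: "R",
                   163: "Y", 186: "W"}


def spectrum_graph(peaks):
    """Edge u -> v whenever v - u equals the mass of one amino acid."""
    nodes = sorted(peaks)
    edges = {u: [] for u in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            residue = RESIDUE_BY_MASS.get(v - u)
            if residue is not None:
                edges[u].append((v, residue))
    return edges


def all_paths(edges, start, end):
    """Every residue string spelled by a path from start to end."""
    if start == end:
        return [""]
    return [residue + tail
            for nxt, residue in edges[start]
            for tail in all_paths(edges, nxt, end)]


# Prefix masses of PAK (1, 98, 169, 297) plus one noise peak at 150.
graph = spectrum_graph([1, 98, 150, 169, 297])
print(all_paths(graph, 1, 297))
```

The noise peak at 150 has an outgoing edge to 297 (a difference of 147, the mass of F) but cannot be reached from the start node, so it does not produce a feasible path.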
53Weakness of Spectrum Graph
- Missing ions cause no solution.
- Add new edges that connect two peaks with distance equal to the total mass of two (or more) amino acids.
- Noise causes too many feasible paths.
- Add weights to the nodes and edges. Find the longest path.
54References
- Dancík, V., Addona, T., Clauser, K., Vath, J., and Pevzner, P. 1999. De novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology 6, 327-341.
- Chen, T., Kao, M-Y., Tepel, M., Rush, J., and Church, G. 2001. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology 8(3), 325-337.
- Lutefisk
- Taylor, J.A., and Johnson, R.S. 1997. Rapid Commun. Mass Spectrom. 11, 1067-1075.
- Taylor, J.A., and Johnson, R.S. 2001. Anal. Chem. 73, 2594-2604.
55The Sandwich Algorithm
B. Ma, K. Zhang, C. Liang. An efficient algorithm for peptide de novo sequencing from MS/MS spectrum. CPM 2003.
56Ions
- If $P = a_1 a_2 \ldots a_k$, the mass of $P$ is denoted as
- $\|P\| = \sum_{j=1}^{k} \|a_j\|$
- But the actual mass would be $\|P\| + 18$ because of the extra H2O at the C terminal end.
- When cut, a b-ion's actual mass is its residue mass plus 1 (proton), and a y-ion's actual mass is its residue mass plus 19 (18 plus 1 proton).
- Hence, $\|y_{k-i}\| + \|b_i\| = 20 + \|P\|$, for each $i$.
57If only consider y-ions
- Let the sequence be $a_1 a_2 \ldots a_n$; then the y-ions are
- $19 + m(a_n)$, $19 + m(a_n) + m(a_{n-1})$, ...
- The matching score that $a_1 a_2 \ldots a_n$ matches the experimental spectrum is defined to be
- the sum of all the intensities of the peaks around the y-ion positions.
- De novo sequencing
- Find the sequence that maximizes the matching score.
58Matching score
(Figure: spectrum with peaks labeled by residues R, V, L, L, A, N, Q, F, G, Y, E, G, L.)
59Dynamic programming solution
- Define $DP[x]$ to be the maximum matching score caused by a suffix with y-ion mass $x$.
60Dynamic programming solution
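1. Let $m$ be the peptide mass.
2. For $x$ from 0 to $m$, with step size $\Delta$: $DP[x] = h(x) + \max_a DP[x - m(a)]$, where $h(x)$ is the sum of the peak intensities around position $x$.
3. Find the sequence of amino acids that leads to $DP[m]$ (backtracking) and output the sequence.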
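The recurrence can be tried out on a toy spectrum. The sketch below makes several simplifying assumptions that are not in the slides: nominal integer masses with step size 1, $h(x)$ taken as the intensity of a peak exactly at $x$, the empty suffix placed at y-ion mass 19, and I and Q left out of the alphabet because they share nominal masses with L and K. The name `de_novo_y_only` is hypothetical.

```python
# Nominal residue masses; I and Q are omitted (same nominal mass as L and K).
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
        "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}
EMPTY_SUFFIX = 19  # y-ion mass of the empty suffix (H2O plus a proton)


def de_novo_y_only(spectrum, total_y_mass):
    """spectrum: dict {y-ion mass: intensity}.

    Returns (sequence, score), or None if no sequence has the given mass.
    """
    neg_inf = float("-inf")
    dp = {EMPTY_SUFFIX: 0.0}  # dp[x]: best score of a suffix with y-mass x
    back = {}                 # residue prepended to reach mass x
    for x in range(EMPTY_SUFFIX + 1, total_y_mass + 1):
        best, best_residue = neg_inf, None
        for residue, m in MASS.items():
            prev = dp.get(x - m, neg_inf)
            if prev > best:
                best, best_residue = prev, residue
        if best_residue is not None:
            dp[x] = best + spectrum.get(x, 0.0)
            back[x] = best_residue
    if total_y_mass not in dp:
        return None
    # Backtrack: the residue stored at the largest mass is the N-terminal one.
    sequence, x = [], total_y_mass
    while x != EMPTY_SUFFIX:
        sequence.append(back[x])
        x -= MASS[back[x]]
    return "".join(sequence), dp[total_y_mass]


# y-ions of PEAS (106, 177, 306) plus one noise peak at 250.
toy_spectrum = {106: 20.0, 177: 35.0, 250: 5.0, 306: 15.0}
result = de_novo_y_only(toy_spectrum, 19 + 97 + 129 + 71 + 87)
print(result)
```

Residues whose mass equals the sum of two others (for example K = G + A in nominal mass) give tied scores in this y-only model, which is one reason the toy peptide avoids them.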
61If consider both b and y ions
- Let DPx,y be the matching score caused by a
length x prefix and a length y suffix. Then - If xy,
- If yx,
62Formalize
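- Let $A = a_1 \ldots a_k$. If $A$ is a b-ion, we denote its mass with
- $\|A\|_b = 1 + \sum_{i=1}^{k} \|a_i\|$.
- If $A$ is a y-ion, we denote its mass with
- $\|A\|_y = 19 + \sum_{i=1}^{k} \|a_i\|$.
- If $x$ is the mass of a b-ion, the related a-ion and c-ion are at $x - 28$ and $x + 17$, losing water gives $x - 18$, and losing ammonia gives $x - 17$. Denote this set as
- $B(x) = \{x - 28, x - 18, x - 17, x, x + 17\}$
- Similarly, if $x$ is the mass of a y-ion, its related ion set is
- $Y(x) = \{x - 18, x - 17, x, x + 26\}$
- If $P$ is a peptide of length $n$, with b-ion masses $b_i$ and y-ion masses $y_i$, the theoretical peaks are
- $S(P) = \bigcup_{i=1}^{n-1} [B(b_i) \cup Y(y_i)]$
- For a set $S$ and a spectrum $\mathcal{M}$, let the set of matched peaks be
- $S^* = \{(x_i, h_i) \in \mathcal{M} : \text{there is } y \in S \text{ s.t. } |y - x_i| \le \delta\}$
- Given $S$, $h(S) = \sum_{(x, h) \in S^*} h$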
63De novo Sequencing Problem
- Input: $\mathcal{M} = \{(x_i, h_i) \mid i = 1, \ldots, n\}$, $M$ $(= \|P\| + 20)$, error $\delta$.
- Construct $P$ such that $|\,\|P\| + 20 - M\,| \le \delta$ and $h(S(P))$ is maximized.
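The objective $h(S(P))$ can be evaluated directly for a candidate peptide, which is useful for checking a sequencing result. The sketch below assumes nominal residue masses and the ion offsets given in the definitions of $B(x)$ and $Y(x)$; the function names and the toy spectrum are made up for illustration.

```python
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
        "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163,
        "W": 186}


def B(x):
    """Ions related to a b-ion of mass x: a, b-H2O, b-NH3, b, c."""
    return {x - 28, x - 18, x - 17, x, x + 17}


def Y(x):
    """Ions related to a y-ion of mass x: y-H2O, y-NH3, y, x."""
    return {x - 18, x - 17, x, x + 26}


def theoretical_peaks(peptide):
    """S(P): union of B(b_i) and Y(y_i) for i = 1..n-1."""
    peaks, b, y = set(), 1, 19
    n = len(peptide)
    for i in range(1, n):
        b += MASS[peptide[i - 1]]
        y += MASS[peptide[n - i]]
        peaks |= B(b) | Y(y)
    return peaks


def h(ion_set, spectrum, delta):
    """Sum of heights of spectrum peaks within delta of some ion in ion_set.

    Each experimental peak is counted once, even if it matches several ions.
    """
    return sum(height for x, height in spectrum
               if any(abs(x - t) <= delta for t in ion_set))


# Toy spectrum for PAK: b1, b1-NH3, y1 (slightly off), y1+26, and noise at 300.
spectrum = [(98.0, 10.0), (81.0, 4.0), (147.1, 30.0), (173.0, 8.0), (300.0, 5.0)]
score = h(theoretical_peaks("PAK"), spectrum, delta=0.5)
print(score)
```

De novo sequencing then amounts to searching over all peptides of the right mass for the one that maximizes this score, without enumerating them explicitly.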
64Chummy Pairs
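- Let $P = a_1 a_2 \ldots a_n$.
- For a prefix $A = a_1 \ldots a_k$, the set of ions caused by $A$ is
- $S_N(A) = \bigcup_{i=1}^{k} [B(\|a_1 \ldots a_i\|_b) \cup Y(M - \|a_1 \ldots a_i\|_b)]$
- For a suffix $A' = a_{n-k+1} \ldots a_n$, the set of ions caused by $A'$ is
- $S_C(A') = \bigcup_{i=1}^{k} [Y(\|a_{n-i+1} \ldots a_n\|_y) \cup B(M - \|a_{n-i+1} \ldots a_n\|_y)]$
- If $P = A a A'$, then $S(P) = S_N(A) \cup S_C(A')$
- Definition. $(A, A')$ is called a chummy pair if
- $\|A\|_b + \|A'\|_y < M$
- and either of the following holds
- $\|A_1 \ldots A_{|A|-1}\|_b < \|A'\|_y \le \|A\|_b$ (7)
- $\|A'_2 \ldots A'_{|A'|}\|_y \le \|A\|_b < \|A'\|_y$ (8)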
65The Lemmas
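- Lemma 1. Let $(A, A')$ be a chummy pair and
- $f(u, v, w) = h\big([B(u) \cup Y(M - u)] \setminus [B(v) \cup Y(M - v) \cup Y(w) \cup B(M - w)]\big)$. Then
- (i) if $(Aa, A')$ is a chummy pair, then
- $h(S_N(Aa) \cup S_C(A')) = h(S_N(A) \cup S_C(A')) + f(\|Aa\|_b, \|A\|_b, \|A'\|_y)$
- (ii) if $(A, aA')$ is a chummy pair, then
- $h(S_N(A) \cup S_C(aA')) = h(S_N(A) \cup S_C(A')) + f(M - \|aA'\|_y, M - \|A'\|_y, M - \|A\|_b)$
- Proof.
- To prove (i), by definition,
- $S_N(Aa) \cup S_C(A') = S_N(A) \cup S_C(A') \cup B(\|Aa\|_b) \cup Y(M - \|Aa\|_b)$. (13)
- By the chummy pair definition, we have
- $B(\|Aa\|_b) \cap [S_N(A_1 \ldots A_{|A|-1}) \cup S_C(A'_2 \ldots A'_{|A'|})] = \emptyset$
- $Y(M - \|Aa\|_b) \cap [S_N(A_1 \ldots A_{|A|-1}) \cup S_C(A'_2 \ldots A'_{|A'|})] = \emptyset$
- From the above two formulas, we have
- $[S_N(A) \cup S_C(A')] \cap [B(\|Aa\|_b) \cup Y(M - \|Aa\|_b)] \subseteq B(\|A\|_b) \cup Y(M - \|A\|_b) \cup Y(\|A'\|_y) \cup B(M - \|A'\|_y)$ (14)
- (13) and (14) prove (i). The proof of (ii) is similar.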
66- Lemma 2. Let $(A, A')$ be a chummy pair and $\|A\|_b + \|A'\|_y + \|a\| < M$.
- (i) If $\|A\|_b < \|A'\|_y$, then $(Aa, A')$ is a chummy pair and $(A, aA')$ is not.
- (ii) If $\|A'\|_y \le \|A\|_b$, then $(A, aA')$ is a chummy pair and $(Aa, A')$ is not.
- Proof. Draw pictures using the definition.
- Lemma 3. Let $(A, A')$ be a chummy pair. Then exactly one of $(A_1 \ldots A_{|A|-1}, A')$ and $(A, A'_2 \ldots A'_{|A'|})$ is a chummy pair.
- Proof. Draw pictures using the definition, (7) and (8).
- Lemma 4. Let $P$ be the optimal solution. Then there is a chummy pair $(A, A')$ and an amino acid $a$ so that $P = A a A'$.
- Proof. Let $P = a_1 a_2 \ldots a_m$. By Lemma 2, we find $(A, A')$ by
- 1. Let $A$ and $A'$ be the empty string
- 2. for $i = 1$ to $m - 1$ do
- if $\|A\|_b < \|A'\|_y$ then $A = A a_{|A|+1}$
- else $A' = a_{m-|A'|} A'$.
- Now by Lemma 4 and $S(P) = S_N(A) \cup S_C(A')$ for $P = A a A'$, to find an optimal solution we look for a chummy pair s.t.
- There is a letter $a$ with $|\,\|A\|_b + \|A'\|_y + \|a\| - M\,| \le \delta$
- $h(S_N(A) \cup S_C(A'))$ is maximized.
- Let $DP(x, y)$ be the max value of $h(S_N(A) \cup S_C(A'))$ for all chummy pairs such that $\|A\|_b = x$ and $\|A'\|_y = y$.
67Algorithm Sandwich
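- Input: peak list $\mathcal{M}$, mass value $M$, error bound $\delta$, calibration $\Delta$
- Output: peptide $P$ s.t. $h(S(P))$ is max and $|\,\|P\| + 20 - M\,| \le \delta$.
- Initialize $DP[i, j] = -\infty$; $DP[1, 19] = 0$
- for $x = 1$ to $M/2 + \max_a \|a\|$ step $\Delta$ do
-   for $y = x - \max_a \|a\|$ to $\min(x + \max_a \|a\|, M - x)$ step $\Delta$ do
-     for $a$ in $\Sigma$
-       if $x < y$
-         $DP[x + \|a\|, y] = \max(DP[x + \|a\|, y],\ DP[x, y] + f(x + \|a\|, x, y))$
-       else
-         $DP[x, y + \|a\|] = \max(DP[x, y + \|a\|],\ DP[x, y] + f(M - y - \|a\|, M - y, M - x))$
- Compute the best $DP[x, y]$ for all $x, y, a$ satisfying $|x + y + \|a\| - M| \le \delta$.
- Compute the best $A$, $A'$ and $a$ by backtracking, output $A a A'$.
- Note: the max in the above DP is needed, as $x + \|a\|$ may equal $x' + \|a'\|$.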
68PEAKS vs Lutefisk with 3rd party data
Red means correct. PEAKS is 20 times
faster.
69Side Topic Biomarkers from mass spec data
- SELDI
- Surface Enhanced Laser Desorption/Ionization. It combines chromatography with TOF-MS.
- Advantages of SELDI technology
- Uses small amounts of sample (biopsies, microdissected tissue).
- Quickly obtains protein maps from multiple samples under the same conditions.
- Ideal for discovering biomarkers quickly.
70ProteinChip Arrays
71SELDI Process
Copied from http://www.bmskorea.co.kr/new01_21-1.htm
72 Protein mapping
(Figure: protein maps; labels C, C, N, N, C.)
73Biomarker Discovery
- Markers can be easily found by comparing protein maps.
- SELDI is faster and more reproducible than 2D PAGE.
- Has been used to discover protein biomarkers of diseases such as ovarian cancer, breast cancer, and prostate and bladder cancers.
(Figure: protein maps of normal and cancer samples. Modified from the Ciphergen Web site.)
74Inferencing biomarkers
- Inference using SVM
- Decision list
- Other learning tools
75Assignment 2
- Prove Lemma 1 case (ii). (Note the change: Lemma 2 is changed to 1.)
- Give a complete proof for Lemma 3.
- Can you improve the sandwich algorithm to take internal fragmentation ions into consideration?
76Term Project Options
- Investigate the de novo sequencing problem when there are internal fragmentations and post-translational modifications (PTMs) in the peptide.
- Biomarker inference algorithms (need to use real data here, download from web).
- When two peptides are in the same spectrum, design a good algorithm to separate and sequence them.
- Study good ways to deal with PTMs in database search, avoiding exponential growth in number of PTMs.
77Open Questions
- Polynomial time de novo sequencing algorithm with internal fragmentations and PTMs.
- Effective de novo sequencing algorithm for low grade (ion trap) data.
- Less important but cute: given a (peptide) mass M, find a segment in a protein database with mass M, in time shorter than O(n/log n). (Mark Cieliebak's question.) Preprocessing is allowed, but should not use too much memory.
78Acknowledgement
- Thanks to Tim Guo, Gilles Lajoie and Bin Ma ---
some materials are adapted from their notes and
papers.