Chapter 2. Protein identification from MS/MS data

Provided by: monodUw

1
Chapter 2. Protein identification from MS/MS data
2

The most beautiful theory is one that comes from,
and is applied to, real life.

3
From Genome to Drugs
Post-genome: gene/protein function analysis; protein identification and analysis by mass spectrometry and microarrays
ADMET: ADMET prediction
Drug discovery: computational chemistry, protein 3D structure prediction, receptor-ligand analysis

4
Flow of Biological Information
DNA
mRNA
Proteins
Modified Proteins
Genome
Transcriptome
Proteome

Proteome: the entire protein complement expressed by a genome, or by a cell or tissue type
7
Challenge of Proteomics
Much more complex than genomics: proteins are expressed at different levels, at different times, and in different forms (20 or more).
Yeast: 6,127 genes, 2,000 of unknown function
C. elegans (worm): 19,000 genes
Human genome: 30,000-35,000 genes, 20,000(?) of unknown function
8
Goals of Proteomics
- Identification of each protein
- Levels of protein expression (these do not always correlate with mRNA levels)
- Post-translational modifications (PTMs): types and sites
- Protein/protein interactions: molecular machines
- Protein localization (organellar proteomics, cell map)
- Protein function in normal, stressed, and disease states
9
Challenges of Analyzing Proteins
Size: from 50 to 100,000 a.a.; 5,000-1,000,000 Da (1 Dalton ≈ 1.66 × 10^-27 kg)
Diversity: in human cells, 30,000 different sequences
Relative abundance: broad dynamic range, 10-1,000,000 copies per cell
Different forms: post-translational modifications

93 forms of tau proteins!!
10
Proteomics Science Market

There are two key proteomics technologies for identifying proteins in cells:
Microarrays (cheap), at the mRNA level
Mass spectrometers, at the protein level
Importance of mass spec technology:
Not all mRNAs are translated
Different levels of expression
PTMs (post-translational modifications)
Mass spec market size: 2 billion / year, ½ of it for protein purposes. Main manufacturers: Micromass, ABI, MDS Sciex, Bruker, Thermo Finnigan.

11
Chemical Composition of Living Matter
27 of 92 natural elements are essential.
Elements in biomolecules (organic matter): H, C, N, O, P, S. These elements represent approximately 92% of dry weight.
Elements occurring as ions: Na+, K+, Mg2+, Cl-, Ca2+
Trace elements: B, F, Al, Si, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, Se, Mo, Sn, I
Fe is involved in the transport of oxygen. Cu and Zn are important at the active site of certain enzymes.
12
Organic Matter
Derivatives of carbon (61% of dry weight), high chemical versatility
Single bonds: C-H, C-C, C-O, C-N, C-S
Double bonds: C=C, C=O, C=N
Triple bonds: C≡C, C≡N (rare in biomolecules)
Often organized in "building blocks":
amino acids → polypeptides (proteins)
monosaccharides → starch, glycogen
nucleic acids: DNA, RNA
13
Mass (Weights) of atoms and molecules
Element  Nominal mass  Exact mass  Abundance (%)  Average mass
C        12            12.00000    98.9           12.011
         13            13.00335    1.1
H        1             1.00783     99.98          1.00794
         2             2.0140      0.02
O        16            15.99491    99.8           15.9994
         18            17.9992     0.2
N        14            14.00307    99.63          14.0067
         15            15.00011    0.37
S        32            31.97207    95.0           32.064
         33            32.97147    0.76
         34            33.96786    4.22
P        31            30.9738     100
14
Amino Acids (20)
MW before losing water
Exact Mass of Amino Acid Residues in Proteins
Note: Leu (L) and Ile (I) have the same residue mass, 113.08410.
15
A table of amino acid masses
16
To identify the amino acid sequence of a protein
from a tissue, the tissue is first reduced to a
fraction, typically an organelle. This organelle
may contain 500-2000 different proteins, with
many copies of each protein. The fraction is
then run on a 1D or 2D gel to separate the
proteins further. The 2D gel separates the
proteins by mass and charge. Typically, each
spot in the 2D gel contains many copies of one
protein. Each spot is excised from the gel and
digested with trypsin. Trypsin cleaves the
protein after every occurrence of the amino acids
arginine (R) and lysine (K). The resulting
protein pieces are called peptides.
17
Typical 2D Gel
18
Cleavage with trypsin (tryptic digestion)
Trypsin cleaves only at the peptide bond between $R_{n-1}$ and $R_n$, where $R_{n-1}$ = Lys or Arg and $R_n \neq$ Pro.
Xxx-Xxx-Xxx-Xxx-Xxx-Xxx-Xxx-Xxx- (consecutive residues are labeled $R_{n-1}$, $R_n$, $R_{n+1}$)
Example
Arg | Gly - Phe - Lys | Ile - Ala - Glu - Trp - Met (| marks a trypsin cleavage site)
MW (average mass): 1136
Treatment with trypsin gives 3 fragments:
1. Arg (MW 174)
2. Gly-Phe-Lys (MW 57 + 147 + 128 + 18 = 350)
3. Ile-Ala-Glu-Trp-Met (MW 113 + 71 + 129 + 186 + 131 + 18 = 648)
Note: each internal peptide will end with Lys (K) or Arg (R).
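The digestion rule and the mass arithmetic of the example can be checked with a minimal sketch. The function names `tryptic_digest` and `peptide_mass` and the `proline_rule` flag are illustrative assumptions, and the mass table holds only the integer average residue masses used on this slide.

```python
# Integer average residue masses (Da) for the residues of the slide example only.
AVG_RESIDUE_MASS = {"R": 156, "G": 57, "F": 147, "K": 128, "I": 113,
                    "A": 71, "E": 129, "W": 186, "M": 131}
WATER = 18  # every free peptide carries one extra H2O


def tryptic_digest(sequence, proline_rule=True):
    """Cut after every K or R; with proline_rule, skip sites followed by P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        next_is_pro = (not is_last) and sequence[i + 1] == "P"
        if residue in "KR" and not is_last and not (proline_rule and next_is_pro):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Sum of residue masses plus one water."""
    return sum(AVG_RESIDUE_MASS[r] for r in peptide) + WATER


fragments = tryptic_digest("RGFKIAEWM")
masses = [peptide_mass(p) for p in fragments]
print(fragments, masses)
```

With `proline_rule=False` the same function splits MPSESSYKVHRPAKSGGS into MPSESSYK, VHR, PAK and SGGS, the simplified digestion drawn on later slides; with the proline rule switched on, the Arg-Pro bond is left intact.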
19
Tools for Protein Identification
Mass Spectrometry
Two ionization techniques:
Matrix-Assisted Laser Desorption/Ionization (MALDI)
Electrospray Ionization (ESI)
Both are used with many types of mass analyzers: TOF, quadrupole (Q), Q-TOF, FT-ICR MS, Q-ITOF, etc.
Enzyme digestion: to get smaller fragments
Trypsin (99%), LysC, others
20
The Mass Spectrometer
Ionization: MALDI, ESI
Mass analyzers: time-of-flight, quadrupole, ion trap
21
Mass spectrometer
sample → ionization → ions → mass analyzer → detector → mass spectrum (intensity vs. m/z)
22
MS fingerprint for proteins
23
MS fingerprint for protein
protein: MPSESSYKVHRPAKSGGS
trypsin digestion
peptides: MPSESSYK, VHR, PAK, SGGS
24
A real spectrum for hemoglobin
25
Search a database for match
Database proteins: MPSESSYKVHRPAKSGGS, another protein, ...
In silico digestion of each database protein gives theoretical spectra, which are compared with the real spectrum.
26
Mascot interface
27
Too many matches

There are many peptides in the database with the same mass.
There are many missing peaks in the MS.
There is plenty of noise in the MS.
For each MS, there could be many proteins in the database that match it.

28
Tandem MS
29
Tandem MS
ions → mass analyzer → fragmentation → mass analyzer → detector
30
Tandem MS
MPSESSYKVHRPAKSGGS
Tryptic digestion
MPSESSYK, VHR, PAK, SGGS
fragmentation → MS
31
How does a peptide fragment?
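b-ions (contain the N terminal):
$m(b_1) = 1 + m(A_1)$, $m(b_2) = 1 + m(A_1) + m(A_2)$, $m(b_3) = 1 + m(A_1) + m(A_2) + m(A_3)$
y-ions (contain the C terminal):
$m(y_1) = 19 + m(A_4)$, $m(y_2) = 19 + m(A_4) + m(A_3)$, $m(y_3) = 19 + m(A_4) + m(A_3) + m(A_2)$
Note: the C terminal has an extra H2O, 18 daltons. The a-ion mass is the b-ion mass − 28; the c-ion mass is the b-ion mass + 17.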
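The b- and y-ion formulas can be tried out with a short sketch. It assumes monoisotopic residue masses and uses the exact proton and water masses (1.00728 and 18.01056) in place of the rounded 1 and 18 on the slide; the names `MONO`, `PROTON`, `WATER` and `fragment_ions` are illustrative.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
PROTON = 1.00728   # the "+1" on the slide
WATER = 18.01056   # the "+18" on the slide


def fragment_ions(peptide):
    """Return (b, y): singly charged b_1..b_(n-1) and y_1..y_(n-1) masses."""
    b, prefix = [], 0.0
    for residue in peptide[:-1]:            # prefixes from the N terminal
        prefix += MONO[residue]
        b.append(prefix + PROTON)
    y, suffix = [], 0.0
    for residue in reversed(peptide[1:]):   # suffixes from the C terminal
        suffix += MONO[residue]
        y.append(suffix + WATER + PROTON)
    return b, y


b_ions, y_ions = fragment_ions("VHR")
print([round(m, 3) for m in b_ions], [round(m, 3) for m in y_ions])
```

For a tryptic peptide ending in Arg, the first y-ion of this sketch lands near 175.119, and each pair b_i, y_(n-i) sums to the residue mass of the peptide plus about 20.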
32
Peptide sequencing

A protein/peptide sequence consists of 20 different types of amino acids.
Most amino acids have distinct masses.
Different peptide sequences (400-2000 daltons) will produce different MS/MS spectra.
Peptide sequencing:
Preprocess
Infer the a.a. sequence, de novo or by database search
Deal with PTMs

33
Protein identification after peptide sequencing

Search a database to find the protein that contains the most peptides.
From the peptide sequences (or partial sequences), design primers to PCR the region of the gene that encodes the protein, then sequence the gene.
Assemble the peptide sequences together (not very realistic because of the loss of ions).

34
The MS/MS spectrum
N-term
C-term
35
More ions

Each N-term ion might lose an ammonia: a-NH3, b-NH3, c-NH3
Each C-term ion might lose a water: x-H2O, y-H2O, z-H2O
An ion is sometimes doubly or triply charged: for a doubly charged ion the m/z value is halved.
One peptide can be fragmented twice: internal cleavages, immonium ions.

36
A real MS/MS spectrum with good quality
LGSSEVEQVQLVVDGVK
37
Preprocessing
38
Raw data
39
Preprocess
40
Noise subtraction

Estimate a baseline and subtract the baseline
value from each peak.
constant baseline
linear baseline
quadratic baseline
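Baseline subtraction can be sketched for the constant and linear cases. The slide does not say how the baseline is estimated, so this sketch assumes the median intensity for the constant baseline and an ordinary least-squares line for the linear one; both function names are illustrative, and negative values are clipped to zero.

```python
from statistics import median


def subtract_constant_baseline(intensities):
    """Assumed estimator: the median intensity is the baseline level."""
    base = median(intensities)
    return [max(0.0, h - base) for h in intensities]


def subtract_linear_baseline(mz, intensities):
    """Assumed estimator: least-squares line through all (m/z, intensity) points."""
    n = len(mz)
    mean_x = sum(mz) / n
    mean_y = sum(intensities) / n
    sxx = sum((x - mean_x) ** 2 for x in mz)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mz, intensities))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [max(0.0, y - (slope * x + intercept))
            for x, y in zip(mz, intensities)]


flat = subtract_constant_baseline([1.0, 1.0, 5.0, 1.0, 1.0])
ramp = subtract_linear_baseline([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0])
print(flat, ramp)
```

A quadratic baseline would follow the same pattern with a second-degree fit. Fitting through every point is crude, since real peaks pull the fit upward; a more careful version might fit only local minima.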

41
Isotopic ions

There are two isotopes of carbon. 98.9% of carbon atoms in nature are C-12, whose mass is 12 daltons; 1.1% are C-13, whose mass is 13 daltons.
Because an ion may have 0, 1, 2, …, k C-13s, its mass can be m, m+1, m+2, …, m+k.
The peaks at m+i are called isotopic ion peaks.
Are isotopic peaks very low?

42
Isotopic ions

Each amino acid has several carbons, and each ion has several amino acids. If there are 50 carbons, the probability of having at least one C-13 is
$1 - (0.989)^{50} \approx 0.425$
In this example, the sum of the intensities of the isotopic ion peaks is nearly as high as the intensity of the mono-isotopic peak.
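The 50-carbon example can be reproduced numerically. This sketch assumes that carbon is the only element contributing isotopes and that each carbon is independently C-13 with probability 0.011, so the number of C-13 atoms follows a binomial distribution; `c13_distribution` is an illustrative name.

```python
from math import comb


def c13_distribution(n_carbons, p13=0.011, max_k=4):
    """P(exactly k C-13 atoms) for k = 0..max_k, binomial model."""
    return [comb(n_carbons, k) * p13 ** k * (1 - p13) ** (n_carbons - k)
            for k in range(max_k + 1)]


dist = c13_distribution(50)
p_mono = dist[0]                 # relative height of the mono-isotopic peak
p_at_least_one = 1 - p_mono      # combined weight of all isotopic peaks
print([round(p, 4) for p in dist], round(p_at_least_one, 4))
```

The first entries of `dist` give the expected relative heights of the peaks at m, m+1, m+2 and so on.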

43
Multiply charged ions

An ion with mass m can have more than one charge (proton). If z protons are attached, the m/z value is $(m + z)/z$.
How do we determine the charge state of a peak?
Using isotopic ion peaks.

44
Multiply charged ions

The isotopic ions have masses m, m+1, m+2, …
Therefore, the m/z values are $m/z$, $(m+1)/z$, …
The m/z difference between adjacent isotopic ions, $1/z$, tells the charge state.
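A minimal sketch of reading the charge state off the isotope spacing: adjacent isotopic peaks of a z-charged ion are about 1/z apart in m/z. The function name, the tolerance and the maximum charge are assumptions made for illustration.

```python
def charge_from_isotope_spacing(mz_peaks, max_charge=4, tol=0.02):
    """Guess z from the mean m/z gap of one isotopic cluster; None if unclear."""
    peaks = sorted(mz_peaks)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    if not gaps:
        return None  # a single peak carries no spacing information
    mean_gap = sum(gaps) / len(gaps)
    for z in range(1, max_charge + 1):
        if abs(mean_gap - 1.0 / z) <= tol:
            return z
    return None


print(charge_from_isotope_spacing([500.0, 500.5, 501.0]))
```

The input is assumed to be a single isotopic cluster; picking clusters out of a full spectrum is a separate step not shown here.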

45
Deconvolution

Deconvolution is the procedure to convert
multiply charged peaks to singly charged, and sum
up all the isotopic ion peaks.
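Deconvolution can be sketched as follows, assuming the isotopic clusters and their charge states are already known. The conversion uses the exact proton mass 1.00728; `to_singly_charged`, `deconvolute` and the cluster layout are illustrative choices.

```python
PROTON = 1.00728


def to_singly_charged(mz, z):
    """m/z of the same ion if it carried one proton instead of z protons."""
    return mz * z - (z - 1) * PROTON


def deconvolute(clusters):
    """clusters: list of (z, [(mz, intensity), ...]), one isotopic cluster each.

    Returns one (singly charged mono-isotopic m/z, summed intensity) per cluster.
    """
    result = []
    for z, peaks in clusters:
        mono_mz = min(mz for mz, _ in peaks)          # lowest m/z = mono-isotopic
        total = sum(height for _, height in peaks)    # sum the isotopic peaks
        result.append((to_singly_charged(mono_mz, z), total))
    return sorted(result)


peaks_out = deconvolute([(2, [(500.5, 10.0), (501.0, 6.0)]),
                         (1, [(300.2, 4.0), (301.2, 1.0)])])
print(peaks_out)
```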

46
Other ion types

http://www.matrixscience.com/help/fragmentation_help.html

47
Database search methods

Mascot
SEQUEST

48
Database search methods

For each possible peptide in the database:
1. do an in silico fragmentation to compute the theoretical spectrum;
2. compare the theoretical spectrum with the real spectrum and compute a matching score (this is key).
Output the peptide with the best matching score.
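The database search loop can be illustrated with a toy scoring function. This is not the scoring used by Mascot or SEQUEST: the sketch assumes integer nominal residue masses, considers only b- and y-ions, and scores a candidate by the summed intensity of the experimental peaks it explains. All names are illustrative.

```python
# Nominal (integer) residue masses; I = L and Q = K at this resolution.
NOMINAL = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
           "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
           "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163,
           "W": 186}


def theoretical_spectrum(peptide):
    """In silico fragmentation: the set of b- and y-ion masses."""
    masses = [NOMINAL[r] for r in peptide]
    peaks = set()
    for i in range(1, len(masses)):
        peaks.add(1 + sum(masses[:i]))     # b_i
        peaks.add(19 + sum(masses[i:]))    # y_(n-i)
    return peaks


def match_score(peptide, spectrum, tol=0.5):
    """Toy score: total intensity of peaks within tol of a theoretical peak."""
    theory = theoretical_spectrum(peptide)
    return sum(height for mz, height in spectrum
               if any(abs(mz - t) <= tol for t in theory))


def best_match(candidates, spectrum, tol=0.5):
    return max(candidates, key=lambda p: match_score(p, spectrum, tol))


spectrum = [(98, 1.0), (169, 2.0), (147, 5.0), (218, 3.0), (300, 0.5)]
winner = best_match(["VHR", "PAK", "SGGS"], spectrum)
print(winner, match_score(winner, spectrum))
```

A practical search would first restrict the candidates to peptides whose total mass agrees with the measured precursor mass, and would use a statistically grounded score.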

49
Mascot
50
Weakness of database method

Does not work for unknown proteins
Trouble with PTMs

51
De novo sequencing
52
Spectrum Graph Approach

Each node of the graph represents a peak in the spectrum.
Two nodes are joined by an edge iff the two corresponding peaks are separated by the mass of an amino acid.
A path that connects the two ends corresponds to a feasible solution.

53
Weakness of Spectrum Graph

Missing ions may leave no solution.
Remedy: add new edges that connect two peaks whose distance equals the total mass of two (or more) amino acids.
Noise causes too many feasible paths.
Remedy: add weights to the nodes and edges, and find the longest path.

54
References

Dancík, V., Addona, T., Clauser, K., Vath, J., and Pevzner, P. 1999. De novo protein sequencing via tandem mass-spectrometry. Journal of Computational Biology 6, 327-341.
Chen, T., Kao, M.-Y., Tepel, M., Rush, J., and Church, G. 2001. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. Journal of Computational Biology 8(3), 325-337.
Lutefisk:
Taylor, J.A., and Johnson, R.S. 1997. Rapid Commun. Mass Spectrom. 11, 1067-1075.
Taylor, J.A., and Johnson, R.S. 2001. Anal. Chem. 73, 2594-2604.

55
The Sandwich Algorithm
B. Ma, K. Zhang, C. Liang. An efficient algorithm for peptide de novo sequencing from MS/MS spectrum. CPM'03.
56
Ions

If $P = a_1 a_2 \ldots a_k$, the mass of P is denoted as
$\|P\| = \sum_{j=1}^{k} \|a_j\|$
But the actual mass would be $\|P\| + 18$ because of the extra H2O at the C-terminal end.
When cut, a b-ion's actual mass is its residue mass plus 1 (proton), and a y-ion's actual mass is its residue mass plus 19 (18 plus 1 proton).
Hence, $y_{k-i} + b_i = \|P\| + 20$ for each i.
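The complementarity of b- and y-ions follows in one step from the two mass rules. Here $b_i$ is taken to be the b-ion of the length-i prefix and $y_{k-i}$ the y-ion of the remaining length-(k−i) suffix, which is the reading assumed in this sketch:

```latex
\begin{align*}
b_i &= 1 + \sum_{j=1}^{i} \|a_j\|, \\
y_{k-i} &= 19 + \sum_{j=i+1}^{k} \|a_j\|, \\
y_{k-i} + b_i &= 20 + \sum_{j=1}^{k} \|a_j\| = \|P\| + 20 .
\end{align*}
```

As a quick check with nominal masses, take P = GFK, so $\|P\| = 57 + 147 + 128 = 332$. Then $b_1 = 58$ and $y_2 = 19 + 147 + 128 = 294$, and $58 + 294 = 352 = 332 + 20$.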

57
If we consider only y-ions

Let the sequence be $a_1 a_2 \ldots a_n$; then the y-ions are
$19 + m(a_n)$, $19 + m(a_n) + m(a_{n-1})$, …
The matching score of $a_1 a_2 \ldots a_n$ against the experimental spectrum is defined to be the sum of the intensities of all peaks around the y-ion positions.
De novo sequencing: find the sequence that maximizes the matching score.

58
Matching score
Figure labels (residues annotated on the spectrum): R V L L A N Q F G Y E G L
59
Dynamic programming solution

Define $DP[x]$ to be the maximum matching score caused by a suffix with y-ion mass x.

60
Dynamic programming solution

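If $x = 0$, $DP[x] = 0$.
If $x > 0$, $DP[x] = h(x) + \max_a DP[x - m(a)]$, where $h(x)$ is the sum of the intensities of the peaks around position x.

1. Let m be the peptide mass.
2. For x from 0 to m, with step size Δ: $DP[x] = h(x) + \max_a DP[x - m(a)]$.
3. Find the sequence of amino acids that leads to $DP[m]$ (backtracking), and output the sequence.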
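The y-ion-only dynamic program can be sketched with integer nominal masses and a step size of 1. In this sketch x counts only residue masses, so the y-ion peak for a suffix of residue mass x is looked up at 19 + x; that offset handling, the tolerance and all names are assumptions, not the slide's exact formulation.

```python
# Nominal residue mass -> letter (L stands for L/I, K for K/Q).
RESIDUE_BY_MASS = {57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T",
                   103: "C", 113: "L", 114: "N", 115: "D", 128: "K",
                   129: "E", 131: "M", 137: "H", 147: "F", 156: "R",
                   163: "Y", 186: "W"}


def de_novo_y_only(spectrum, total_residue_mass, tol=0):
    """spectrum: dict m/z -> intensity. Returns (sequence, score) or None."""
    def h(x):
        # Intensity of peaks around the y-ion of a suffix with residue mass x.
        return sum(height for mz, height in spectrum.items()
                   if abs(mz - (19 + x)) <= tol)

    neg_inf = float("-inf")
    dp = [neg_inf] * (total_residue_mass + 1)
    choice = [None] * (total_residue_mass + 1)
    dp[0] = 0.0
    for x in range(1, total_residue_mass + 1):
        gain = h(x)
        for mass, letter in RESIDUE_BY_MASS.items():
            if x >= mass and dp[x - mass] != neg_inf:
                score = dp[x - mass] + gain
                if score > dp[x]:
                    dp[x], choice[x] = score, (mass, letter)
    if dp[total_residue_mass] == neg_inf:
        return None
    # Backtracking: the suffix grows leftwards, so the residue chosen at the
    # full mass is the N-terminal one and letters come out in N-to-C order.
    letters, x = [], total_residue_mass
    while x > 0:
        mass, letter = choice[x]
        letters.append(letter)
        x -= mass
    return "".join(letters), dp[total_residue_mass]


# y-ions of PGS: 19 + 87 = 106 and 19 + 87 + 57 = 163; residue mass 241.
result = de_novo_y_only({106: 5.0, 163: 3.0}, 241)
print(result)
```

Ties are common in practice, for example G + A and K share the nominal mass 128, so a real spectrum may support several sequences with the same score.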
61
If we consider both b- and y-ions

Let $DP[x, y]$ be the matching score caused by a length x prefix and a length y suffix. Then
If xy,
If yx,

62
Formalize

Let $A = a_1 \ldots a_k$. If A is a b-ion, we denote its mass by
$\|A\|_b = 1 + \sum_{i=1}^{k} \|a_i\|$.
If A is a y-ion, we denote its mass by
$\|A\|_y = 19 + \sum_{i=1}^{k} \|a_i\|$.
If x is the mass of a b-ion, the related a-ion and c-ion are at $x - 28$ and $x + 17$, losing water gives $x - 18$, and losing ammonia gives $x - 17$. Denote this set as
$B(x) = \{x-28,\ x-18,\ x-17,\ x,\ x+17\}$
Similarly, if x is the mass of a y-ion, its related ion set is
$Y(x) = \{x-18,\ x-17,\ x,\ x+26\}$
If P is a peptide of length n, the theoretical peaks are
$S(P) = \bigcup_{i=1}^{n-1} \left[ B(b_i) \cup Y(y_i) \right]$
For a set S and a spectrum $\mathcal{M}$, let the set of matched peaks be
$S' = \{ (x_i, h_i) \in \mathcal{M} \mid \text{there is } y \in S \text{ s.t. } |y - x_i| \le \delta \}$
Given S, $h(S) = \sum_{(x,h) \in S'} h$
63
De novo Sequencing Problem

Input: $\mathcal{M} = \{ (x_i, h_i) \mid i = 1, \ldots, n \}$, $M$ $(= \|P\| + 20)$, error $\delta$.
Construct P such that $|\|P\| + 20 - M| \le \delta$ and $h(S(P))$ is maximized.

64
Chummy Pairs

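Let $P = a_1 a_2 \ldots a_n$.
For a prefix $A = a_1 \ldots a_k$, the set of ions caused by A is
$S_N(A) = \bigcup_{i=1}^{k} \left[ B(\|a_1 \ldots a_i\|_b) \cup Y(M - \|a_1 \ldots a_i\|_b) \right]$
For a suffix $A' = a_{n-k+1} \ldots a_n$, the set of ions caused by A' is
$S_C(A') = \bigcup_{i=1}^{k} \left[ Y(\|a_{n-i+1} \ldots a_n\|_y) \cup B(M - \|a_{n-i+1} \ldots a_n\|_y) \right]$
If $P = A a A'$, then $S(P) = S_N(A) \cup S_C(A')$.
Definition. $(A, A')$ is called a chummy pair if
$\|A\|_b + \|A'\|_y < M$
and either of the following holds, where $A_1 \ldots A_{-1}$ denotes A without its last letter and $A'_2 \ldots A'$ denotes A' without its first letter:
$\|A_1 \ldots A_{-1}\|_b < \|A'\|_y \le \|A\|_b$   (7)
$\|A'_2 \ldots A'\|_y \le \|A\|_b < \|A'\|_y$   (8)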

65
The Lemmas

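Lemma 1. Let $(A, A')$ be a chummy pair and
$f(u, v, w) = h\left( [B(u) \cup Y(M-u)] \setminus [B(v) \cup Y(M-v) \cup Y(w) \cup B(M-w)] \right)$. Then
(i) if $(Aa, A')$ is a chummy pair, then
$h(S_N(Aa) \cup S_C(A')) = h(S_N(A) \cup S_C(A')) + f(\|Aa\|_b, \|A\|_b, \|A'\|_y)$
(ii) if $(A, aA')$ is a chummy pair, then
$h(S_N(A) \cup S_C(aA')) = h(S_N(A) \cup S_C(A')) + f(M - \|aA'\|_y, M - \|A'\|_y, M - \|A\|_b)$
Proof.
To prove (i), by definition,
$S_N(Aa) \cup S_C(A') = S_N(A) \cup S_C(A') \cup B(\|Aa\|_b) \cup Y(M - \|Aa\|_b)$   (13)
By the chummy pair definition, we have
$B(\|Aa\|_b) \cap \left[ S_N(A_1 \ldots A_{-1}) \cup S_C(A'_2 \ldots A') \right] = \emptyset$
$Y(M - \|Aa\|_b) \cap \left[ S_N(A_1 \ldots A_{-1}) \cup S_C(A'_2 \ldots A') \right] = \emptyset$
From the above two formulas, we have
$\left[ S_N(A) \cup S_C(A') \right] \cap \left[ B(\|Aa\|_b) \cup Y(M - \|Aa\|_b) \right] \subseteq B(\|A\|_b) \cup Y(M - \|A\|_b) \cup Y(\|A'\|_y) \cup B(M - \|A'\|_y)$   (14)
(13) and (14) prove (i). The proof of (ii) is similar.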

Lemma 2. Let $(A, A')$ be a chummy pair and $\|A\|_b + \|A'\|_y + \|a\| < M$.
(i) If $\|A\|_b < \|A'\|_y$, then $(Aa, A')$ is a chummy pair and $(A, aA')$ is not.
(ii) If $\|A'\|_y \le \|A\|_b$, then $(A, aA')$ is a chummy pair and $(Aa, A')$ is not.
Proof. Draw pictures using the definition.
Lemma 3. Let $(A, A')$ be a chummy pair. Then exactly one of $(A_1 \ldots A_{-1}, A')$ and $(A, A'_2 \ldots A')$ is a chummy pair.
Proof. Draw pictures using the definition, (7) and (8).
Lemma 4. Let P be the optimal solution. Then there is a chummy pair $(A, A')$ and an amino acid a so that $P = A a A'$.
Proof. Let $P = a_1 a_2 \ldots a_m$. By Lemma 2, we find $(A, A')$ by
1. Let A and A' be the empty string.
2. for i = 1 to m−1 do
   if $\|A\|_b < \|A'\|_y$ then $A \leftarrow A\, a_{|A|+1}$
   else $A' \leftarrow a_{m-|A'|}\, A'$.
Now by Lemma 4 and $S(P) = S_N(A) \cup S_C(A')$ for $P = A a A'$, to find an optimal solution we look for a chummy pair such that
there is a letter a with $|\|A\|_b + \|A'\|_y + \|a\| - M| \le \delta$, and
$h(S_N(A) \cup S_C(A'))$ is maximized.
Let $DP(x, y)$ be the max value of $h(S_N(A) \cup S_C(A'))$ over all chummy pairs such that $\|A\|_b = x$ and $\|A'\|_y = y$.

67
Algorithm Sandwich

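Input: peak list $\mathcal{M}$, mass value $M$, error bound $\delta$, calibration $\Delta$.
Output: peptide P s.t. $h(S(P))$ is maximized and $|\|P\| + 20 - M| \le \delta$.
1. Initialize $DP[i, j] = -\infty$; $DP[1, 19] = 0$.
2. for $x = 1$ to $M/2 + \max\|a\|$ step $\Delta$ do
     for $y = x - \max\|a\|$ to $\min(x + \max\|a\|,\ M - x)$ step $\Delta$ do
       for $a$ in $\Sigma$ do
         if $x < y$ then
           $DP[x + \|a\|, y] = \max\{ DP[x + \|a\|, y],\ DP[x, y] + f(x + \|a\|, x, y) \}$
         else
           $DP[x, y + \|a\|] = \max\{ DP[x, y + \|a\|],\ DP[x, y] + f(M - y - \|a\|, M - y, M - x) \}$
3. Compute the best $DP[x, y]$ for all x, y, a satisfying $|x + y + \|a\| - M| \le \delta$.
4. Compute the best A, A' and a by backtracking; output $A a A'$.
Note: the max in the DP above is needed, as $x + \|a\|$ may equal $x' + \|a'\|$.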

68
PEAKS vs Lutefisk with 3rd party data
Red means correct. PEAKS is 20 times
faster.
69
Side Topic Biomarkers from mass spec data

SELDI: Surface Enhanced Laser Desorption/Ionization. It combines chromatography with TOF-MS.
Advantages of SELDI technology:
Uses small amounts of sample (biopsies, microdissected tissue).
Quickly obtains protein maps from multiple samples under the same conditions.
Ideal for discovering biomarkers quickly.

70
ProteinChip Arrays
71
SELDI Process
Copied from http://www.bmskorea.co.kr/new01_21-1.htm
72
Protein mapping
Figure panel labels: C, C, N, N, C
73
Biomarker Discovery

Markers can be easily found by comparing protein maps.
SELDI is faster and more reproducible than 2D PAGE.
It has been used to discover protein biomarkers of diseases such as ovarian cancer, breast cancer, and prostate and bladder cancers.

(Normal)

(Cancer)
(Modified from the Ciphergen web site)
74
Inferring biomarkers

Inference using SVM
Decision list
Other learning tools

75
Assignment 2

Prove Lemma 1 case (ii). (note the change, Lemma
2 is changed to 1)
Give a complete proof for Lemma 3.
Can you improve the sandwich algorithm to take
internal fragmentation ions into consideration?

76
Term Project Options

Investigate the de novo sequencing problem when
there are internal fragmentations and
post-translational modifications (PTMs) in the
peptide.
Biomarker inference algorithms (need to use real
data here, download from web).
When two peptides are in the same spectrum,
design a good algorithm to separate and sequence
them.
Study good ways to deal with PTMs in database
search, avoiding exponential growth in number of
PTMs.

77
Open Questions

A polynomial-time de novo sequencing algorithm that handles internal fragmentations and PTMs.
An effective de novo sequencing algorithm for low-grade (ion trap) data.
Less important but cute: given a (peptide) mass M, find a segment in a protein database with mass M, in time shorter than O(n / log n) (Mark Cieliebak's question). Preprocessing is allowed, but should not use too much memory.

78
Acknowledgement