Title: Experimental
1Experimental Bioinformatic Tools for Proteomics
- Steve Oliver
- Professor of Genomics
- Faculty of Life Sciences
- The University of Manchester
- http://www.cogeme.man.ac.uk
- http://www.bioinf.man.ac.uk
2Functional Genomics
Level of Analysis: Genome
- Definition: Complete set of genes of an organism or its organelles.
- Status: Context-independent (modifications to the yeast genome may be made with exquisite precision).
- Method of Analysis: Systematic DNA sequencing.
Level of Analysis: Transcriptome
- Definition: Complete set of mRNA molecules present in a cell, tissue or organ.
- Status: Context-dependent (the complement of mRNAs varies with changes in physiology, development or pathology).
- Method of Analysis: Hybridisation arrays. SAGE. High-throughput Northern analysis.
Level of Analysis: Proteome
- Definition: Complete set of protein molecules present in a cell, tissue or organ.
- Status: Context-dependent.
- Method of Analysis: 2-D gel electrophoresis. Peptide mass fingerprinting. Two-hybrid analysis.
Level of Analysis: Metabolome
- Definition: Complete set of metabolites (low molecular weight intermediates) present in a cell, tissue or organ.
- Status: Context-dependent.
- Method of Analysis: Infra-red spectroscopy. Mass spectrometry. Nuclear magnetic resonance spectrometry.
3GENOME
TRANSCRIPTOME
PROTEOME
METABOLOME
4Proteomics
- Separation
- Identification
- Quantitation
- Bioinformatics
5Complex mixture analysis
genome
knowledge prediction
peptide mass database
virtual proteome
post-translational modification
Bioinformatics Identification
real proteome
separation methods
2D-gels, functional separations, n-dimensional chromatography
digest
complex peptide map fingerprint
complex mixtures subsets
digest
simple peptide map fingerprint
simple mixtures single proteins
7Peptide mass fingerprinting
denature
KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCLPVNTFVHE
SLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYK
TTQANKHIIVACEGNPYVPVHFDASV
digest (trypsin)
KETAAAK FER QHMDSSTSAASSSNYCNQMMK SR
NLTK DR CLPVNTFVHESLADVQAVCSQK
NVACK NGQTNCYQSYSTMSITDCR ETGSSK
YPNCAYKTTQANK
HIIVACEGNPYVPVHFDASV
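Peptide masses: m1 to m12 (one per peptide of the digest)
mass spectrometry
[Figure: mass spectrum of the digest, abundance against mass, with peaks labelled by peptide mass (m1, m7, m9, m10, m11, m12)]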
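The digest-and-weigh idea on this slide can be sketched in Python. This is a minimal sketch under stated assumptions, not the method of any particular search engine: it assumes the textbook trypsin rule (cleave after K or R, but not before P), no missed cleavages, and standard monoisotopic residue masses. The function names are illustrative. The slide's idealised digest keeps some K-containing fragments whole (for example KETAAAK), so the strict rule gives a slightly different peptide list.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # added for the singly protonated ion, MH+

# Sequence as printed on the slide.
SLIDE_SEQUENCE = (
    "KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCLPVNTFVHE"
    "SLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYK"
    "TTQANKHIIVACEGNPYVPVHFDASV"
)


def tryptic_digest(sequence):
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def mh_plus(peptide):
    """Monoisotopic m/z of the singly protonated peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


if __name__ == "__main__":
    # The resulting list of masses is the "fingerprint" searched against a database.
    for pep in tryptic_digest(SLIDE_SEQUENCE):
        print(f"{mh_plus(pep):10.4f}  {pep}")
```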
8Proteomic applications
- Quantitative Proteomics
- Expression proteomics
- protein levels under different conditions/times
- Qualitative Proteomics
- Identification proteomics
- protein-protein interactions
- post-translational modifications
9A MASS SPECTROMETER MEASURES THE MW.
...A MS ANALYSIS GIVES THE
MASS-TO-CHARGE RATIO (m/z) FOR IONS IN GAS PHASE.
Brancia FL, Trieste, 12/02/2004
10What is a mass spectrometer...?
11Pumping system
vacuum
ION SOURCE (ion generation)
ANALYZER (mass analysis)
Sample introduction
Detector
Data Processing
12Various ionisation methods
- Electron impact ionisation (1919, A.J. Dempster)
- Chemical ionisation, CI
- Fast atom bombardment, FAB (1981, M. Barber)
- Matrix-assisted laser desorption ionisation, MALDI (1988, K. Tanaka; M. Karas and F. Hillenkamp)
- Electrospray, ES (1985, J. Fenn)
13Soft Ionisation Techniques
- "Soft" refers to the low amount of energy imparted into the analyte during ionisation. Too much internal energy will result in fragmentation. Soft ionisation techniques form intact molecular or pseudo-molecular (MH+) ions.
- Matrix-assisted laser desorption ionisation (MALDI)
- Electrospray (ES)
15Electrospray (ES)
17The principal outcome of the electrospray process
is the transfer of analyte species, generally
ionised in condensed phase, into the gas phase as
isolated entities
HV
Aerosol of charged droplets
Gaskell SJ, Journal of Mass Spectrometry 1997
18 ES spectrum of Rho protein
Rho Protein 47004.33 Da
[M+56H]56+
[M+50H]50+
Courtesy of Dr Matt Openshaw
19Electrospray (ES)
$[M+56H]^{56+}$ = 840.3 m/z
Therefore, $M = (840.3 \times 56) - 56 = 47000.8$ Da
Deconvolution: takes all the multiply charged ions and converts them into a spectrum on a mass (Da) scale, i.e. works out what the molecular weight is most likely to be.
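The arithmetic on this slide generalises to any charge state. The sketch below is illustrative only: it assumes every charge comes from an added proton, and the function names are made up for the example. The slide rounds the proton mass to 1 Da; the default here is 1.00728 Da. The second function uses the standard relation for two adjacent peaks of the same molecule whose charges differ by one.

```python
PROTON = 1.00728  # Da; the slide rounds this to 1


def neutral_mass(mz, z, adduct=PROTON):
    """Neutral mass M from the m/z of an [M+zH]z+ ion."""
    return mz * z - z * adduct


def charge_from_adjacent_peaks(mz_low, mz_high, adduct=PROTON):
    """Charge z of the higher-m/z peak, given that the lower-m/z peak
    is the same molecule carrying one more proton (charge z + 1)."""
    return round((mz_low - adduct) / (mz_high - mz_low))


if __name__ == "__main__":
    # The slide's example, using its 1 Da approximation for the proton.
    print(neutral_mass(840.3, 56, adduct=1.0))
```

Real deconvolution software combines the estimates from all charge states rather than relying on a single peak; this sketch only shows the per-peak step.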
20ES spectrum after deconvolution
47004.0 Da
21Advantages
- Production of molecular ions from solution
- The ease of coupling with separation techniques (micro LC-MS/MSMS, nano LC-MS/MSMS)
- Production of multiply charged ions
22Matrix Assisted Laser Desorption Ionisation (MALDI)
23Matrix assisted laser desorption ionisation
(MALDI)
α-cyano-4-hydroxycinnamic acid (CHCA)
2,5-dihydroxybenzoic acid (DHB)
trans-3,5-dimethoxy-4-hydroxycinnamic acid (sinapinic acid, SA)
Typically used with a nitrogen laser (337 nm)
24MALDI is an efficient desorption ionisation
technique for producing gaseous ions from a solid
sample by laser pulses
MH+
25Matrix Assisted Laser Desorption/Ionisation (MALDI)
Unlike ES, MALDI forms predominantly singly charged ions, e.g. MH+ or adducts (sodium MNa+ or potassium MK+).
Sodium = 23 amu, Potassium = 39 amu
[Figure: spectrum showing MH+, with MNa+ at +22 m/z and MK+ at +38 m/z relative to MH+]
26Why is the matrix so important?
- Matrix is necessary to dilute and disperse the analyte
- It functions as energy mediator for ionising the analyte itself or other neutral molecules
- It forms an activated state produced by photoionisation
27Advantages
- MALDI primarily creates singly charged ions (MH+)
- Less sensitive to contaminants
- Sensitivity at femtomole level
- High throughput analysis
28Time-of-flight (ToF) mass spectrometer
$mv^2/2 = zV$
$t^2 = \frac{m}{z}\left(\frac{d^2}{2V}\right)$
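The time-of-flight relation can be evaluated numerically. In this sketch the symbols are taken to have their usual meaning (m the ion mass, z its charge, V the accelerating voltage, d the field-free drift length, t the flight time); the slide does not define them, so that reading is an assumption. The drift length and voltage used in the example are arbitrary illustrative values, not instrument settings from the talk.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # C
DALTON = 1.66053906660e-27           # kg


def flight_time(mz, drift_length_m, accel_voltage_v):
    """Flight time in seconds for an ion of the given m/z (Da per charge).

    Rearranged from m*v^2/2 = z*V with v = d/t, giving
    t = d * sqrt((m/z) / (2*V)) in SI units.
    """
    mass_per_charge = mz * DALTON / ELEMENTARY_CHARGE  # kg per coulomb
    return drift_length_m * math.sqrt(mass_per_charge / (2 * accel_voltage_v))


if __name__ == "__main__":
    # Illustrative values only: 1 m drift tube, 20 kV acceleration.
    for mz in (1000, 2000, 4000):
        print(mz, f"{flight_time(mz, 1.0, 20000) * 1e6:.2f} microseconds")
```

Because t scales with the square root of m/z, quadrupling m/z doubles the flight time.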
29Reflectron-time of flight mass analyser
30MALDI ESI
Sensitivity femtomole 10^-15 M/µl
(...attomole 10^-18 M)
Simplicity very easy training required
70 to 650 k 120 to 650 k
Speed (high throughput) 10^4/day
dynamic system
Selectivity (resolution) >5000
Structural information MSn
MSn
Software ...evaluation in progress.
31Structural information can be achieved by tandem
mass spectrometry
32The tandem mass spectrometry experiment
34PROBLEMS WITH CLASSICAL PROTEOME ANALYSIS
- 1. Not comprehensive
- 2. Not high-throughput
- 3. Destroys protein-protein interactions that provide important clues to function
36Multidimensional protein identification technology (MudPIT)
- Washburn MP, et al. Nat Biotechnol 2001, 19:242-247.
Reverse Phase
SCX
Load complete digest of sample
Develop with gradient and spray directly onto MS/MS
Identified 1500 proteins from yeast, including lower abundance species and membrane proteins
2415 (46%) of Plasmodium genome identified in all 4 stages of parasitic life cycle
37Just Enough Diagnostic Information
38Sidhu KS, Sangavich P, Brancia FL, Sullivan AG,
Gaskell SJ, Wolkenhauer O, Oliver SG, Hubbard
SJ (2001) Bioinformatic assessment of mass
spectrometric chemical derivatisation techniques
for proteome database searching. Proteomics 1,
1368-1377.
39Provide limited sequence information by:
- 1. Identification of N-terminal amino acid by PTC derivatisation
- 2. Use guanidination to identify C-terminus, determine lysine content, and improve signal response
- 3. Specifically fragment next to Asp residues using MALDI-QToF MS
40PTC-derivatisation
- phenylthiocarbamoyl derivative
- Edman chemistry
- N-terminal amino acid
- b1 ion created via low energy collisions
- precursor ion scan gives parents
- increased sensitivity
ms2
peptide ions
ms1
scan for precursors
fixed on b1
collision cell
Spectra collected of all peptides which give rise
to a given b1 ion (implying knowledge of the
N-terminal amino acid)
41Database peptide hits by N-terminal amino acid
Error: 0.5 Da
N-terminal amino acid    Mean number of peptides
ANY                      74.15
W                        1.70
C                        1.77
H                        2.30
M                        3.41
N                        5.61
I                        5.76
E                        6.04
S                        7.18
L                        8.39
I/L                      14.16
Average number of matching proteins in the yeast proteome when searching with a peptide mass in the 1000-2000 Da range. Rare amino acids give a bigger search gain.
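The "search gain" from knowing the N-terminal residue can be put as a simple ratio. The sketch below uses the mean hit counts from this slide; defining gain as hits with any N-terminus divided by hits with a known N-terminus is an assumption made for illustration, and may not be the exact measure used in the cited paper.

```python
# Mean number of database hits per peptide mass (0.5 Da error), from the slide.
MEAN_HITS = {
    "ANY": 74.15, "W": 1.70, "C": 1.77, "H": 2.30, "M": 3.41, "N": 5.61,
    "I": 5.76, "E": 6.04, "S": 7.18, "L": 8.39, "I/L": 14.16,
}


def search_gain(n_terminal):
    """Fold reduction in candidate hits when the N-terminal residue is known."""
    return MEAN_HITS["ANY"] / MEAN_HITS[n_terminal]


if __name__ == "__main__":
    # Rank residues from largest to smallest gain.
    for residue in sorted(MEAN_HITS, key=search_gain, reverse=True):
        if residue != "ANY":
            print(f"{residue:>3}  {search_gain(residue):5.1f}-fold")
```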
42Guanidination of Lysine
lysine + O-methyl isourea -> homoarginine
[Figure: reaction scheme; the lysine side-chain amino group is converted to a guanidino group]
43MALDI spectrum of an enolase tryptic digest
[Figure: spectrum with peaks labelled R or K; six peaks labelled R, three labelled K]
44MALDI spectrum of a tryptic digest of enolase after guanidination
[Figure: spectrum with peaks labelled R or K; five peaks labelled R, nine labelled K]
45Initial set of search peptides and associated
information
Search database, compile protein hit list with
matching peptides
If all initial search peptides masses are
matched, stop, else continue searching
Top-scoring protein is matched. Remove
corresponding peptides from search list
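The iterative search loop on this slide can be sketched as follows. This is a minimal sketch, not the authors' implementation: the score is simply the number of matched masses, the database is a plain dictionary of theoretical peptide masses, and all names and values in the toy example are hypothetical.

```python
def iterative_search(search_masses, database, tol=0.5):
    """Repeatedly pick the top-scoring protein and remove its peptides.

    database maps a protein name to its list of theoretical peptide masses.
    Returns (hit list, unmatched masses).
    """
    remaining = list(search_masses)
    hits = []
    while remaining:  # stop once every search mass has been matched
        best_protein, best_matched = None, []
        for protein, peptides in database.items():
            matched = [m for m in remaining
                       if any(abs(m - p) <= tol for p in peptides)]
            if len(matched) > len(best_matched):
                best_protein, best_matched = protein, matched
        if best_protein is None:  # nothing in the database explains the rest
            break
        hits.append((best_protein, best_matched))
        remaining = [m for m in remaining if m not in best_matched]
    return hits, remaining


if __name__ == "__main__":
    # Toy database with made-up masses.
    toy_db = {"P1": [1000.0, 1500.0, 2000.0], "P2": [1200.0, 1800.0], "P3": [1000.2]}
    print(iterative_search([1000.1, 1500.2, 1200.1, 3000.0], toy_db))
```

Removing the matched masses after each round is what lets a second, lower-abundance protein in the mixture reach the top of the list.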
46Real yeast proteomics
- Alternatives to 2D-gels
- denaturing technology
- low abundance spots difficult to identify
- Many steps of orthogonal 1D-steps
- Size exclusion chromatography
- Ion exchange chromatography
- 1D-gels
47Yeast proteome sample
Before guanidination, labelled peaks (m/z): 795.32, 811.32, 1470.68, 1708.61, 1752.62, 1768.59, 3570.36
After guanidination, labelled peaks (m/z): 795.23, 925.33, 1040.30, 1150.49, 1210.39, 1221.90, 1416.55, 1512.69, 1752.65, 3612.77
[Figure: two MALDI spectra, Mass (m/z) axis from 800 to 3600; some peaks in the second spectrum are marked R or K]
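Guanidination converts each lysine to homoarginine, adding about 42.02 Da per lysine, so comparing the two spectra indicates the lysine content of a peptide. The sketch below pairs the peak labels read from this slide. It is illustrative only: the 0.5 Da tolerance and the limit of three lysines per peptide are assumptions, and pairing peaks by mass shift alone can produce chance matches.

```python
GUANIDINATION_SHIFT = 42.0218  # Da added per lysine (lysine -> homoarginine)

BEFORE = [795.32, 811.32, 1470.68, 1708.61, 1752.62, 1768.59, 3570.36]
AFTER = [795.23, 925.33, 1040.30, 1150.49, 1210.39, 1221.90,
         1416.55, 1512.69, 1752.65, 3612.77]


def pair_peaks(before, after, tol=0.5, max_lysines=3):
    """Map each 'before' peak to (matching 'after' peak, implied lysine count)."""
    pairs = {}
    for b in before:
        for n_lys in range(max_lysines + 1):
            target = b + n_lys * GUANIDINATION_SHIFT
            candidates = [a for a in after if abs(a - target) <= tol]
            if candidates:
                closest = min(candidates, key=lambda a: abs(a - target))
                pairs[b] = (closest, n_lys)
                break  # keep the smallest lysine count that fits
    return pairs


if __name__ == "__main__":
    for peak, (partner, n_lys) in pair_peaks(BEFORE, AFTER).items():
        print(f"{peak:8.2f} -> {partner:8.2f}  lysines: {n_lys}")
```

A peak that does not move (zero lysines) is consistent with an arginine-terminated peptide, while a shift of one unit is consistent with a single, C-terminal lysine.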
48Database search gains
- Standard MALDI, 7 search peptides (before guanidination): 1656 proteins match at least 1 peptide
- Standard MALDI, 12 search peptides (after guanidination): 2549 proteins match at least 1 peptide
- Combined, 19 (7 + 12) search peptides (both experiments): 3235 proteins match at least 1 peptide
49Database search gains: peptides in common
- Search peptides in common (5 from expt 1, 4 from expt 2): only 289 proteins match at least 1 peptide in both experiments
- PTC derivatised, 3 peptides with N-terminal Ile/Leu: only 204 proteins match at least 1 peptide
- All 3 sets of experimental data combined: only 18 proteins match at least 1 peptide in all 3 experiments
52S. cerevisiae 1 protein
S. cerevisiae 2 proteins
53Improved bioinformatics approaches for complex mixtures
[Diagram labels: primary data (input masses); secondary data (experimental proteome data); Database (peptides); Database (proteome, proteins); search engine; rule-based system; protein hit list; protein information; quantitative data; qualitative data; probability; possibility; Final Scores]
54Contextual information
- pI (theoretical, experimental)
- Molecular weight (oligomerisation state)
- Subcellular localisation (known, predicted - PSORT)
- Molecular environment (soluble, membrane, DNA-, actin-associated)
- Post-translational modifications (known, putative, predicted)
- Sequence motifs
- Homology relationships
- Non-native state digestions
55Scoring systems
- Bayesian approach
- k is the hypothesis that the sample protein is protein k,
- D is the mass spec fingerprint data,
- I is the background information,
- $P(k \mid D, I)$ is the posterior probability for k given D and I,
- $P(k \mid I)$ is the prior probability of k given I,
- $P(D \mid I)$ is a normalisation constant
56QUANTITATIVE PROTEOMICS
57DiGE: Difference Gel Electrophoresis
- Ünlü M. et al. (1997). Difference gel electrophoresis: a single gel method for detecting changes in protein extracts. Electrophoresis, 18, 2071-2077.
58Difference Gel Electrophoresis
Samples 1, 2 and 3 (one dye each):
- label with Cy2 in dark, 30 mins at 4 °C
- label with Cy3 in dark, 30 mins at 4 °C
- label with Cy5 in dark, 30 mins at 4 °C
- quench un-reacted dye by adding 1 mM lysine in dark, 10 mins at 4 °C
- 2D gel electrophoresis
59No difference; presence / absence; up / down-regulation
60Stable Isotope Labelling
- In vivo labelling: isotopes introduced during cell culture
Pro / Con
- Cheap / Only works for microbes and cell culture?
- Information rich / Very complex samples
- Con: have to deduce sequence before assigning pairs
[Figure: paired 14N and 15N peaks on an m/z axis]
61 Growth of C.elegans on isotopically labelled
E.coli
E.coli grown on 15N nitrogen source
E.coli grown on 14N nitrogen source
Metabolic labelling of C.elegans
Heavy mutant
Light mutant
Light WT
Heavy WT
Krijgsveld et al. (2003) Nat. Biotech.
Also grew Drosophila on metabolically labelled
yeast
62In vitro labelling - continued
- I: Isotopes introduced during proteolysis: 18O labelled water, C-termini
- II: Guanidinylation of lysine using isotopes of O-methyl isourea: lysine residues
- III: Dimethyl labelling: lysine residues
Pro / Con
- Cheap / Complex peptide mixture
- Universal / Small mass difference on MS
63ICAT: Isotope Coded Affinity Tags
Gygi SP, et al. Nat Biotechnol 1999, 17:994-999.
Isotope Coded Linker: 227 / 236 (9 x 13C) amu
SH-reactive group (Iodoacetamide)
Pros / Cons
- Universal / Protein must contain cysteine
- Simplified sample
64ICAT method
Biotin
Linker (heavy or light)
Thiol-specific reactive group
Gygi S, Rist B et al. (1999) Nature Biotech. 17
994.
65Control sample
Test sample
Denature (SDS) and reduce (TCEP)
Label with heavy reagent
Label with light reagent
Pool Samples
66Purify labelled peptides using avidin column
Digest overnight with trypsin
Cleave biotin portion of the tag with
concentrated TFA
LC-MSMS
68 iTRAQ
69Ross P. et al. Mol Cell Proteomics. 2004 Sep 22
70WORKFLOW
- reduce, alkylate (cysteine block) and digest protein sample with trypsin as usual
- label each sample (max of 4) with a different iTRAQ reagent; 100 µg of protein is optimal
- combine all iTRAQ labeled samples to one sample mixture
- clean up sample by cation-exchange chromatography
- for complex sample mixtures, pre-fractionation is achieved by using a high-resolution cation-exchange column
- analyze the mixture by LC/MS/MS
- results are analysed by Pro Quant Software
72PROTEIN TURNOVER The missing dimension of
proteomics
JM Pratt, J Petty, I Riba-Garcia, DHL Robertson,
SJ Gaskell, SG Oliver, RJ Beynon (2002) Molec.
Cell. Proteomics 1, 579-591.
73Experimental Approach
Dilution rate 0.1 h^-1, half-time 6.9 h
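The half-time quoted here follows from first-order kinetics, where the half-time is ln 2 divided by the rate constant. The sketch below checks the slide's figures; treating the dilution rate as a first-order rate constant for loss from the culture is the standard chemostat assumption, and the function names are illustrative.

```python
import math


def half_time(rate_per_hour):
    """Half-time (h) of a first-order process with the given rate constant."""
    return math.log(2) / rate_per_hour


def fraction_remaining(rate_per_hour, hours):
    """Fraction of the original material left after the given time."""
    return math.exp(-rate_per_hour * hours)


if __name__ == "__main__":
    print(f"{half_time(0.1):.2f} h")  # dilution rate of 0.1 per hour
```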
74Pratt et al., Figure 3
[Figure: mass spectra at 100, 50 and 0 d9, with peaks labelled by leucine content (L0, L1, L2, L3). Peak labels appearing in the figure (m/z): 1119.8, 1119.9, 1317.8, 1336.2, 1440.0, 1444.9, 1454.1, 1467.3, 1668.0, 1686.3, 1747.1, 1768.2, 1795.4, 2039.2, 2057.5, 2327.2, 2336.5]
7527 Da (3 Leu)
9 Da (1 Leu)
0h 4h 6h 8h 12h 25h 51h
76Pratt et al., Figure 3
[Figure: RIAt (0 to 1) against Time (h) for NADP-glutamate dehydrogenase (GDH) (3 peptides), Hsp26 (2 peptides), pyruvate decarboxylase (PDC) (4 peptides) and Hsp71 (4 peptides); bar chart of kloss (h^-1) ± SEM, axis 0 to 0.16, for NADP-GDH, Hsp26, Hsp71 and PDC]
77Pratt et al., Figure 5
[Figure: distribution (%) of degradation rate constants in the bins <0.01, 0.01-0.02, 0.02-0.03, 0.03-0.04 and >0.04 h^-1; degradation rate constant (h^-1) ± SEM by protein (spot ID)]
78INTEGRATION
79Evaluating protein-interaction data
- von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417, 399-403.
- Cornell M, Paton NW, Oliver SG (2004) A critical and integrated view of the yeast interactome. Comp. Funct. Genom. 5, 382-402.
81Schematic representation of the two hybrid system
in case of interaction of protein A and B
activation D
B
A
RNA POL II
DNA-binding D
reporter gene
UAS
Gene expression
82Schematic representation of the two hybrid system
in absence of interaction of protein A and B
activation D
B
RNA POL II
A
NO TRANSCRIPT
DNA-binding D
reporter gene
UAS
84Synthetic lethals
Definition: lethality is caused by mutating two or more genes
[Diagram: two panels, "Single essential pathway" and "Functionally overlapping pathways", drawn with genes gene1 to gene5 and geneA to geneC]
85Asparagine-linked Glycosylation
Dolpp-GlcNAc2Man9Glc3 (Substrate)
(ALG genes are responsible for the core
synthesis)
Asn-NH-GlcNAc2Man9Glc3 (within the sequon Asn-X-Ser/Thr)
STT3, OST1, WBP1, OST3, OST6, SWP1, OST2, OST5, OST4
Asn-NH2 (within the sequon Asn-X-Ser/Thr)
alg mutations are synthetically lethal
with conditional mutation affecting
oligosaccharyltransferase activity
86Integrating complex data with yeast two-hybrid
data
Complex consists of six proteins: A, B, C, D, E, F
In a yeast two-hybrid experiment, A interacts with another protein
Is it B, C, D, E or F?
[Diagram: complex of proteins A to F]
87Large-scale interaction data and the distribution
of interactions according to functional
categories.
88Quantitative comparison of interaction datasets.
89Set of confirmed Y2H interactions
Confirmation of an interaction requires
- Identification in more than one Y2H screen, OR
- The reverse interaction must have been identified, OR
- The two proteins must have been identified in the same protein complex (from either classical or high-throughput affinity purification studies).
A total of 451 reliable interactions, involving 581 proteins, have been identified from a combined data set comprising 5214 interactions and 4025 proteins
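The three confirmation criteria above can be expressed as a filter over interaction data. This is a minimal sketch, not the procedure used in the cited study: the data structures (sets of bait/prey pairs per screen, sets of proteins per complex) and the toy protein names are hypothetical.

```python
def confirmed_interactions(screens, complexes):
    """Return the set of confirmed interactions as unordered protein pairs.

    screens:   list of sets of (bait, prey) tuples, one set per Y2H screen
    complexes: list of sets of proteins found together in a complex
    """
    all_pairs = set().union(*screens)
    confirmed = set()
    for bait, prey in all_pairs:
        in_several_screens = sum((bait, prey) in s for s in screens) > 1
        reverse_seen = bait != prey and (prey, bait) in all_pairs
        same_complex = any(bait in c and prey in c for c in complexes)
        if in_several_screens or reverse_seen or same_complex:
            confirmed.add(frozenset((bait, prey)))
    return confirmed


if __name__ == "__main__":
    # Toy data: A-B seen twice, C-D seen in both directions, E-F share a complex.
    screens = [{("A", "B"), ("C", "D"), ("E", "F")},
               {("A", "B"), ("D", "C"), ("G", "H")}]
    complexes = [{"E", "F", "X"}]
    print(confirmed_interactions(screens, complexes))
```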
91PEDRo: A Systematic Approach to Modelling, Capturing and Disseminating Proteomics Data
- Taylor CF, Paton NW, Garwood KL, Kirby PD, Stead DA, Yin Z, Deutsch EW, Selway L, Walker J, Riba-Garcia I, Mohammed S, Deery MJ, Howard JA, Dunkley T, Aebersold R, Kell DB, Lilley KS, Roepstorff P, Yates JR III, Brass A, Brown AJP, Cash P, Gaskell SJ, Hubbard SJ, Oliver SG (2003) Nature Biotechnol. 21, 247-254.
- Garwood K, McLaughlin T, Garwood C, Joens S, Morrison N, Taylor CF, Carroll K, Evans C, Whetton AD, Hart S, Stead D, Yin Z, Brown AJP, Hesketh A, Chater K, Hansson L, Mewissen M, Ghazal P, Howard J, Lilley KS, Gaskell SJ, Brass A, Hubbard SJ, Oliver SG, Paton NW (2004) PEDRo: A database for storing, searching and disseminating experimental proteomics data. BMC Genomics 5, 68. doi:10.1186/1471-2164-5-68.
92Proteomics: the state of play
- The volume of generated proteome data is rapidly increasing
- Movement towards high-throughput approaches
- Experimental techniques increasing in complexity
- Analyses also increasing in complexity
- Current publicly available proteomics data is limited
- 2D-gel image databases (e.g. SWISS-2DPAGE) contain little information about sample preparation, or analysis of results
- No widely used databases of mass spectrometry data or analyses
- A robust, future-proofed, standard representation of both methods and data from proteomics experiments is required
- Analogous to the MIAME guidelines for transcriptomics
- Users will know what to expect from datasets (formats etc.)
- Will facilitate handling, exchange and dissemination of data
- Will guide the development of effective search/analysis tools
93PEDRo and PEML
- The PEDRo (Proteome Experiment Data Repository) model
- Specifies the information required about a proteomics experiment
- sufficient information to exactly replicate that experiment
- Organised in a manner reflecting the procedures that generated it
- Flexible enough to accommodate new technological developments
- Described in UML (Unified Modeling Language), making it implementation-independent (effectively a generic blueprint)
- Implemented in SQL (the relational database repository)
- Also implemented in Java (later slide), and XML (next bullet)
- PEML (Proteomics Experiment Markup Language)
- The XML implementation of PEDRo for data exchange and rapid dissemination (using XSLT to display PEML files as web pages)
- Two benefits arising from early implementation of the model
- Implementation allows the underlying technologies to be tested
- Making explicit what data might most usefully be captured about proteomics experiments will speed the model's evolution
94The nature of proteomics experiment data
- Sample generation
- Origin of sample
- hypothesis, organism, environment, preparation, paper citations
- Sample processing
- Gels (1D/2D) and columns
- images, gel type and ranges, band/spot coordinates
- stationary and mobile phases, flow rate, temperature, fraction details
- Mass Spectrometry
- machine type, ion source, voltages
- In silico analysis
- peak lists, database name and version, partial sequence, search parameters, search hits, accession numbers
95The PEDRo UML schema in reduced form
97The Framework Around PEDRo
- Lab generated data is encoded using the PEDRo data entry tool, producing an XML (PEML) file for local storage, or submission
- Locally stored PEML files may be viewed in a web browser (with XSLT), allowing web pages to be quickly generated from datasets
- Upon receipt of a PEML file at the repository site, a validation tool checks the file before entering it into the database
- The repository (a relational database) holds submitted data, allowing various analyses to be performed, or data to be extracted as a PEML file or another format
98The PEDRo Data Collator
- The tool with which a user enters information about, and data from, proteomics experiments
- The tool collates these data into a single PEML file
- The hierarchical nature of the PEDRo schema (and PEML) is reflected in the structure of the data entry tool
- Successive stages of the experimental design are added as children of the previous stage
- Enforces an audit trail for data, e.g. details of a gel cannot be entered without first describing the sample
- A simple, filterable list of all the subrecords present and a tree-style browser act as index and contents for the PEML file being edited
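The parent-before-child rule described above can be illustrated with a small tree structure. This sketch is hypothetical throughout: the record kinds and their order are invented for illustration and are not the PEDRo schema, and the class is not part of the PEDRo Data Collator.

```python
class Record:
    """One stage of an experiment; children can only hang off the right parent."""

    # Hypothetical stage order, for illustration only (not the PEDRo schema).
    REQUIRED_PARENT = {"sample": None, "gel": "sample",
                       "spot": "gel", "mass_spec": "spot"}

    def __init__(self, kind, description, parent=None):
        if kind not in self.REQUIRED_PARENT:
            raise ValueError(f"unknown record kind: {kind}")
        required = self.REQUIRED_PARENT[kind]
        found = parent.kind if parent is not None else None
        if required != found:
            # Audit trail: e.g. a gel cannot exist without its sample.
            raise ValueError(f"{kind} requires parent {required}, got {found}")
        self.kind = kind
        self.description = description
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def index(self, depth=0):
        """Indented listing of this record and everything beneath it."""
        lines = ["  " * depth + f"{self.kind}: {self.description}"]
        for child in self.children:
            lines.extend(child.index(depth + 1))
        return lines


if __name__ == "__main__":
    sample = Record("sample", "yeast extract")
    gel = Record("gel", "2D gel", parent=sample)
    Record("spot", "spot 17", parent=gel)
    print("\n".join(sample.index()))
```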
100Conclusions
- The PEDRo model does require a substantial amount of data
- Much of this information will be available in the lab of origin
- Some data will be common to many experiments, and therefore need only be entered once, then saved as a template in PEDRoDC
- But there are several advantages to adopting such a model
- All datasets will contain information sufficient to quickly establish the provenance and relevance (to the researcher) of a dataset
- Datasets will be detailed enough to allow non-standard searches, for example, by sample extraction technique
- Tools can be developed that allow easy access to large numbers of such datasets, from a wide range of proteomics sites
- Integration with other resources, such as the major sequence databases, will provide sophisticated search and analysis capability
- Information exchange between researchers will be facilitated through the use of a common language (PEML), and the ability to rapidly display PEML-encoded data as a web page