Title: Computer Analysis of Mass Spectrometry Data
1Computer Analysis of Mass Spectrometry Data
- David Perkins
- Proteomics Section,
- Hammersmith Hospital Campus,
- Imperial College School of Medicine.
- david.perkins_at_imperial.ac.uk
2Introduction
- Background to protein sequencing and identification using mass spectrometry (MS)
- Software and computational techniques for analysis of MS and MS/MS data
4Peptide Mass Fingerprinting
- Simplest form of protein identification (not sequencing)
- Majority of proteins in a sample are identified using this technique
- Involves a simple enzymatic digest of a protein and the measurement of the masses of the resultant peptide fragments
- Works with sample amounts as low as 10 femtomoles (1 fmol = 10^-15 mol)
5Sample Preparation for Peptide Mass Fingerprinting
- Excise band from gel
- Tryptic Digestion of gel fragment
- Supernatant transferred to fresh eppendorf
- Sample transferred to target plate
6Enzymatic Cleavage
[Diagram: Enzyme acting on the Native Protein to give Peptide Fragments]
7Sample Preparation Robot
8MALDI Mass Spectrometer
- Matrix Assisted Laser Desorption Ionisation
- Peptides are mixed with matrix and then applied to wells on a target plate
- Peptide ions are generated by a laser firing at the target plate
- The time at which the laser fires and the arrival time of the ions at the detector are both known, so the relative masses can be calculated from the flight time
- Only singly charged ions are generated; other types of spectrometer may generate multiply charged ions
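The flight-time relationship described above can be sketched as follows. This is a minimal illustration assuming an ideal linear time-of-flight analyser, where an ion of charge z accelerated through a voltage U gains energy z*e*U and then drifts a length L. The function names, the 20 kV voltage and the 1 m drift length are assumptions for the example, not values from the slides; real instruments are calibrated against known standards rather than computed from first principles.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulombs
DALTON = 1.66053906660e-27   # one dalton in kilograms


def mz_from_flight_time(t_seconds, accel_volts, drift_metres):
    """Ideal linear TOF: z*e*U = 0.5*m*(L/t)^2, so m/z = 2*e*U*t^2 / L^2."""
    kg_per_charge = 2.0 * E_CHARGE * accel_volts * t_seconds ** 2 / drift_metres ** 2
    return kg_per_charge / DALTON


def flight_time_from_mz(mz, accel_volts, drift_metres):
    """Inverse relation, used here only to generate example flight times."""
    return drift_metres * math.sqrt(mz * DALTON / (2.0 * E_CHARGE * accel_volts))


if __name__ == "__main__":
    # Hypothetical instrument: 20 kV acceleration, 1 m drift tube.
    for mz in (1000.0, 2000.0, 3000.0):
        t = flight_time_from_mz(mz, 20000.0, 1.0)
        print(f"m/z {mz:7.1f} -> flight time {t * 1e6:.2f} microseconds")
```

Because flight time grows with the square root of m/z, heavier ions arrive later, which is what allows the masses to be separated.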
9MALDI Internals
10Micromass MALDI
11Typical Fingerprint Spectrum
12Isotopic Cluster
13Poorly Resolved Peak
14Protein Identification Using Peptide Mass Fingerprinting
- Produce a theoretical digest of all the proteins in a database with a specific enzyme
- Compare these theoretical masses with experimentally observed masses
- Assign a score to matching peptides/proteins
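The three steps above can be sketched in a few lines. This is an illustrative toy, not the MOWSE or MASCOT algorithm: the "score" is simply the number of observed masses that match a theoretical tryptic peptide within a tolerance, and the function names, the 0.2 Da tolerance and the two-protein database are assumptions made up for the example. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(sequence):
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def mh_mass(peptide):
    """Singly protonated monoisotopic mass, [M+H]+."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON


def count_matches(observed, sequence, tol_da=0.2):
    """Number of observed masses explained by the theoretical digest."""
    theoretical = [mh_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(o - t) <= tol_da for t in theoretical) for o in observed)


def rank_proteins(observed, database, tol_da=0.2):
    """Return (name, match count) pairs, best match first."""
    scores = [(name, count_matches(observed, seq, tol_da))
              for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sequences standing in for a protein database.
    toy_db = {"protA": "AAKGGGRLLLK", "protB": "WWKYYYRFFFK"}
    observed = [mh_mass(p) for p in tryptic_digest(toy_db["protA"])]
    print(rank_proteins(observed, toy_db))
```

A real scoring scheme would also weigh how likely each match is to occur by chance; a simple count is used here only to keep the comparison step visible.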
15Which Observed Masses to Include?
The optimum dataset for a peptide mass fingerprint is all the correct peptides and none of the wrong ones! By correct, we mean that the textbook cleavage rules were followed. In practice, this rarely (if ever) happens.
- Enzymatic cleavage not perfect (partials)
- Sequence coverage may be poor
- Mixtures and contamination
- Identifying real peaks
- Residue modifications
- Mass accuracy
16Choice of Enzyme
- Enzymes of low specificity are next to useless as they produce a complex mixture of similar masses
- Enzymes of high specificity may produce no cleaved peptide at all
- Trypsin is especially good since it ensures basic residues are at the C terminus of a peptide and so reduces their disruptive influence on peptide fragmentation
17Enzyme Specificity
Enzyme        Cleave At  Don't Cleave  N or C term
Trypsin       KR         P             C
Lys-C         K          P             C
Lys-C/P       K          (none)        C
Arg-C         R          P             C
V8-E          E          P             C
V8-DE         DE         P             C
Chymotrypsin  FYWLIVM    P             C
18Missed Cleavages
- Digests are usually not perfect
- Cleavage sites may be missed by an enzyme
- These incompletely cleaved peptides are known as partials
- Partials reduce the discrimination of a search
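One way to see why partials reduce discrimination is to enumerate them. The sketch below assumes the fully cleaved fragments are already available in sequence order and joins up to `max_missed` neighbouring fragments; the function name and parameter are illustrative, not taken from any particular search engine.

```python
def with_missed_cleavages(fragments, max_missed=1):
    """Return every peptide formed by joining up to max_missed+1 adjacent fragments."""
    peptides = []
    last = len(fragments) - 1
    for i in range(len(fragments)):
        # j is the index of the final fragment included in this peptide.
        for j in range(i, min(i + max_missed, last) + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


if __name__ == "__main__":
    perfect = ["AAK", "GGR", "LL"]   # made-up fully cleaved fragments
    for allowed in (0, 1, 2):
        print(allowed, with_missed_cleavages(perfect, allowed))
```

Each extra missed cleavage allowed adds more candidate masses per protein, so random matches become more likely.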
19Search Masses
- Select masses which are large enough to provide discrimination
- Larger masses are more likely to be imperfect cleavages
- Masses smaller than 500 Da are likely to be matrix
- With trypsin, a mass range of 1000 to 3000 Da is usually safe
- Mass tolerance is important in obtaining good discrimination
20Constraining Protein Mass
- To increase discrimination, the mass of the intact protein can be used in a search
- This is dangerous since the sample may be just a fragment of an entire protein
21Autolysis Products
- Some digests may be dominated by the autolysis peaks of the enzyme used
- In these cases, the known masses of these products may be filtered out
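The mass-range advice and the autolysis filtering described above amount to a simple peak filter. The sketch below is an assumption about how such a filter might look: the 1000 to 3000 Da window follows the trypsin guidance given earlier, while the function name, the 0.2 Da tolerance and the example masses (including the "known" autolysis mass) are hypothetical placeholders.

```python
def select_search_masses(peaks, low_da=1000.0, high_da=3000.0,
                         exclude=(), tol_da=0.2):
    """Keep peaks inside the mass window that do not match an excluded mass."""
    kept = []
    for mass in peaks:
        if not (low_da <= mass <= high_da):
            continue  # outside the window, e.g. likely matrix or partials
        if any(abs(mass - known) <= tol_da for known in exclude):
            continue  # matches a known contaminant such as an autolysis product
        kept.append(mass)
    return kept


if __name__ == "__main__":
    peaks = [450.2, 1200.6, 1500.1, 3500.0]   # hypothetical peak list
    autolysis = [1500.0]                      # hypothetical known mass
    print(select_search_masses(peaks, exclude=autolysis))
```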
22Residue Modifications
- Some residues may be modified during the sample preparation procedure
- This introduces discrepancies between the expected and observed masses
- For example, Met residues are often oxidised
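To illustrate the Met example, the sketch below lists the masses a peptide could show if any number of its methionines were oxidised, each oxidation adding one oxygen atom (about 15.995 Da). Treating oxidation as a variable modification in this way, and the function name itself, are assumptions made for the example.

```python
OXIDATION = 15.99491  # monoisotopic mass of one oxygen atom, in Da


def masses_with_met_oxidation(peptide, unmodified_mass):
    """Candidate masses with 0, 1, ... n oxidised Met residues."""
    n_met = peptide.count("M")
    return [unmodified_mass + k * OXIDATION for k in range(n_met + 1)]


if __name__ == "__main__":
    # Hypothetical peptide and unmodified mass.
    for mass in masses_with_met_oxidation("AMGMK", 1000.0):
        print(f"{mass:.4f}")
```

A peptide with n methionines yields n + 1 candidate masses, so every variable modification considered enlarges the search.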
23MOWSE
- One of the first programs for identifying proteins by peptide mass fingerprinting
- Developed by Darryl Pappin and Alan Bleasby
- Developed alongside the OWL non-redundant protein database
24Problems with MOWSE
- Databases had to be pre-indexed; these indexes are large and slow to build
- Does not handle variable modifications
- Indexing means that databases cannot easily be updated regularly
- Limited functionality
25MASCOT
- Take advantage of multi-processor systems
- Totally web based
- No pre-indexing of databases
- Increased functionality
- Copes with multiple modifications
- Easily expandable
- Increased speed
26Search Speed
Search speed is very important as databases increase in size and automation leads to a high throughput of samples. Also, if the algorithms are efficient, more elaborate searches may be undertaken, for instance with large numbers of variable residue modifications and different mass tolerances, to make more sense of data derived from mixtures or affected by contamination.
- Ability to use multiple processors when available
- Very efficient I/O; databases may also be mapped to memory
- Efficient cleavage site and mass calculation
27Thread Models
- Boss/Worker
- Peer
- Pipeline
- MASCOT is based on the Boss/Worker model
28Boss/Worker Model
The Boss accepts input and then distributes the
work to other threads
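A minimal sketch of the Boss/Worker idea, using Python's standard `threading` and `queue` modules. This is not MASCOT's implementation; the function name, the use of queues and the sentinel value are choices made for the illustration.

```python
import queue
import threading


def boss_worker(tasks, work_fn, n_workers=4):
    """Boss hands tasks to worker threads and collects results in task order."""
    task_q, result_q = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = task_q.get()
            if item is None:          # sentinel: no more work for this worker
                break
            index, payload = item
            result_q.put((index, work_fn(payload)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    # The boss distributes the work, then sends one sentinel per worker.
    for index, payload in enumerate(tasks):
        task_q.put((index, payload))
    for _ in threads:
        task_q.put(None)
    for t in threads:
        t.join()
    # All workers have finished, so every result is already queued.
    results = [None] * len(tasks)
    while not result_q.empty():
        index, value = result_q.get()
        results[index] = value
    return results


if __name__ == "__main__":
    print(boss_worker([1, 2, 3, 4, 5], lambda x: x * x, n_workers=3))
```

In standard CPython, threads do not run CPU-bound Python code in parallel, so this shows the structure of the model rather than a speed-up.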
29Peer Model
[Diagram: three threads, each producing its own Output]
Each Thread is responsible for its own input
30Pipeline Model
[Diagram: Input Stream -> Thread A -> Thread B -> Thread C -> Output]
A single thread accepts input, passing the data
on to the next thread for further processing
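The pipeline arrangement can be sketched with one thread per stage and a queue between neighbouring stages. The names `run_pipeline` and `stage` and the end-of-stream marker are assumptions for this example only.

```python
import queue
import threading

_DONE = object()  # end-of-stream marker passed down the pipeline


def stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)   # tell the next stage to stop as well
            break
        outbox.put(fn(item))


def run_pipeline(items, stage_fns):
    """Run items through stage_fns in order, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    output = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        output.append(item)
    for t in threads:
        t.join()
    return output


if __name__ == "__main__":
    # Three stages, standing in for Thread A, Thread B and Thread C.
    print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2, str]))
```

Because each stage is a single thread reading from a first-in, first-out queue, items leave the pipeline in the order they entered.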
35Peptide Mass Fingerprinting - Related Search Methods
- Masses may be combined with sequence information, e.g. 1234.5 seq(c-ABCD) seq(EF)
- These searches are very valuable as even small amounts of sequence information may be very discriminating
- Sequence information is derived from the partial interpretation of an MS/MS spectrum
- Known as the sequence tag method
36Composition Queries
- Composition information may also be used with mass information to refine queries
- Chemical or enzymatic analysis, such as N-terminal analysis with Edman degradation, may give composition information
- A typical query would be 1234.5 comp(2H0M)
37MASCOT Queries
- One of the most powerful features of MASCOT is the ability to mix all the types of query in one search
- MASCOT allows the user to specify a particular species to further increase search discrimination
38Databases Searched with Peptide Mass Fingerprint Data
- Non-identical protein databases are the ideal
- EST sequences are too short to contain meaningful information for these searches
- Non-redundant databases may be problematic
- MASCOT translates nucleic acid databases on the fly
39MSDB
- A non-identical protein sequence database designed for mass spectrometry searches
- Additional information, such as multiple species lines, in the textual information
- De-convolution of SWISSPROT and other sequences
- Nightly updates
- Links to source databases
40Is The Protein Identified?
- Most samples are identified using just peptide mass fingerprinting
- With the growth of databases, this trend will continue
- Some samples do not have representatives in any of the databases; to sequence these proteins, more analysis is required
42MS/MS Analysis
- Also known as tandem MS
- Individual peptides from the enzymatic digests are fragmented further
- From the resulting fragment ladder, sequences may be reconstructed
- Much more discriminating search than simple peptide mass fingerprinting
43MS/MS Analysis
- Carried out on nanospray/electrospray mass spectrometers
- Rather than being spotted on a target plate, the sample is introduced through an inlet from a capillary
- Peptides identified by the MALDI analysis are fragmented inside the mass spectrometer and the resultant daughter ions observed
44Stylized Nanospray Mass Spectrometer
45Micromass QTOF
46Finnigan Ion Trap
47Daughter Ions
- Unlike MALDI, electrospray/nanospray machines produce ions that may carry multiple charges
- Various types of ions are produced, categorized by their charge and their direction in the peptide sequence
- Fortunately the peptides fragment at the peptide bonds
48B and Y Fragment Ions
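[Diagram: peptide backbone from the NH2 terminus to the COOH terminus, fragmented at each peptide bond]
- Y-ions are numbered from the C to the N terminus (Y ion 1, 2, 3)
- B-ions are numbered from the N to the C terminus (B ion 1, 2, 3)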
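The B and Y series can be computed directly from residue masses. The sketch below assumes singly charged fragments and the usual conventions: a b-ion is the sum of its N-terminal residues plus a proton, and a y-ion is the sum of its C-terminal residues plus water plus a proton. The function name is illustrative and the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment masses, b1.. and y1..."""
    masses = [RESIDUE_MASS[r] for r in peptide]
    n = len(masses)
    # b_i: first i residues, counted from the N terminus.
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y_i: last i residues, counted from the C terminus.
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


if __name__ == "__main__":
    b_series, y_series = fragment_ions("PEPTIDE")
    for i, (b, y) in enumerate(zip(b_series, y_series), start=1):
        print(f"b{i} = {b:9.4f}   y{i} = {y:9.4f}")
```

For a peptide of n residues, b_i and y_(n-i) come from breaking the same bond, so their masses sum to the protonated peptide mass plus one further proton. This gives a useful consistency check when reading a spectrum.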
49Typical MS/MS Spectrum
Mass is on the X axis, intensity on the Y axis
50MASCOT Searches with MS/MS Data
- In a similar fashion to peptide mass fingerprinting, the predicted fragment ion masses from each peptide of a database sequence are calculated
- The calculated and observed ion masses are compared and given a score
- Individual peptide scores are combined to give a protein score
51Problems with MS/MS data
- The number of daughter ion types produced may be large and depends on the machine and analytical procedure used
- Searches tend to be run with a "no enzyme" option, which introduces a large number of calculations as peptide boundaries cannot be predicted
- Residue modifications are far more difficult to handle, the number of mass permutations being very large
62Databases Searched with MS/MS Data
- Non-identical protein databases are the ideal
- EST databases translated in 6 frames are very useful as individual peptides may be identified
- Translated nucleic acid databases
- Non-redundant databases create problems
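Six-frame translation, as mentioned for EST databases above, can be sketched as follows. The example uses the standard genetic code and marks stop codons with `*`; the function names and the choice of `X` for unrecognised codons are assumptions for this illustration, not a description of how MASCOT translates databases.

```python
BASES = "TCAG"
# Standard genetic code, ordered by first, second, then third base over TCAG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons give X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (forward, reverse)
            for offset in range(3)]


if __name__ == "__main__":
    for frame, protein in enumerate(six_frames("ATGAAATAA"), start=1):
        print(frame, protein)
```

Because the correct frame of an EST is not known in advance, all six translations are searched and any one of them may contain a matching peptide.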
63De-Novo Sequencing
- If the protein is still not identified, the sequence of a peptide has to be reconstructed from the MS/MS data
- Very time consuming and demands a great deal of skill; noisy data is very problematic
- Sequencing is carried out by finding mass differences between peaks that correspond to amino acid masses
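The mass-difference step can be sketched as below. The code takes consecutive peaks from a single ion series and lists every residue whose mass lies within a tolerance of each gap. It is a simplified illustration that assumes a clean, complete ladder; the function name and the 0.5 Da tolerance are assumptions. The example peaks are taken from the b-ion series listed on slide 66.

```python
# Monoisotopic residue masses in Da; Leu and Ile are isobaric, so share an entry.
RESIDUES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol_da=0.5):
    """For each gap between adjacent peaks, list the residues that fit it."""
    ordered = sorted(peaks)
    calls = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        gap = heavier - lighter
        calls.append([name for name, mass in RESIDUES.items()
                      if abs(gap - mass) <= tol_da])
    return calls


if __name__ == "__main__":
    ladder = [256.1, 327.3, 426.2, 555.5, 668.2]
    print(read_ladder(ladder))
    # Gln and Lys differ by only about 0.04 Da, so both fit this gap.
    print(read_ladder([1230.6, 1358.3]))
```

The second call shows the isobaric problem: at this mass accuracy a gap of about 128 Da cannot distinguish Gln from Lys.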
64Tags
- Easy to find initial masses in ladder
- Tags modify the fragmentation of the peptide
- Reduce isobaric problems
- Neutralise the adverse effects of certain
residues on peptide fragmentation
65Example Tags
66[Figure: annotated MS/MS spectrum; mass axis from 500 to 2000, intensity axis from 20 to 100]
y-ion series: 2044.9 1933.7 1862.7 1763.7 1634.6 1521.6 1450.4 1321.4 1220.3 1106.5 959.4 831.3 760.4 689.3 618.2 517.0 446.3 375.1
Sequence: Gln Ala Val Glu Xle Ala Glu Thr Asn Phe Gln Ala Ala Ala Thr Ala Ala Thr Lys
b-ion series: 256.1 327.3 426.2 555.5 668.2 739.2 868.2 969.2 1083.1 1230.6 1358.3 1429.4 1500.4 1571.5 1672.3 1743.5 1814.4 1915.5
68Automation
Automation is critical to maintain a high throughput of samples. It is essential to produce closer integration of machine control and data analysis software.
- New generation of mass spectrometers, quadrupole machines with laser sources
- Laboratory Information Management Systems
- Automated sample preparation
69Laboratory Information Management System
Mass Spectrometer
Data Reduction Peak Processing
Submission into Microarray/Proteomics database
MASCOT Search Engine
Re-search after database updates
Protein Identified
Protein not Identified
Automatic report generation for sample submitter
Via WWW
Results database
70Future of MASCOT
- Homology searching
- Post processing of results for easier interpretation
- Distributed processing (Linux cluster). MASCOT is based on the Boss/Worker model so is easy to port
- Development of a standard API to allow simpler automation and extensions to functionality
71MASCOT Homology Searching
- Identification is dependent on at least some of the peptide sequences being identical to a database sequence
- Homology searching (for instance allowing common substitutions to occur by default) would overcome this limitation
- Would lead to less selectivity and also increased search times
72Post processing of Results
- Allows easier interpretation by, e.g., removing all identical peptide matches from the report page
- Text mining to interpret the results of a search, for instance are all the proteins identified involved in a particular cellular process?
- Important when dealing with quantitative studies
73Distributed Processing
- Ability to use as much processing power as possible when dealing with high throughput data, for instance the thousands of peptides from LC-MS/MS
- Implemented in MASCOT using an MPI-style mechanism
- Has the ability to dynamically add/remove processors for data processing
74Processing Farm
75Standard Programming Interface
- A standard interface to MASCOT routines allowing users to, e.g., produce a bespoke interface
- Allows integration with instrument control software (although this is dependent on the goodwill of the manufacturers!)
76MSDB developments
- Inclusion of variable splicing regions from SWISSPROT
- Integration of textual information from all source databases
- Clustering of highly similar sequences into families with extra annotation
- Inclusion of more translations from nucleic acid databases
77Identification of proteins using short peptide sequences
- FASTS is the most commonly used tool at the moment, but it is relatively slow and doesn't take into account peptide masses and other information
- New functionality for MASCOT based on tri-peptide indices and using mass and residue modification information
78MS/MS Data Mining
- MS/MS data may contain useful information in addition to sequence
- Statistical methods for mining MS/MS data for, e.g., fragmentation efficiency
- Predictive tool for de-novo sequencing
- Understanding of physical/chemical processes involved in fragmentation
79PEDRo
- Software and schemata for modelling, capturing and disseminating proteomics experimental data
- Lack of such a system hinders the handling, exchange and dissemination of proteomics data
- Implemented in XML
- Analogous to the MIAME guidelines for transcriptomics
- http://pedro.man.ac.uk
80PEDRo Schema
81Matrix Science
- Dr. John Cottrell
- Dr. David Creasy
- URL: http://www.matrixscience.com
82Imperial College
- Prof. Darryl Pappin (now director of research ABI)
- Dr Mike Bartlet-Jones
- Dr Inga Bellahn
- URL: http://csc-fserve.hh.med.ic.ac.uk