Title: TOPP
1 TOPP - The OpenMS Proteomics Pipeline
2 Outline
- Motivation: typical steps in proteome analysis
- OpenMS: an open source software framework for LC/MS based proteomics
- TOPP: The OpenMS Proteomics Pipeline
- Example applications / use cases
3 Proteomics
Same genome...
...different proteomes!
4 Proteomics
- Interested in qualitative, quantitative, and dynamic aspects of a proteome
- Very complex biological samples
- Need for separation
- Cannot amplify proteins
- Cannot identify and quantitate proteins directly
5 TOPP Goals
- Large data volume (> 1 GB per experiment)
- Complex data, changing experimental designs
- Limitations of vendor-supplied software
- Data analysis currently limits the types of experiments which can be performed!
- Goal
- Bridge the gap between algorithms and applications (computer science / biology)
- Needed: flexible, easy-to-use software enabling complex experimental designs
6 Shotgun Proteomics
[Figure: proteins, digestion, peptides, separation]
- Idea
- Cannot analyze proteins directly? Digest proteins into peptides
- Separate peptides
- Identify and quantitate proteins through their peptides
7 Liquid Chromatography
8 Mass Spectrometry
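9 Raw HPLC-MS map
[Figure: MS scans (schematic); axes: rt, m/z, intensity]
10 Raw HPLC-MS map
[Figure: raw map; axes: rt, m/z, intensity]
11 Raw HPLC-MS map
[Figure: raw map with features marked; axes: rt, m/z, intensity]
12 Raw HPLC-MS map
[Figure: raw map with features marked; axes: rt, m/z, intensity]
But what was it?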
13 HPLC ESI QTOF MS ID
[Figure: HPLC coupled to ESI-QTOF-MS; MS2 spectrum (intensity vs. m/z); identification of the peptide KGFSPDGR]
14 Interpretation of tandem MS spectra
[Figure: annotated tandem MS spectrum, intensity vs. m/z, with y- and b-ion series and b - H2O neutral-loss peaks labeled]
15 Overview / Glossary
- HPLC: High-performance liquid chromatography. Peptides elute at different retention times (RT) from a chromatographic column
- MS: Mass spectrometry. Peptides have different mass/charge ratios (m/z) in a mass spectrum
- MS/MS: Tandem mass spectrometry. Selected ions in a mass spectrum can be further fragmented -> derived mass spectrum
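As a worked illustration of the m/z notion in the glossary, the following minimal Python sketch computes the monoisotopic m/z of a peptide ion from standard residue masses. The function name `peptide_mz` and the small mass table are illustrative assumptions and not part of OpenMS or TOPP; the peptide KGFSPDGR is the one shown on the identification slide.

```python
# Monoisotopic residue masses in Da (only the residues needed below).
RESIDUE_MASS = {
    "G": 57.02146, "S": 87.03203, "P": 97.05276, "D": 115.02694,
    "K": 128.09496, "F": 147.06841, "R": 156.10111,
}
WATER = 18.01056   # H2O for the free N- and C-terminus
PROTON = 1.00728   # mass of one proton


def peptide_mz(sequence, charge):
    """Return the monoisotopic m/z of a peptide carrying `charge` protons."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral_mass + charge * PROTON) / charge


# The same peptide appears at different m/z values depending on its charge.
for z in (1, 2, 3):
    print(z, round(peptide_mz("KGFSPDGR", z), 4))
```

The same peptide therefore shows up at several positions on the m/z axis, one per charge state.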
16 OpenMS - an open source software framework for shotgun proteomics
- OpenMS is a C++ library that provides solutions for many tasks in proteomics data processing
- Efficient data structures
- E.g., support for external memory (1 LC/MS map > 1 GB)
- D-dimensional kernel
- New algorithms for
- Signal processing of raw MS data
- Feature finding
- Superposition
- Identification
- Standard file formats, relational database support
- Visualization
17 OpenMS - an open source software framework for shotgun proteomics
- Design goals
- Extensibility
- Template code allows flexible and efficient reuse
- Interoperability
- Import/export of standard MS formats
- Robustness
- Black box unit testing
- Automated test builds on various platforms and architectures
- Usability
- HTML documentation of all classes
- Tutorial with examples
- Consistent coding style
18 TOPP - Motivation
- OpenMS is ...
- very powerful
19 TOPP - Motivation
- OpenMS is ...
- very powerful
- only usable for real programmers
20 TOPP design goals
- Inspired by EMBOSS
- one application for each frequently used analysis step
- functionality of OpenMS as easy-to-use standalone tools
- All applications share identical user interfaces -> Easy integration into workflow systems
- Uses PSI standard formats (mzData, mzXML, analysisXML)
- Can use a common XML configuration file
- Requirements: only familiarity with UNIX/Linux systems
- Comprehensive HTML documentation (Doxygen)
- GUI: TOPPView
21 TOPP tools
- File Handling: FileConverter, FileInfo, FileFilter, FileMerger, DTAExtractor
- Signal Processing: NoiseFilter, BaselineFilter, PeakPicker, SpectrumFilter
- Quantitation: FeatureFinder
- Identification: RTModel, RTPredict, MascotAdapter, InspectAdapter
- Map Alignment: UnlabeledMatcher, MapMatcher, Dewarper
- Isotope Labeling: LabeledMatcher
- Analysis: IDFilter, AdditiveSeries, TOPPView, MapStatistics
29 Example Pipeline: Peptide identification using LC-MS/MS
- Input: HPLC-MS(-MS) raw data
- Intermediate steps
- Output: reliable protein/peptide identifications
30 Example Pipeline: Identification
HPLC-MS(-MS) raw data -> extraction of tandem-MS spectra -> MS-MS raw data
31 Example Pipeline: Identification
MS-MS raw data -> noise filtering -> smoothed MS-MS raw data
32 Example Pipeline: Identification
smoothed MS-MS raw data -> peak picking -> MS-MS peak data
33 Peak picking using wavelet techniques
[Figure panels: original spectrum; spectrum filtered with Marr wavelet; local maxima -> peak centroids; picked peak spectrum]
- Attributes of each picked peak: RT, m/z, intensity, FWHM, skew, quality, ...
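The idea on this slide can be sketched in a few lines of Python: filter the raw spectrum with a Marr (Mexican hat) wavelet, take local maxima of the filtered signal, and report intensity-weighted centroids. This is a simplified illustration on an evenly sampled synthetic spectrum, not the PeakPicker implementation; all function names, parameter defaults, and the test spectrum are assumptions.

```python
import math


def marr_wavelet(t, sigma):
    """Marr (Mexican hat) wavelet: negated second derivative of a Gaussian."""
    x = (t / sigma) ** 2
    return (1.0 - x) * math.exp(-x / 2.0)


def filter_spectrum(intensities, sigma, half_width):
    """Correlate an evenly sampled spectrum with the sampled Marr wavelet."""
    offsets = range(-half_width, half_width + 1)
    kernel = [marr_wavelet(k, sigma) for k in offsets]
    n = len(intensities)
    filtered = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            if 0 <= i + k < n:
                acc += w * intensities[i + k]
        filtered.append(acc)
    return filtered


def pick_peaks(mz, intensities, sigma=2.0, half_width=8, threshold=1.0):
    """Return (centroid m/z, summed intensity) per local maximum of the filtered signal."""
    filtered = filter_spectrum(intensities, sigma, half_width)
    peaks = []
    for i in range(1, len(mz) - 1):
        is_max = filtered[i - 1] < filtered[i] >= filtered[i + 1]
        if is_max and filtered[i] > threshold:
            lo, hi = max(0, i - half_width), min(len(mz), i + half_width + 1)
            total = sum(intensities[lo:hi])
            centroid = sum(m * y for m, y in zip(mz[lo:hi], intensities[lo:hi])) / total
            peaks.append((centroid, total))
    return peaks


# Synthetic raw spectrum (assumption): two Gaussian peaks at m/z 500.5 and 502.5.
mz = [499.0 + 0.05 * i for i in range(101)]
raw = [100.0 * math.exp(-0.5 * ((m - 500.5) / 0.1) ** 2)
       + 50.0 * math.exp(-0.5 * ((m - 502.5) / 0.1) ** 2) for m in mz]
peaks = pick_peaks(mz, raw)
for centroid, total in peaks:
    print(round(centroid, 3), round(total, 1))
```

The wavelet width `sigma` is given in sample points here; in practice it would be chosen to match the expected peak width of the instrument.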
34 Example Pipeline: Identification
MS-MS peak data -> identification using [name lost in extraction] -> protein/peptide identifications
35 Example Pipeline: Identification
protein/peptide identifications -> filtering of identifications -> reliable protein/peptide identifications
36 Example Pipeline: Identification
[Figure: overview of the TOPP tools (same tool list as on slide 21)]
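42 Example Pipeline: Identification
- HPLC-MS(-MS) raw data (mzData) -> FileFilter -> MS-MS raw data (mzData)
- MS-MS raw data (mzData) -> NoiseFilter -> smoothed MS-MS raw data (mzData)
- smoothed MS-MS raw data (mzData) -> PeakPicker -> MS-MS peak data (mzData)
- MS-MS peak data (mzData) -> InspectAdapter -> protein/peptide identifications (analysisXML)
- protein/peptide identifications (analysisXML) -> IDFilter -> reliable protein/peptide identifications (analysisXML)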
43 Example Pipeline: Identification
- The TOPP tools can be combined into workflows in many ways (e.g., Makefiles, shell scripts, complex workflow systems)
- For simplicity, we shall use shell scripts in this example.
44 Example Pipeline: Identification
id.sh:
    FileFilter     -in raw.mzData       -out tandem_ms.mzData   -level 2
    NoiseFilter    -in tandem_ms.mzData -out smoothed.mzData    -ini id.ini
    PeakPicker     -in smoothed.mzData  -out peaks.mzData       -ini id.ini
    InspectAdapter -in peaks.mzData     -out id.analysisXML     -ini id.ini
    IDFilter       -in id.analysisXML   -out result.analysisXML -ini id.ini
45 Example Pipeline: Identification
id.ini:
    <PARAMETERS>
      ...
      <NODE name="InspectAdapter">
        <ITEM name="protease" value="Trypsin" type="string" />
        <ITEM name="PM_tolerance" value="1.0" type="float" />
        <ITEM name="ion_tolerance" value="0.3" type="float" />
      </NODE>
      ...
    </PARAMETERS>
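To make the structure of the common configuration file concrete, here is a minimal Python sketch that reads the parameters of one tool from an INI fragment like the one above. It only illustrates the PARAMETERS / NODE / ITEM layout shown on the slide; the helper `read_tool_parameters`, the quoting of the `type` attribute, and the `int` entry in the cast table are assumptions, and this is not how the TOPP tools themselves parse the file.

```python
import xml.etree.ElementTree as ET

# INI fragment as on the slide, with the elided parts left out.
INI_TEXT = """
<PARAMETERS>
  <NODE name="InspectAdapter">
    <ITEM name="protease" value="Trypsin" type="string" />
    <ITEM name="PM_tolerance" value="1.0" type="float" />
    <ITEM name="ion_tolerance" value="0.3" type="float" />
  </NODE>
</PARAMETERS>
"""

# Map the declared type to a Python constructor ("int" is an assumed extra).
CASTS = {"string": str, "float": float, "int": int}


def read_tool_parameters(ini_text, tool):
    """Collect the ITEM entries below the NODE whose name equals `tool`."""
    root = ET.fromstring(ini_text)
    params = {}
    for node in root.iter("NODE"):
        if node.get("name") != tool:
            continue
        for item in node.findall("ITEM"):
            cast = CASTS.get(item.get("type"), str)
            params[item.get("name")] = cast(item.get("value"))
    return params


params = read_tool_parameters(INI_TEXT, "InspectAdapter")
print(params)
```

Because every tool reads its own NODE, a single file such as id.ini can configure all steps of the pipeline.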
46 Myoglobin as diagnostic marker
- Myoglobin
- 17 kDa protein
- stores oxygen in skeletal and heart muscle
- released into serum after a myocardial infarct
- important parameter for blood re-circulation after thrombolytic therapy
- healthy people: 30-90 ng/mL; diseased: > 100-1000 ng/mL
47 Additive Method
[Figure: intensity vs. measurements]
48 Example Pipeline: Quantification
[Figure: overview of the TOPP tools (same tool list as on slide 21)]
49 Example Pipeline: Quantification
quant.sh:
    # Pipeline for Myoglobin Absolute Quantitation

    # Find features in all 32 individual maps.
    for i in $(seq 1 32); do
      # Truncate raw data maps (to save time).
      FileFilter -ini AddSeries.ini -n $i
      # Collect peptide features.
      FeatureFinder -ini AddSeries.ini -n $i
    done

    # Star-like matching (31 edges).
    for i in $(seq 1 32); do
      # Map features across different maps.
      UnlabeledMatcher -ini AddSeries.ini -n $i
      MapMatcher -ini AddSeries.ini -n $i
      Dewarper -ini AddSeries.ini -n $i
    done

    # Compute final concentration (lin. regression).
    AdditiveSeries -ini AddSeries.ini
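The last step of quant.sh computes the concentration by linear regression. A minimal sketch of how such an estimate can be obtained with the standard addition method is given below: known amounts are spiked into the sample, a line is fitted to response versus spiked amount, and the unknown concentration is the intercept divided by the slope. That AdditiveSeries works exactly this way is an assumption; the function names and the noise-free data are purely illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standard_addition_concentration(spiked, response):
    """Unknown concentration = intercept / slope (magnitude of the x-intercept)."""
    slope, intercept = fit_line(spiked, response)
    return intercept / slope


# Synthetic example (assumption): true concentration 0.4, response factor 0.25.
spiked = [0.0, 0.5, 1.0, 1.5, 2.0]
response = [0.25 * (c + 0.4) for c in spiked]
estimate = standard_addition_concentration(spiked, response)
print(round(estimate, 3))
```

With real measurements the fit would also yield a confidence interval, as reported on the results slide.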
51 Feature Finding in LC/MS raw data
- Fit a two-dimensional model to selected regions of the LC/MS map
52 Raw Map -> Feature Map
Feature Finding reduces the volume of data by several orders of magnitude
54 Direct differential quantitation
Assign pairs across two or more maps
- RT of peptide may vary between maps
- compute suitable mapping by pose clustering
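Pose clustering can be sketched as a voting scheme: every pair of features in one map, matched against every pair in the other, proposes a transformation, and the transformation with the most votes wins. The minimal Python version below estimates a linear RT mapping rt_b = scale * rt_a + shift. It is an illustration of the general technique only; the function name, bin sizes, plausibility bounds, and test data are assumptions and do not describe the MapMatcher implementation.

```python
from collections import Counter
from itertools import combinations


def pose_cluster_rt(rts_a, rts_b, scale_bin=0.01, shift_bin=1.0):
    """Estimate rt_b ~ scale * rt_a + shift by voting over pair hypotheses."""
    rts_a, rts_b = sorted(rts_a), sorted(rts_b)
    votes = Counter()
    for a1, a2 in combinations(rts_a, 2):
        if a1 == a2:
            continue
        for b1, b2 in combinations(rts_b, 2):
            # Hypothesis: a1 corresponds to b1 and a2 corresponds to b2.
            scale = (b2 - b1) / (a2 - a1)
            if not 0.5 < scale < 2.0:   # discard implausible hypotheses
                continue
            shift = b1 - scale * a1
            votes[(round(scale / scale_bin), round(shift / shift_bin))] += 1
    (scale_idx, shift_idx), _ = votes.most_common(1)[0]
    return scale_idx * scale_bin, shift_idx * shift_bin


# Synthetic maps (assumption): map B is map A stretched by 1.1 and shifted by 30 s,
# plus one feature that has no partner in map A.
rts_a = [100.0, 260.0, 410.0, 620.0, 900.0, 1300.0]
rts_b = [1.1 * rt + 30.0 for rt in rts_a] + [555.0]
scale, shift = pose_cluster_rt(rts_a, rts_b)
print(round(scale, 2), round(shift, 1))
```

Correct correspondences all vote for the same bin, while wrong ones scatter over the accumulator, which makes the scheme tolerant of unmatched features.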
56 Results
- automated
- Ca. 2 min (CompLife '05)
[Figure: T11hu HGATVLTALGGILK with IS T10ho; x-axis: concentration (ng/µl), 0 to 3; y-axis: relative peak area, 0 to 0.9]
0.382 ng/µl (0.31-0.45)
0.48 ng/µl (0.42-0.55)
Expected value: 0.47 ng/µl myoglobin
57 Availability
- GNU Lesser General Public License (LGPL)
- Currently runs under
- Linux
- (Mac OS X)
- Other platforms will follow
- TOPP is hosted at SourceForge as part of OpenMS
- Latest version: 0.95, released for ECCB '06
- Project web page: www.openms.de
58 Summary
- TOPP is a set of tools covering a wide range of frequently used data analysis steps in LC/MS based proteomics
- Flexible
- one tool for each different task
- common interfaces and standard file formats, easily integrated into workflow systems
- built upon OpenMS
- bridges the gap between algorithm designers and experimenters
59 The OpenMS Team
Oliver Kohlbacher
Andreas Bertsch
Nico Pfeifer
Marc Sturm
Instr. Analysis and Bioanalysis, Saarland University
60 That's all, essentially
61 Appendix
62 Peak picking using wavelet techniques
63 Tandem Mass Spectrometry
- Trap certain ions inside a mass spectrometer
- Ions are further fragmented, e.g. by collision with a noble gas
- Analyze the derived ions in another mass spectrometer
- -> Tandem MS
64 Tandem Mass Spectra
- Most frequently observed fragments: b- and y-ions
- But other bonds can break as well: a-, b-, c- and x-, y-, z-ions
- Side chain reactions ...
- Neutral losses ...
- Noise ...
65 Tandem Mass Spectra
[Figure: backbone of a four-residue peptide (side chains R1 to R4) with cleavage sites labeled a2, a3, b2, b3, y1, y2, y3 and the neutral-loss fragments b2 - H2O, b3 - NH3, y2 - NH3, y3 - H2O]
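The fragment nomenclature can be made concrete with a short calculation. The Python sketch below lists the singly charged b- and y-ion m/z values of a peptide: a b-ion is the sum of the N-terminal residues plus one proton, a y-ion is the sum of the C-terminal residues plus water and one proton. The function name and the reduced mass table are illustrative assumptions; neutral losses such as b - H2O would simply subtract the mass of water.

```python
# Monoisotopic residue masses in Da (only the residues needed below).
RESIDUE_MASS = {
    "G": 57.02146, "S": 87.03203, "P": 97.05276, "D": 115.02694,
    "K": 128.09496, "F": 147.06841, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(sequence):
    """Return dicts of singly charged b- and y-ion m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = {}, {}
    for i in range(1, len(masses)):
        b_ions[f"b{i}"] = sum(masses[:i]) + PROTON           # N-terminal fragment
        y_ions[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON  # C-terminal fragment
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("KGFSPDGR")
for name, mz in list(b_ions.items()) + list(y_ions.items()):
    print(name, round(mz, 4))
```

Complementary fragments (for example b2 and y6 of an eight-residue peptide) add up to the neutral peptide mass plus two protons, which is a useful consistency check when interpreting a spectrum.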
66 Average isotopic distribution vs. mass of peptide
67 Feature Models: m/z
68 Feature Models: RT
- Elution profiles
- Can be modeled by a normal distribution or a skewed distribution (EMG, log-normal, ...)
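As an illustration of a skewed elution profile, the following Python sketch evaluates the exponentially modified Gaussian (EMG), i.e. a Gaussian convolved with an exponential decay, next to a plain Gaussian. The parameter values are arbitrary assumptions chosen to resemble a chromatographic peak; the code only demonstrates the shape of the model, not how OpenMS fits it.

```python
import math


def gaussian(t, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def emg(t, mu, sigma, lam):
    """Exponentially modified Gaussian density with decay rate lam."""
    arg = (mu + lam * sigma ** 2 - t) / (math.sqrt(2.0) * sigma)
    scale = 0.5 * lam * math.exp(0.5 * lam * (2.0 * mu + lam * sigma ** 2 - 2.0 * t))
    return scale * math.erfc(arg)


# Assumed parameters: Gaussian centre 600 s, width 5 s, tailing time constant 10 s.
MU, SIGMA, LAM = 600.0, 5.0, 0.1
DT = 0.1
times = [550.0 + DT * i for i in range(2000)]
profile = [emg(t, MU, SIGMA, LAM) for t in times]

area = sum(profile) * DT                                        # numerical integral
mean_rt = sum(t * y for t, y in zip(times, profile)) * DT / area
print(round(area, 4), round(mean_rt, 2))
print(gaussian(620.0, MU, SIGMA) < emg(620.0, MU, SIGMA, LAM))  # heavier right tail
```

The tailing shifts the mean retention time to mu + 1/lam, which is why a symmetric model tends to misplace strongly tailing peaks.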
69 Feature Finding
[Figure: feature model combining an isotope pattern (m/z) and an elution profile (RT)]
70 Feature finding algorithm
- The algorithm for feature finding consists of four main phases
- Seeding: Choose a starting point
- Extension: Find a surrounding region (maybe too large)
- Modeling: Fit a two-dimensional model to the data
- Adjusting: Retain only those data points that are compatible with the model
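The four phases can be sketched as a toy example in Python. The version below uses a plain two-dimensional Gaussian fitted by intensity-weighted moments instead of an isotope pattern and elution profile model, so it is a deliberately simplified stand-in for the FeatureFinder; the function name, window sizes, and synthetic data are all assumptions.

```python
import math


def find_feature(points, rt_window=30.0, mz_window=0.5, n_sigma=3.0):
    """points: list of (rt, mz, intensity). Returns (model, retained points)."""
    # 1. Seeding: start at the most intense data point.
    seed_rt, seed_mz, _ = max(points, key=lambda p: p[2])
    # 2. Extension: take a generous box around the seed (maybe too large).
    region = [p for p in points
              if abs(p[0] - seed_rt) <= rt_window and abs(p[1] - seed_mz) <= mz_window]
    # 3. Modeling: fit a 2D Gaussian via intensity-weighted moments.
    total = sum(p[2] for p in region)
    mu_rt = sum(p[0] * p[2] for p in region) / total
    mu_mz = sum(p[1] * p[2] for p in region) / total
    sd_rt = math.sqrt(sum(p[2] * (p[0] - mu_rt) ** 2 for p in region) / total)
    sd_mz = math.sqrt(sum(p[2] * (p[1] - mu_mz) ** 2 for p in region) / total)
    model = {"rt": mu_rt, "mz": mu_mz, "sd_rt": sd_rt, "sd_mz": sd_mz}
    # 4. Adjusting: retain only points within n_sigma of the model centre.
    retained = [p for p in region
                if abs(p[0] - mu_rt) <= n_sigma * sd_rt
                and abs(p[1] - mu_mz) <= n_sigma * sd_mz]
    return model, retained


# Synthetic map (assumption): one feature at RT 600 s, m/z 500, plus a stray point.
points = [(580.0 + 2.0 * i, 499.8 + 0.02 * j,
           1000.0 * math.exp(-0.5 * ((2.0 * i - 20.0) / 4.0) ** 2)
           * math.exp(-0.5 * ((0.02 * j - 0.2) / 0.03) ** 2))
          for i in range(21) for j in range(21)]
contaminant = (625.0, 500.4, 50.0)
points.append(contaminant)

model, retained = find_feature(points)
print({k: round(v, 3) for k, v in model.items()}, len(retained))
```

The stray point falls inside the extension box but is dropped in the adjusting phase because it is incompatible with the fitted model.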
71 Pair Matching
- Derive empirical score from the 2D distance distribution of manually annotated pairs (right)
- Search for matching features for a given feature (left, red) in a bounding box (left, blue)
- Score pairs with respect to p-value
- Use greedy algorithm to assign pairs