Title: Preprocessing and Biomarker Extraction Methods in Mass Spectrometry
Preprocessing and Biomarker Extraction Methods in Mass Spectrometry
- Julien PRADOS
- Centre Universitaire d'Informatique
- Université de Genève
- Geneva Artificial Intelligence Laboratory
- 15-06-2007
Introduction (1/2): Mass Spectrometry (MS)
- SELDI-MS is a technology whose signal (mass spectrum) reflects the expression of the proteins present in a specimen
- The intensity (area) of a peak is linked to the quantity of the corresponding protein in the specimen
- x-axis: mass (/charge) of the proteins
- y-axis: number of proteins with mass x
[Figure: two example spectra — Specimen 1 (Positive) with Protein A at mass 4770.7 Da, Specimen 2 (Negative) with Protein B at mass 9226.5 Da]
Introduction (2/2): MS Learning
- Goals
- Diagnosis: recognize the group of a specimen from its mass spectrum (classification).
- Biomarker extraction: find a small number of proteins whose expression allows the diagnosis of the specimens (feature selection).
- General context: knowledge extraction from signals → potential impact on
- image analysis, video analysis, audio analysis, seismologic data, cardiac data, climatic data, financial data, ... all digital devices with sensors
Outline: PhD Contributions
- Preprocessing of Mass Spectra
- Peak Picking
- Peak Alignment
- Reproducibility of MS technology
- Preprocessing Evaluation
- Learning (Feature Selection)
- feature selection stability
- logRatio kernel for SVM-based FS
Outline: PhD Contributions
- Preprocessing of Mass Spectra
- Peak Picking
- Peak Alignment
- Reproducibility of MS technology
- Preprocessing Evaluation
- Learning (Feature Selection)
- feature selection stability
- logRatio kernel for SVM-based FS
Preprocessing (1/6): A General ML Issue
- Preliminary step of learning whose goals are to
- structure the data
- clean the information and keep only the useful part
- integrate domain knowledge
- choose a representation
- Present in every machine learning problem, it has a heavy impact on learning performance and takes most of the data miner's time
- → Much current research indirectly tries to facilitate preprocessing
- learning from complex structures (graphs, trees)
- kernel methods: change the data representation just by changing a function
Preprocessing (2/6): MS Data Preprocessing
- The ideal dataset is a protein expression matrix: a matrix of the concentrations of the proteins in the specimens
- Its construction from mass spectra involves several steps:
- peak picking, peak alignment, baseline correction, peak area estimation, noise level estimation
Preprocessing (3/6): Peak Picking (1/2)
- [CBMS06]
- Based on a relative peak characterization and the notion of valley
- Valley of a point p: the minimum-intensity points located on the left and on the right of p s.t. there is no point with intensity greater than p between them.
- Peaks: points with the deepest valleys
[Figure: valley depth of a point in a spectrum]
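A minimal sketch of this valley-depth idea (my own illustrative code, not the thesis implementation, which is reported as O(n) while this naive walk is quadratic in the worst case): for each point, walk left and right without crossing a point higher than it, record the lowest intensity reached on each side, and score the point by the smaller of the two depths.

```python
import numpy as np

def valley_depths(y):
    """Valley depth of every point of a spectrum (prominence-like score)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    depths = np.zeros(n)
    for i in range(n):
        # left valley: lowest point reachable leftwards without exceeding y[i]
        left_min, j = y[i], i - 1
        while j >= 0 and y[j] <= y[i]:
            left_min = min(left_min, y[j])
            j -= 1
        # right valley: same walk towards the right
        right_min, j = y[i], i + 1
        while j < n and y[j] <= y[i]:
            right_min = min(right_min, y[j])
            j += 1
        depths[i] = min(y[i] - left_min, y[i] - right_min)
    return depths

def pick_peaks(y, k=10):
    """Return the indices of the k points with the deepest valleys."""
    return np.argsort(valley_depths(y))[::-1][:k]
```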
Preprocessing (4/6): Peak Picking (2/2)
- Benefits
- Parameter-less
- Provides a confidence measure for peaks
- No assumption on peak width
- Robust to baseline effects
- Doesn't require signal smoothing
- Extends to 2D peak detection (LC-MS images), even nD peak detection (object detection in videos)
- Fast: O(n)
- Drawback
- Cannot resolve overlapping peaks
Preprocessing (5/6): MS Experiment Alignment
- Done by agglomerative hierarchical clustering with constraints (see the sketch below):
- at each iteration, merge the closest clusters
- except if the new cluster would contain two peaks from the same spectrum
- or if the peak error in the new cluster would exceed a maximum value
- Two strategies for missing values
- set missing values to zero, because there is no peak
- retrieve the signal intensity from the spectra (not obvious for peak areas)
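A minimal sketch of such constrained agglomerative clustering, under my reading of the slide; the names (`peaks` as (spectrum_id, mass) pairs, `max_error` as the allowed mass spread inside an aligned cluster) are illustrative assumptions, not the thesis interface.

```python
import numpy as np

def align_peaks(peaks, max_error=0.3):
    """Greedy agglomerative clustering of peak positions with constraints."""
    # start with one cluster per detected peak
    clusters = [{"spectra": {s}, "masses": [m]} for s, m in peaks]
    centre = lambda c: float(np.mean(c["masses"]))
    while True:
        best = None
        # look for the closest pair of clusters that satisfies both constraints
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if a["spectra"] & b["spectra"]:
                    continue  # constraint 1: no two peaks from the same spectrum
                masses = a["masses"] + b["masses"]
                if max(masses) - min(masses) > max_error:
                    continue  # constraint 2: bounded peak error in the new cluster
                d = abs(centre(a) - centre(b))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None:
            break  # no admissible merge left
        _, i, j = best
        merged = {"spectra": clusters[i]["spectra"] | clusters[j]["spectra"],
                  "masses": clusters[i]["masses"] + clusters[j]["masses"]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```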
Preprocessing (6/6): MS Preprocessing Generalisation
- Peak picking: analyse each instance individually and identify the elements it contains
- Peak alignment: identify the elements of different instances that relate to the same information
Outline: PhD Contributions
- Preprocessing of Mass Spectra
- Peak Picking
- Peak Alignment
- Reproducibility of MS technology
- Preprocessing Evaluation
- Learning (Feature Selection)
- feature selection stability
- logRatio kernel for SVM-based FS
SELDI-TOF Reproducibility (1/6): Overview
- The fact: when an experiment is repeated several times, the mass spectra look different.
- General problem in MS: e.g. over 3 LC-MS experiments only 20% of the proteins are identified in common!
- [JChromatB06] SELDI-TOF reproducibility
- How reproducible is the SELDI technology?
- How does the experimental protocol affect reproducibility?
- How similar are the protocols?
SELDI-TOF Reproducibility (2/6): Experimental Setup
- 1 urine sample (pool of 4 individuals)
- 8 experimental protocols differing in
- urine dilution
- matrices
- ultrafiltration methods
- for each experimental condition: 50 spectra (10 x 5 replicates)
SELDI-TOF Reproducibility (3/6): Peak Picking and Alignment
[Figure: peak picking and alignment pipeline, applied to the spectra of each of the 8 protocols]
SELDI-TOF Reproducibility (4/6): Results
- Estimate the percentage of peaks found in at least 10/10, 9/10, 8/10, ..., 1/10 replicates (see the sketch below)
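A small sketch of how this reproducibility curve can be computed from the aligned peaks; it assumes the cluster representation used in the `align_peaks` sketch above, where each aligned peak carries the set of replicate spectra in which it was found.

```python
import numpy as np

def reproducibility_curve(clusters, n_replicates=10):
    """Fraction of aligned peaks found in at least k of the replicates,
    for k = 1 .. n_replicates."""
    counts = np.array([len(c["spectra"]) for c in clusters])
    return {k: float(np.mean(counts >= k)) for k in range(1, n_replicates + 1)}
```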
SELDI-TOF Reproducibility (5/6): Protocol Complementarity
- How similar are the protocols? How do they complement each other?
- → Hierarchical clustering of the protocols using the Tanimoto distance between the sets of peaks they find (see the sketch below)
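A minimal sketch of this protocol clustering, assuming the peaks have already been aligned into shared identifiers; the "average" linkage is my own choice here, the slides do not state which linkage was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def tanimoto_distance(peaks_a, peaks_b):
    """1 - |A ∩ B| / |A ∪ B| between two sets of aligned peak identifiers."""
    a, b = set(peaks_a), set(peaks_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def cluster_protocols(peak_sets):
    """`peak_sets` maps protocol name -> set of aligned peak ids;
    returns the protocol names and a hierarchical-clustering linkage matrix."""
    names = sorted(peak_sets)
    n = len(names)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = tanimoto_distance(peak_sets[names[i]],
                                                        peak_sets[names[j]])
    return names, linkage(squareform(dist), method="average")
```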
SELDI-TOF Reproducibility (6/6): Final Comment
- Reproducibility is a necessary condition but is not sufficient: it is easy to define a very reproducible peak picking that nevertheless extracts information irrelevant to the diagnostic problem
- → we also need to assess the quality of the preprocessed information
Outline: PhD Contributions
- Preprocessing of Mass Spectra
- Peak Picking
- Peak Alignment
- Reproducibility of MS technology
- Preprocessing Evaluation
- Learning (Feature Selection)
- feature selection stability
- logRatio kernel for SVM-based FS
Preprocessing Evaluation (1/1): Choosing a Data Representation
- [CBMS06]
- Is information lost in preprocessing?
- How do the different representations affect the information content?
- → Compare the classification performance of
- Support Vector Machines (SMO)
- Decision Tree (J48)
- Nearest Neighbor (IBk)
- on datasets
- Stroke (Stk)
- Prostate Cancer (Pro)
- Ovarian Cancer (Ova)
- for different preprocessing options (a comparison sketch follows)
- no preprocessing (raw)
- peak intensity, missing values retrieved from the raw data (is)
- peak intensity, zero for missing values (iz)
- peak area, zero for missing values (az)
[Figure: classification error per classifier, dataset and representation]
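A minimal sketch of such a comparison; the thesis experiments used the Weka implementations (SMO, J48, IBk), so the scikit-learn classifiers below are only rough analogues, and `datasets` (mapping representation name to an (X, y) pair) is a hypothetical input format.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def compare_representations(datasets):
    """Print the 10-fold CV error of each classifier on each representation."""
    classifiers = {
        "SVM": SVC(kernel="linear"),
        "DecisionTree": DecisionTreeClassifier(),
        "1-NN": KNeighborsClassifier(n_neighbors=1),
    }
    for rep, (X, y) in datasets.items():
        for name, clf in classifiers.items():
            acc = cross_val_score(clf, X, y, cv=10)
            print(f"{rep:>4} {name:>12}  error = {1 - np.mean(acc):.3f}")
```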
Outline: PhD Contributions
- Preprocessing of Mass Spectra
- Peak Picking
- Peak Alignment
- Reproducibility of MS technology
- Preprocessing Evaluation
- Learning (Feature Selection)
- feature selection stability
- logRatio kernel for SVM-based FS
Learning - FS (1/10): Problem Presentation (1/1)
- In preprocessing, we do our best to reduce the information and to focus on the relevant part, ignoring the groups of specimens (unsupervised)
- For the biomarker extraction task, the learning goal is to reduce the information further, taking into account the class labels of the specimens (supervised FS)
Learning - FS (2/10): Feature Selection Stability (1/3)
- In biomarker extraction from MS, the stability of the selected features is also crucial
- it is preferable that a FS method select the same biomarkers when small data perturbations are introduced
- [ICDM05, KIS07] introduce a framework for the estimation of FS stability (see the sketch below)
- apply FS on the different folds of a CV, which guarantees 22% of non-overlapping instances between training sets
- estimate the mean attribute overlap over the 10 selected feature sets
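A minimal sketch of this stability framework, assuming a simple normalized overlap |A ∩ B|/k as the similarity between two selected feature sets (the exact stability measure of [ICDM05, KIS07] may differ); `select_features` is a hypothetical callback standing in for the FS method under study.

```python
import numpy as np
from sklearn.model_selection import KFold

def selection_stability(X, y, select_features, k=10, n_folds=10):
    """Run FS on each CV training fold and return the mean pairwise overlap
    between the k-feature sets it selects. `select_features(X, y, k)` must
    return the indices of the k selected features."""
    feature_sets = []
    for train_idx, _ in KFold(n_splits=n_folds, shuffle=True).split(X):
        feature_sets.append(set(select_features(X[train_idx], y[train_idx], k)))
    overlaps = [len(a & b) / k
                for i, a in enumerate(feature_sets)
                for b in feature_sets[i + 1:]]
    return float(np.mean(overlaps))
```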
Learning - FS (3/10): Feature Selection Stability (2/3)
- With this framework the stability of FS methods can be compared
- Information Gain
- ReliefF
- linear-SVMRFE
- linear-SVM (no RFE)
- Random FS
- The stability of Random FS can be computed theoretically
- Results on MS
- RF > SVMONE > SVMRFE > IG > Random
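For intuition on why the random baseline admits a closed form (my own note, assuming the simple normalized overlap |A ∩ B|/k used in the sketch above): two independent random subsets of size k drawn from n attributes overlap in k²/n attributes on average, so the expected normalized overlap is k/n.

```python
import numpy as np

def random_fs_stability(n_attributes, k, n_trials=10_000):
    """Monte-Carlo check of the closed form k/n for the expected
    normalized overlap of two random feature subsets of size k."""
    rng = np.random.default_rng(0)
    overlaps = [len(set(rng.choice(n_attributes, k, replace=False)) &
                    set(rng.choice(n_attributes, k, replace=False))) / k
                for _ in range(n_trials)]
    return float(np.mean(overlaps)), k / n_attributes  # empirical vs. closed form
```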
Learning - FS (4/10): Feature Selection Stability (3/3)
- Stability is a necessary condition but is not sufficient: the predictive power of the selected features has to be verified
- Results show a general superiority in performance of SVM-based feature sets
- FS stability and performance → concentrate on SVM-based FS
Learning - FS (5/10): LogRatio Kernel (1/6)
[Figure: the LogRatio feature map Φ(X) (e.g. component Φ(X)_{1,3}) visualized over log(X) for attributes 1, 2, 3, ..., n]
- Biological motivations
- The log of a ratio is often used to compare expressions (genomics)
- Pair-wise combination focuses learning on protein-protein interactions
- Insensitive to instance normalization
- Similar to a polynomial kernel of degree 2, but instead of pair-wise products of attributes, the LogRatio feature space considers the logarithms of their ratios (a sketch of the feature map follows)
- The n² components of Φ(x) can be visualized on a plot with n bars
- LogRatio centring also makes it insensitive to attribute normalization
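A minimal sketch of what such a feature map could look like under my reading of the slide (one component log(x_i / x_j) per attribute pair); the centring step mentioned above is not reproduced here, and this is not the thesis implementation.

```python
import numpy as np

def logratio_features(x):
    """Explicit LogRatio feature map: one component log(x_i / x_j) per
    ordered attribute pair (i, j). Assumes strictly positive intensities."""
    logx = np.log(np.asarray(x, dtype=float))
    return (logx[:, None] - logx[None, :]).ravel()  # log(x_i) - log(x_j)

def logratio_kernel(x, z):
    """Kernel value <Phi(x), Phi(z)> computed via the explicit map
    (fine for illustration; a closed form over log(x) and log(z) exists)."""
    return float(np.dot(logratio_features(x), logratio_features(z)))
```

Note that scaling an instance x by a constant leaves every ratio x_i/x_j unchanged, which is the insensitivity to instance normalization claimed above.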
Learning - FS (6/10): LogRatio Kernel (2/6)
- The SVM decision function can be simplified as in the linear SVM case
- A vector W stores the model, as in the linear SVM case
- Φ(W) is a normal vector of the separating hyperplane in the feature space
- → attributes can be ranked according to Φ(W)
Learning - FS (7/10): LogRatio Kernel (3/6)
- Least significant attribute: the one with the median log(W) value, because it minimizes the absolute distance to all the others
- → Rank attributes according to their distance to the median (see the sketch below)
- Perform RFE to improve the ranking → logRatio-SVMRFE
[Figure: Φ(W) plotted over log(W) for attributes 1, 2, 3, ..., n]
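A minimal sketch of this ranking and of the RFE loop built on it; `log_w_of` is a hypothetical callback standing in for "retrain the logRatio-SVM on the remaining attributes and return log(W)", and removing one attribute per step is my assumption (the slides do not state the elimination step size).

```python
import numpy as np

def rank_attributes(log_w):
    """Rank attributes by their distance to the median of log(W): the
    attribute closest to the median is the least significant."""
    log_w = np.asarray(log_w, dtype=float)
    dist = np.abs(log_w - np.median(log_w))
    return np.argsort(dist)[::-1]  # most significant first

def logratio_svm_rfe(log_w_of, attributes, n_keep=10):
    """Recursive feature elimination: repeatedly retrain and drop the
    attribute whose log(W) is closest to the median, until n_keep remain."""
    attrs = list(attributes)
    while len(attrs) > n_keep:
        log_w = np.asarray(log_w_of(attrs), dtype=float)
        drop = int(np.argmin(np.abs(log_w - np.median(log_w))))
        attrs.pop(drop)
    return attrs
```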
Learning - FS (8/10): LogRatio Kernel (4/6)
- logRatio-SVMRFE appears good at removing redundant attributes because of the pair-wise attribute combination
- Removing redundant attributes is not an easy problem
- Attributes that express the same information in different units can be safely removed
- In the presence of noise, attributes that reflect n measurements of the same information are complementary (by computing their mean, we improve the precision of the measure)
- Redundancy is an important concept in expression analysis: it is a sign of protein interaction
- logRatio-SVMRFE can distinguish between the two types of redundancy
Learning - FS (9/10): LogRatio Kernel (5/6)
- Data: the Iris dataset augmented by duplicating its attributes so that each value is expressed both in cm and in mm
- logRatio-SVMRFE ranks first one attribute per piece of information (petal/sepal length/width)
- linear-SVMRFE ranks the same information first twice, once expressed in cm and once in mm
- Noise added only on the cm attributes
- logRatio-SVMRFE ranks all mm attributes first
- linear-SVMRFE ranks both cm and mm attributes first
- → Perfect behaviour of logRatio-SVMRFE
Learning - FS (10/10): LogRatio Kernel (6/6)
- Concerning classification performance, the logRatio-SVM classifier performs similarly to linear-SVM on the MS datasets, and better on text mining.
- Essentially because of the log transformation
- But logRatio-SVMRFE seems weaker than linear-SVMRFE in terms of the stability of the selected attributes
- inadequate model interpretation?
- should more RFE steps be included?
Conclusions
- Facilitate and reduce the MS preprocessing effort
- A FS method focusing on attribute redundancy and protein-protein interactions
- Relative data representations help reach those goals
- peak picking
- logRatio kernel feature selection
- The role of stability in data reduction and simplification
- Preprocessing
- Learning
Perspectives
- Extensions of the preprocessing to LC-MS data
- Harder alignment problems
- More complex reproducibility problems
- Need for real-time processing and learning
- Further integration of preprocessing into learning
- We investigated tuning preprocessing parameters according to classification performance [Pro04]
- Preprocessing should not be separated from learning: investigate the integration of unsupervised and supervised methods to automatically preprocess the information with respect to the classification objective
- Is it possible to extend the logRatio kernel s.t. the feature space can be visualized in a 2D (or nD) space instead of 1D? This would bring enhanced information on attribute redundancy.
PhD Status
- Problems
- No article yet on the logRatio kernel
- Additional experiments
- compare the preprocessing to other approaches
- PhD Writing
- estimation: 1/3 written (corrections not included)
- Oral Exam
- subject: Error Evaluation? ICLP?
- date: Sep?
- PhD Defense
- expected date: end of 2007
References
- [KIS07] A. Kalousis, J. Prados, and M. Hilario. Stability of feature selection algorithms. Knowledge and Information Systems, 2007.
- [JChromatB06] P. Zerefos, J. Prados, S. Garbis, A. Kalousis, and A. Vlahou. Urinary protein profiling by MALDI-TOF-MS: tackling sample preparation and bioinformatics issues. Journal of Chromatography B, 2006.
- [CBMS06] J. Prados, A. Kalousis, and M. Hilario. On preprocessing of SELDI-MS data and its evaluation. CBMS, 2006.
- [ICDM05] A. Kalousis, J. Prados, and M. Hilario. Stability of feature selection algorithms. ICDM, 2005.
- [PKDD05] A. Kalousis, J. Prados, E. Rexhepaj, and M. Hilario. Feature extraction from mass spectra for classification. In 6th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2005.
- M. Hilario, A. Kalousis, J. Prados, and P. A. Binz. Data mining for mass spectra-based cancer diagnosis and biomarker discovery. Drug Discovery Today: Biosilico, 2(5):214-222, September 2004.
- [ICTAI04] A. Kalousis, J. Prados, J. C. Sanchez, L. Allard, and M. Hilario. Distilling classification models from cross-validation runs: an application to mass spectrometry. ICTAI, 2004.
- [Pro04] J. Prados, A. Kalousis, L. Allard, O. Carrette, J. C. Sanchez, and M. Hilario. Mining mass-spectra for diagnosis and biomarker discovery of cerebral accidents. Proteomics, 4:2320-2332, 2004.
Preprocessing: Peak Picking & Peak Area (1)
- Definition: the left valley vl of a point p is the minimum-intensity point located on the left of p s.t. there is no point with intensity greater than p between vl and p (idem for the right valley).
- Peaks: points with a high valley depth
Preprocessing: Peak Picking & Peak Area (2)
- Peak area
- approximated by the area of the triangle obtained after fitting a two-part piecewise linear model to the peak (a sketch follows this list)
- Benefits
- Parameter-less
- Feedback on the confidence in a peak
- No assumption on peak width
- Doesn't require baseline correction
- Doesn't require signal smoothing
- Extension to 2D peak detection possible (LC-MS)
- Fast: O(n)
- Drawback
- Cannot resolve overlapping peaks
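One way to read this triangle approximation (my interpretation, not the thesis code): fit one straight line to the rising flank and one to the falling flank between the two valleys, take the apex as the intersection of the two lines, and use the area of the triangle spanned by the two valley points and that apex.

```python
import numpy as np

def triangle_peak_area(x, y, left, peak, right):
    """Approximate the area of the peak at index `peak`, bounded by its
    left and right valley indices, by a fitted triangle."""
    # fit the two flanks (slope, intercept); flanks assumed non-parallel
    a1, b1 = np.polyfit(x[left:peak + 1], y[left:peak + 1], 1)
    a2, b2 = np.polyfit(x[peak:right + 1], y[peak:right + 1], 1)
    # apex of the triangle: intersection of the two fitted lines
    xa = (b2 - b1) / (a1 - a2)
    ya = a1 * xa + b1
    # shoelace formula over (left valley, apex, right valley)
    xs = np.array([x[left], xa, x[right]])
    ys = np.array([y[left], ya, y[right]])
    return 0.5 * abs(np.dot(xs, np.roll(ys, -1)) - np.dot(ys, np.roll(xs, -1)))
```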