Preprocessing and Biomarker Extraction Methods in Mass Spectrometry



1
Preprocessing and Biomarker Extraction Methods in
Mass Spectrometry
  • Julien PRADOS
  • Centre Universitaire d'Informatique
  • Université de Genève
  • Geneva Artificial Intelligence Laboratory
  • 15-06-2007

2
Introduction (1/2): Mass Spectrometry (MS)
  • SELDI-MS is a technology whose signal (a mass
    spectrum) reflects the expression of the proteins
    present in a specimen
  • The intensity (area) of a peak is linked to the
    quantity of the protein in the specimen
  • x-axis: mass (/charge) of the proteins
  • y-axis: number of proteins with mass x


[Figure: example spectra with Protein A (mass 4770.7 Da) and Protein B (mass 9226.5 Da); Specimen 1: positive, Specimen 2: negative]
3
Introduction (2/2): MS Learning
  • Goals
  • Diagnosis: recognize the group of a specimen
    from its mass spectrum (classification)
  • Biomarker extraction: find a small number of
    proteins whose expression allows the diagnosis of
    the specimens (feature selection)
  • General context: knowledge extraction from
    signals → potential impact on
  • image analysis, video analysis, audio analysis,
    seismologic data, cardiac data, climatic data,
    financial data, ... all digital devices with
    sensors

4
Outline: PhD Contributions
  • Preprocessing of Mass Spectra
  • Peak Picking
  • Peak Alignment
  • Reproducibility of MS Technology
  • Preprocessing Evaluation
  • Learning (Feature Selection)
  • Feature Selection Stability
  • logRatio Kernel for SVM-based FS

5
Outline: PhD Contributions
  • Preprocessing of Mass Spectra
  • Peak Picking
  • Peak Alignment
  • Reproducibility of MS Technology
  • Preprocessing Evaluation
  • Learning (Feature Selection)
  • Feature Selection Stability
  • logRatio Kernel for SVM-based FS

6
Preprocessing (1/6): A General ML Issue
  • A preliminary step of learning whose goals are to
  • structure the data
  • clean the information and keep only the useful
    part
  • integrate domain knowledge
  • choose a representation
  • Present in every machine learning problem, it
    has a heavy impact on learning performance and
    takes most of the data miner's time
  • → Much current research indirectly tries to
    facilitate preprocessing
  • learning from complex structures (graphs, trees)
  • kernel methods change the data representation
    just by changing a function

7
Preprocessing (2/6): MS Data Preprocessing
  • The ideal dataset is the protein expression
    matrix: the matrix of the concentrations of the
    proteins in the specimens
  • Its construction from mass spectra involves
    several steps:
  • peak picking, peak alignment, baseline
    correction, peak area estimation, noise level
    estimation

8
Preprocessing (3/6): Peak Picking (1/2)
  • [CBMS06]
  • Based on a relative peak characterization and
    the notion of valley
  • A valley of a point p is a minimum intensity
    point located on the left or right of p s.t.
    there is no point with intensity greater than p
    between them
  • Peaks: the points with the deepest valleys

[Figure: valley depth of a point in a spectrum]
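The valley-depth idea can be coded directly from the definition above. The following is a hypothetical, naive O(n²)-worst-case illustration, not the thesis implementation (which achieves O(n)); it also assumes a plausible scoring, namely that a point's depth is its intensity minus the higher of its two valleys:

```python
def pick_peaks(signal, min_depth=0.0):
    """Score each point of a spectrum by its valley depth.

    For a point p, the left (right) valley is the minimum intensity
    encountered while scanning left (right) until a point higher than p
    is met.  Assumed depth score: p minus the higher of the two valleys.
    """
    n, peaks = len(signal), []
    for i in range(n):
        left_valley = signal[i]
        j = i - 1
        while j >= 0 and signal[j] <= signal[i]:   # stop at a higher point
            left_valley = min(left_valley, signal[j])
            j -= 1
        right_valley = signal[i]
        j = i + 1
        while j < n and signal[j] <= signal[i]:
            right_valley = min(right_valley, signal[j])
            j += 1
        depth = signal[i] - max(left_valley, right_valley)
        if depth > min_depth:
            peaks.append((i, depth))
    return peaks
```

Apart from an optional depth threshold the procedure needs no parameter, which matches the "parameter-less" claim on the next slide.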
9
Preprocessing (4/6): Peak Picking (2/2)
  • Benefits
  • Parameter-less
  • Provides a confidence measure for each peak
  • No assumption on peak width
  • Robust to baseline correction
  • Doesn't require signal smoothing
  • Extends to 2D peak detection (LC-MS images),
    even nD peak detection (object detection in
    videos)
  • Fast: O(n)
  • Drawback
  • Cannot resolve overlapping peaks

10
Preprocessing (5/6): MS Experiment Alignment
  • Done by agglomerative hierarchical clustering
    with constraints:
  • at each iteration, merge the closest clusters
  • except if the new cluster would contain two peaks
    from the same spectrum
  • or if the peak error in the new cluster would
    exceed a maximum value
  • Two strategies for missing values:
  • set missing values to zero, because there is no
    peak
  • retrieve the signal intensity from the spectra
    (not obvious for peak areas)
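The constrained agglomerative procedure can be sketched as follows. This is a hypothetical greedy illustration over 1D masses (merging only mass-adjacent clusters), not the thesis code, and it assumes the "peak error" of a cluster is its mass spread:

```python
def align_peaks(peaks, max_error):
    """Align peaks across spectra.

    peaks: list of (spectrum_id, mass) pairs from all spectra.
    Starts from singleton clusters sorted by mass and repeatedly merges
    the closest pair of mass-adjacent clusters, unless the merge would
    put two peaks of the same spectrum in one cluster, or the cluster's
    mass spread (the assumed "peak error") would exceed max_error.
    """
    clusters = [[p] for p in sorted(peaks, key=lambda p: p[1])]

    def mean_mass(c):
        return sum(m for _, m in c) / len(c)

    def mergeable(a, b):
        ids = [s for s, _ in a + b]
        masses = [m for _, m in a + b]
        return (len(ids) == len(set(ids))          # one peak per spectrum
                and max(masses) - min(masses) <= max_error)

    while True:
        best_gap, best_i = None, None
        for i in range(len(clusters) - 1):
            gap = mean_mass(clusters[i + 1]) - mean_mass(clusters[i])
            if mergeable(clusters[i], clusters[i + 1]) and \
                    (best_gap is None or gap < best_gap):
                best_gap, best_i = gap, i
        if best_i is None:                          # no legal merge left
            return clusters
        clusters[best_i] += clusters.pop(best_i + 1)
```

Each resulting cluster then becomes one column of the aligned peak matrix, with one of the two missing-value strategies above filling the absent entries.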

11
Preprocessing (6/6): MS Preprocessing
Generalisation
  • Peak picking: analyse each instance individually
    and identify the elements in the instance
  • Peak alignment: identify the elements of the
    different instances that relate to the same
    information

12
Outline: PhD Contributions
  • Preprocessing of Mass Spectra
  • Peak Picking
  • Peak Alignment
  • Reproducibility of MS Technology
  • Preprocessing Evaluation
  • Learning (Feature Selection)
  • Feature Selection Stability
  • logRatio Kernel for SVM-based FS

13
SELDI-TOF Reproducibility (1/6): Overview
  • The fact: when an experiment is repeated several
    times, the mass spectra look different
  • A general problem in MS: e.g. over 3 LC-MS
    experiments, only 20% of the proteins are
    identified in common!
  • [JChromatB06]: SELDI-TOF reproducibility
  • How reproducible is the SELDI technology?
  • How does the experimental protocol affect
    reproducibility?
  • How similar are the protocols?

14
SELDI-TOF Reproducibility (2/6): Experimental
Setup
  • 1 urine sample (a pool of 4 individuals)
  • 8 experimental protocols with different
  • urine dilutions
  • matrices
  • ultra-filtration methods
  • for each experimental condition: 50 spectra
    (10 × 5 replicates)

15
SELDI-TOF Reproducibility (3/6): Peak Picking and
Alignment
[Figure: peak picking and alignment, applied to each of the 8 protocols]
16
SELDI-TOF Reproducibility (4/6): Results
  • Estimate the percentage of peaks found in at
    least 10/10, 9/10, 8/10, ..., 1/10 replicates

17
SELDI-TOF Reproducibility (5/6): Protocol
Complementarity
  • How similar are the protocols? How do they
    complement each other?
  • → Hierarchical clustering of the methods using
    the Tanimoto distance between the sets of peaks
    found
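The Tanimoto (Jaccard) distance between two peak sets is straightforward; a minimal sketch:

```python
def tanimoto_distance(peaks_a, peaks_b):
    """Tanimoto distance between two sets of (aligned) peak identifiers:
    1 - |A ∩ B| / |A ∪ B|; 0 for identical sets, 1 for disjoint ones."""
    a, b = set(peaks_a), set(peaks_b)
    if not (a | b):          # both empty: define the distance as 0
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

A matrix of these distances between the 8 protocols can then be fed to any standard hierarchical clustering routine.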

18
SELDI-TOF Reproducibility (6/6): Final Comment
  • Reproducibility is a necessary condition but not
    a sufficient one: it is easy to define a very
    reproducible peak picking that would nevertheless
    extract information irrelevant to the diagnostic
    problem
  • → we also need to assess the quality of the
    preprocessed information

19
Outline: PhD Contributions
  • Preprocessing of Mass Spectra
  • Peak Picking
  • Peak Alignment
  • Reproducibility of MS Technology
  • Preprocessing Evaluation
  • Learning (Feature Selection)
  • Feature Selection Stability
  • logRatio Kernel for SVM-based FS

20
Preprocessing Evaluation (1/1): Choosing a Data
Representation
  • [CBMS06]
  • Is information lost in preprocessing?
  • How do the different representations affect the
    information content?
  • → Compare the classification performances of
  • SVM: Support Vector Machines (SMO)
  • Decision Tree (J48)
  • Nearest Neighbor (IBk)
  • on the datasets
  • Stroke (Stk)
  • Prostate Cancer (Pro)
  • Ovarian Cancer (Ova)
  • for different preprocessing options
  • no preprocessing (raw)
  • peak intensity, missing values retrieved from the
    raw data (is)
  • peak intensity, zero for missing values (iz)
  • peak area, zero for missing values (az)

[Figure: classification error for each representation]
21
Outline: PhD Contributions
  • Preprocessing of Mass Spectra
  • Peak Picking
  • Peak Alignment
  • Reproducibility of MS Technology
  • Preprocessing Evaluation
  • Learning (Feature Selection)
  • Feature Selection Stability
  • logRatio Kernel for SVM-based FS

22
Learning - FS (1/10): Problem Presentation (1/1)
  • In preprocessing, we do our best to reduce the
    information and to focus on the relevant part,
    ignoring the groups of specimens (unsupervised)
  • For the biomarker extraction task, the learning
    goal is to further reduce the information, taking
    into account the class labels of the specimens
    (supervised FS)

23
Learning - FS (2/10): Feature Selection Stability
(1/3)
  • In biomarker extraction from MS, the stability of
    the selected features is also crucial
  • it is preferable that a FS method selects the
    same biomarkers when small data perturbations are
    introduced
  • [ICDM05, KIS07]: introduce a framework for the
    estimation of FS stability
  • apply FS in the different folds of a CV to
    guarantee ~22% of non-overlapping instances
  • estimate the mean attribute overlap over the 10
    selected feature sets
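The overlap estimate in this framework can be sketched as follows, using the Jaccard index as one plausible set similarity (the cited papers define and study specific measures in detail):

```python
from itertools import combinations

def fs_stability(selected_sets):
    """Mean pairwise similarity between the feature sets selected in the
    different CV folds; similarity taken here as |A ∩ B| / |A ∪ B|."""
    pairs = list(combinations(selected_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```

A perfectly stable method, selecting the same set in every fold, scores 1; selecting disjoint sets scores 0.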

24
Learning - FS (3/10): Feature Selection Stability
(2/3)
  • With this framework, the stability of FS methods
    can be compared:
  • Information Gain
  • ReliefF
  • linear-SVMRFE
  • linear-SVM (no RFE)
  • Random FS
  • the stability of Random FS can be computed
    theoretically
  • Results on MS:
  • RF > SVMONE > SVMRFE > IG > Random

25
Learning - FS (4/10): Feature Selection Stability
(3/3)
  • Stability is a necessary condition but not a
    sufficient one: the predictive power of the
    selected features has to be verified
  • Results show a general superiority in performance
    of the SVM-based feature sets
  • FS stability + performance → concentrate on
    SVM-based FS

26
Learning - FS (5/10): LogRatio Kernel (1/6)

[Figure: the feature map φ(X) visualized as a bar plot of log(X) over the n attributes, with e.g. component φ(X)_{1,3}]
  • Biological motivations
  • the log of a ratio is often used to compare
    expressions (genomics)
  • the pair-wise combination focuses learning on
    protein-protein interactions
  • insensitive to instance normalization
  • Similar to a polynomial kernel of degree 2, but
    instead of pair-wise products of attributes, the
    LogRatio feature space considers the logarithms
    of their ratios
  • The n² components of φ(x) can be visualized on a
    plot with n bars
  • LogRatio centring also makes it insensitive to
    attribute normalization
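Taking the kernel as the plain inner product of the log-ratio feature maps (a sketch; the thesis formulation may differ, e.g. in the centring), expanding the n² terms gives an O(n) closed form:

```python
import math

def logratio_kernel(x, z):
    """LogRatio kernel as the inner product of the feature maps
    phi(x)_{ij} = log(x_i / x_j) over all n^2 pairs (i, j).
    With a = log x and b = log z, sum_{ij} (a_i - a_j)(b_i - b_j)
    expands to the O(n) closed form  2n <a, b> - 2 sum(a) sum(b)."""
    a = [math.log(v) for v in x]
    b = [math.log(v) for v in z]
    n = len(a)
    return 2 * n * sum(ai * bi for ai, bi in zip(a, b)) - 2 * sum(a) * sum(b)

def logratio_kernel_explicit(x, z):
    """Direct O(n^2) computation over all pairs, to check the closed form."""
    return sum(math.log(xi / xj) * math.log(zi / zj)
               for xi, zi in zip(x, z)
               for xj, zj in zip(x, z))
```

Because log(c·x_i / (c·x_j)) = log(x_i / x_j), scaling an instance by any constant leaves the kernel value unchanged, which is the instance-normalization insensitivity claimed above.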

27
Learning - FS (6/10): LogRatio Kernel (2/6)
  • The SVM decision function can be simplified, as
    in the linear SVM case
  • A vector W stores the model, as in the linear SVM
    case
  • φ(W) is a normal vector to the separating
    hyperplane in the feature space
  • → attributes can be ranked according to φ(W)

28
Learning - FS (7/10): LogRatio Kernel (3/6)
  • Least significant attribute: the one with the
    median log(W) value, because it minimizes the
    absolute distance to all the others
  • → Rank attributes according to their distance to
    the median
  • Perform RFE to improve the ranking →
    logRatio-SVMRFE

[Figure: φ(W) visualized as a bar plot of log(W) over the n attributes]
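The ranking step above can be sketched as a small hypothetical helper (assuming positive weights, so the log is defined):

```python
import math
import statistics

def rank_by_median_distance(w):
    """Rank attribute indices by |log(w_i) - median(log w)|,
    most significant (farthest from the median) first."""
    logs = [math.log(v) for v in w]
    med = statistics.median(logs)
    return sorted(range(len(w)), key=lambda i: abs(logs[i] - med),
                  reverse=True)
```

In an RFE loop, the last indices of this ranking would be the ones eliminated at each step.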
29
Learning - FS (8/10): LogRatio Kernel (4/6)
  • logRatio-SVMRFE appears good at removing
    redundant attributes, because of the pair-wise
    attribute combination
  • Removing redundant attributes is not an easy
    problem:
  • attributes that express the same information in
    different units can be safely removed
  • in the presence of noise, attributes that reflect
    n measures of the same information are
    complementary (by computing their mean, we
    improve the precision of the measure)
  • Redundancy is an important concept in expression
    analysis: it is a sign of protein interaction
  • logRatio-SVMRFE can distinguish between the two
    types of redundancy

30
Learning - FS (9/10): LogRatio Kernel (5/6)
  • Data: the Iris dataset augmented by duplicating
    its attributes, so that each value is expressed
    in both cm and mm
  • logRatio-SVMRFE ranks first one attribute per
    piece of information (petal/sepal length/width)
  • linear-SVMRFE ranks the same information first
    twice: once expressed in cm, once in mm
  • With noise added only to the cm attributes:
  • logRatio-SVMRFE ranks all the mm attributes first
  • linear-SVMRFE ranks cm and mm attributes first
  • → Perfect behaviour of logRatio-SVMRFE

31
Learning - FS (10/10): LogRatio Kernel (6/6)
  • Concerning classification performance, the
    logRatio-SVM classifier performs similarly to
    linear-SVM on the MS datasets, and better on
    text mining
  • essentially because of the log transformation
  • But logRatio-SVMRFE seems weaker than
    linear-SVMRFE in terms of the stability of the
    selected attributes
  • inadequate model interpretation?
  • should more RFE steps be included?

32
Conclusions
  • Facilitate and reduce the MS preprocessing effort
  • An FS method focusing on attribute redundancy and
    protein-protein interactions
  • A relative data representation helps in reaching
    those goals:
  • peak picking
  • logRatio kernel feature selection
  • The role of stability in data reduction and
    simplification:
  • preprocessing
  • learning

33
Perspectives
  • Extension of the preprocessing to LC-MS data
  • harder alignment problems
  • more complex reproducibility problems
  • need for real-time processing/learning
  • Further integration of preprocessing into
    learning
  • we investigated tuning preprocessing parameters
    according to classification performance [Pro04]
  • preprocessing should not be separated from
    learning: investigate the integration of
    unsupervised and supervised methods to
    automatically preprocess the information
    according to the classification objective
  • Is it possible to extend the logRatio kernel s.t.
    the feature space can be visualized in a 2D (or
    nD) space instead of 1D? This would bring
    enhanced information on attribute redundancy

34
PhD Status
  • Problems
  • no article on the logRatio kernel yet
  • Additional experiments
  • compare the preprocessing to other ones
  • PhD Writing
  • estimation: 1/3 written (no corrections included)
  • Oral Exam
  • subject: error evaluation → ICLP?
  • date: Sep?
  • PhD Defense
  • expected date: end of 2007

35
References
  • [KIS07] A. Kalousis, J. Prados, and M. Hilario.
    Stability of feature selection algorithms.
    Knowledge and Information Systems, 2007.
  • [JChromatB06] P. Zerefos, J. Prados, S. Garbis,
    A. Kalousis, and A. Vlahou. Urinary protein
    profiling by MALDI-TOF-MS: tackling sample
    preparation and bioinformatics issues. Journal of
    Chromatography B, 2006.
  • [CBMS06] J. Prados, A. Kalousis, and M. Hilario.
    On preprocessing of SELDI-MS data and its
    evaluation. CBMS, 2006.
  • [ICDM05] A. Kalousis, J. Prados, and M. Hilario.
    Stability of feature selection algorithms. ICDM,
    2005.
  • [PKDD05] A. Kalousis, J. Prados, E. Rexhepaj, and
    M. Hilario. Feature extraction from mass spectra
    for classification. In 6th European Conference on
    Principles and Practice of Knowledge Discovery in
    Databases, 2005.
  • [6] M. Hilario, A. Kalousis, J. Prados, and P. A.
    Binz. Data mining for mass spectra-based cancer
    diagnosis and biomarker discovery. Drug Discovery
    Today: Biosilico, 2(5):214-222, September 2004.
  • [ICTAI04] A. Kalousis, J. Prados, J. C. Sanchez,
    L. Allard, and M. Hilario. Distilling
    classification models from cross-validation runs:
    an application to mass spectrometry. ICTAI, 2004.
  • [Pro04] J. Prados, A. Kalousis, L. Allard, O.
    Carrette, J. C. Sanchez, and M. Hilario. Mining
    mass-spectra for diagnosis and biomarker
    discovery of cerebral accidents. Proteomics,
    4:2320-2332, 2004.

36
Preprocessing: Peak Picking / Peak Area (1)
  • Definition: the left valley v_l of a point p is
    the minimum intensity point located on the left
    of p s.t. there is no point with intensity
    greater than p between v_l and p (idem for the
    right valley)
  • Peaks: points with a high valley depth

37
Preprocessing: Peak Picking / Peak Area (2)
  • Peak area
  • approximated by the area of the triangle obtained
    after fitting piecewise linear models to the two
    sides of the peak
  • Benefits
  • Parameter-less
  • Feedback on the confidence in a peak
  • No assumption on peak width
  • Doesn't require baseline correction
  • Doesn't require signal smoothing
  • Extension to 2D peak detection possible (LC-MS)
  • Fast: O(n)
  • Drawback
  • Cannot resolve overlapping peaks
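The triangle approximation can be sketched as follows; a hypothetical illustration that fits one least-squares line to each flank of the peak and takes the triangle bounded by the two lines and the baseline:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def triangle_peak_area(left_pts, right_pts, baseline=0.0):
    """Fit one line to each flank (lists of (x, y) points); the peak area
    is the triangle area 1/2 * base * height, where the apex is the
    intersection of the two lines and the base lies on the baseline."""
    a1, b1 = fit_line(*zip(*left_pts))
    a2, b2 = fit_line(*zip(*right_pts))
    apex_x = (b2 - b1) / (a1 - a2)           # intersection of the two lines
    apex_y = a1 * apex_x + b1
    x1 = (baseline - b1) / a1                # base endpoints on the baseline
    x2 = (baseline - b2) / a2
    return 0.5 * abs(x2 - x1) * (apex_y - baseline)
```

For a symmetric peak with flank slopes +1 and -1 meeting at height 3 over a base of width 6, this returns an area of 9.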