Retention Time based Peak Clustering in Comprehensive Two-Dimensional Gas Chromatography

Retention Time based Peak Clustering in
Comprehensive Two-Dimensional Gas Chromatography
A presentation for Multi-Dimensional
Chromatography: Systems, Informatics, and
  • By Shilpa Deshpande, University of Nebraska-Lincoln

Outline of the Presentation
  • Brief Introduction to GCxGC
  • Motivation for Clustering in GCxGC
  • Clustering Overview
  • Implementation and Results
  • Application
  • Challenges
  • Conclusion

Introduction to GCxGC
  • Comprehensive two-dimensional gas chromatography
    (GCxGC) is a new technology for chemical
    separation. The typical system components are
    shown in Figure 1.
  • Retention time: The retention time of a solute
    is the time elapsed between injection of the
    solute and elution of its peak maximum [1].
  • GCxGC images have two axes for the retention
    times (the time required to pass through each of
    the two columns), as shown in Figure 2.

Figure 1: System components for GCxGC with
thermal modulation
Figure 2: GCxGC image
  • The relationship between chemical structure and
    position on the retention time plane provides a
    basis for logical chemical interpretation of
    chromatograms and is an important benefit of
    GCxGC [2].
  • Clustering, the process of grouping similar
    items, can expedite chemical quantification,
    which is of interest to chemists.

Clustering in GC Image
  • In GC Image, templates are patterns of peaks and
    graphic objects observed in image(s) and used to
    recognize similar patterns of peaks in subsequent
    images.
  • Template matching is the process of assigning the
    known information about peaks in one image to
    similar peaks in subsequent images.
  • A peak template is a set of peaks whose metadata
    has been identified. A target peak set is a set
    of peaks whose metadata is to be determined.
  • Retention time based peak clustering can be used
    to automate the construction of peak templates.
Clustering Overview
  • Clustering is the process of grouping data items
    into classes or clusters so that data items
    within a cluster are similar to each other but
    dissimilar to data items in other clusters [4].
  • Clustering is a form of unsupervised learning in
    pattern recognition. Its objective is to "reveal"
    the organization of datapoints into sensible
    clusters (groups), which makes it possible to
    discover similarities and differences among
    datapoints and to derive useful conclusions about
    them [9].

Clustering Tasks
  • Feature selection: Primary and secondary column
    retention time.
  • Proximity measure definition: Euclidean distance.
  • Clustering criterion selection: Minimize or
    maximize the proximity measure.
  • Clustering algorithm definition: Reveal the
    structure of the data.
  • Result validation: Cophenetic correlation.
  • Result interpretation: Mass spec analysis, CLIC.
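
The validation step can be illustrated with a small sketch of the cophenetic correlation: the Pearson correlation between the original pairwise distances and the heights at which each pair of datapoints is first merged in a hierarchy. The slides do not say how the project computes it, so the function names below are hypothetical and single linkage is assumed purely for illustration.

```python
import math
from itertools import combinations


def single_linkage_cophenetic(points):
    """Return {(i, j): merge height} for a single-linkage hierarchy.

    Single linkage is an assumption made for this illustration.
    """
    clusters = [[i] for i in range(len(points))]
    coph = {}
    while len(clusters) > 1:
        # Find the pair of clusters whose nearest members are closest.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(math.dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Every cross pair first shares a cluster at this merge height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = d
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return coph


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cophenetic_correlation(points):
    """Correlate original distances with cophenetic (merge-height) distances."""
    coph = single_linkage_cophenetic(points)
    pairs = sorted(coph)
    original = [math.dist(points[i], points[j]) for i, j in pairs]
    cophenetic = [coph[p] for p in pairs]
    return pearson(original, cophenetic)
```

A value close to 1 suggests that the hierarchy preserves the original pairwise distances well; lower values suggest the hierarchy distorts them.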

Clustering Overview: Clustering Algorithms
  • Partitional: K-means, K-median, K-mode.
  • Each partition has at least one datapoint. Every
    datapoint belongs to one and only one partition.
  • K seeds: The number of clusters needs to be
    specified, as shown in Figure 3.
  • Seed: The seed is the representative of the
    datapoints in its cluster.
  • Iterative relocation scheme: Iteratively reassign
    the datapoints to the seeds to minimize the
    proximity measure between seed and datapoint.
  • K-means is scalable, easy to implement, and
    performs well when clusters are compact clouds.
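
The iterative relocation scheme can be sketched as follows. This is a minimal illustration rather than the project's implementation: it assumes each peak is a pair (primary retention time, secondary retention time), that the caller supplies the initial seeds, and that the proximity measure is the plain Euclidean distance. The function name `kmeans` and its parameters are hypothetical.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Lloyd-style k-means on 2-D points; `seeds` are the initial centers."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each datapoint goes to its nearest seed.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no datapoint changed cluster, so the scheme has converged
        labels = new_labels
        # Update step: move each seed to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old seed if a cluster went empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


peaks = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.1, 7.9), (7.9, 8.2)]
labels, centers = kmeans(peaks, seeds=[peaks[0], peaks[3]])
```

The result depends on the initial seeds, which is one reason the number of clusters and the seeding have to be chosen with care.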

Figure 3: k-means demo [5]

Clustering Overview: Clustering Algorithms
  • Hierarchical: Hierarchical algorithms produce a
    hierarchy of nested clusterings [5], which can be
    represented as a dendrogram.
  • Approaches:
  • Agglomerative: The initial number of clusters
    equals the number of datapoints. The clustering
    produced at each step results from the previous
    one by combining two clusters into one.
  • Divisive: The initial number of clusters is one.
    The clustering produced at each step results from
    the previous one by splitting one cluster into
    two.
  • Advantages of hierarchical clustering:
  • It can handle any form of similarity or distance
    and can be applied to any feature type.
  • Disadvantages of hierarchical clustering:
  • The stopping criterion is not defined.
  • Most hierarchical algorithms do not revisit
    clusters once they are constructed, so a poor
    early merge or split cannot be improved later [6].
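
The agglomerative approach can be sketched as a loop that starts with one cluster per datapoint and repeatedly merges the two closest clusters. Because no stopping criterion is built in, this sketch stops at a caller-supplied number of clusters. The names `agglomerate` and `nearest_pair_distance` are hypothetical, and the default cluster distance (closest pair of members) is only one possible choice.

```python
import math
from itertools import combinations


def nearest_pair_distance(points, a, b):
    """Distance between the closest members of clusters a and b (index lists)."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, num_clusters, cluster_distance=nearest_pair_distance):
    """Merge the two closest clusters until `num_clusters` remain.

    Returns (clusters, merges): the final clusters as lists of point indices
    and the merge history as (height, merged_cluster) tuples.
    """
    clusters = [[i] for i in range(len(points))]  # one cluster per datapoint
    merges = []
    while len(clusters) > num_clusters:
        height, a, b = min(
            (cluster_distance(points, clusters[a], clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        merged = sorted(clusters[a] + clusters[b])
        merges.append((height, merged))
        # Earlier merges are never revisited; only the cluster list shrinks.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerate(points, num_clusters=2)
```

Calling `agglomerate` with different values of `num_clusters` gives the nested clusterings of the same hierarchy, and `merges` records the order in which the dendrogram is built.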

Clustering Overview: Clustering Algorithms
  • Single linkage: The distance between two clusters
    is the distance between their two nearest
    datapoints, one from each cluster, as shown in
    Figure 4. The two clusters at minimum distance
    are merged.
  • Complete linkage: The distance between two
    clusters is the distance between their two
    farthest datapoints, one from each cluster. The
    two clusters at minimum distance are merged.
  • Average linkage: The distance between two
    clusters is the average distance over all pairs
    of datapoints, one from each cluster. The two
    clusters at minimum distance are merged.
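
The three linkage rules differ only in how they reduce the set of pairwise distances between two clusters to a single number. A minimal sketch, with hypothetical function names and clusters given as lists of (primary, secondary) retention time pairs:

```python
import math
from itertools import product


def pair_distances(cluster_a, cluster_b):
    """All distances between one datapoint from each cluster."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Distance between the two nearest datapoints of the clusters."""
    return min(pair_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the two farthest datapoints of the clusters."""
    return max(pair_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Average distance over all cross-cluster pairs of datapoints."""
    distances = pair_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
```

For these two example clusters the cross-cluster distances are 4, 6, 3, and 5, so the single, complete, and average linkage distances are 3, 6, and 4.5 respectively. In every case the hierarchical algorithm then merges the pair of clusters with the smallest linkage distance.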

Figure 4: Single linkage cluster distance

Clustering Overview: Clustering Algorithms
  • Complete linkage with PCA
  • Provides a new proximity measure: area.
  • Figure 5 shows the two-dimensional data in
    retention time space.
  • Figure 6 shows the data in the PCA space; the
    enclosing box shows the area.
  • In principal component analysis, the product of
    the standard deviations along the major and minor
    axes is proportional to the area covered by the
    data.
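
The area measure can be sketched for two-dimensional data without any linear algebra library, because the variances along the principal axes are the eigenvalues of the 2x2 covariance matrix. The function name `pca_area` is hypothetical, and the sample (n - 1) covariance is an assumption. How the project combines this measure with complete linkage is not spelled out in the slides, so the sketch only computes the area of one set of points.

```python
import math


def pca_area(points):
    """Product of the standard deviations along the two principal axes."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix of the 2-D data.
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # The eigenvalues of the covariance matrix are the variances along the
    # major and minor principal axes.
    trace, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    major_var = trace / 2 + root
    minor_var = max(trace / 2 - root, 0.0)
    return math.sqrt(major_var) * math.sqrt(minor_var)
```

Because the principal axes follow the data, the value does not change when the point set is rotated in the retention time plane, and it drops to zero when all points lie on one line.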

Figure 5: Original data
Figure 6: Data in PCA space
(Figures are borrowed from [7].)

Implementation and Results
  • GC Image: This project is implemented in GC
    Image.
  • To cluster selected peaks, in blob mode, select
    "Cluster blobs" from Edit.
  • Cluster Dialog (Figure 7): Clustering algorithm,
    Scaling, Number of clusters, Include Clusters,
    Add to Template.
  • Convex hull: The gift-wrapping convex hull
    algorithm is implemented to create a graphic for
    each cluster.
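
A gift-wrapping (Jarvis march) convex hull can be sketched as below. This is a generic version of the technique, not the GC Image code: the function names are hypothetical, and points are assumed to be (primary, secondary) retention time tuples.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); > 0 means b is left of the ray o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def gift_wrap(points):
    """Convex hull by gift wrapping, returned in counter-clockwise order."""
    pts = sorted(set(points))  # points must be hashable tuples
    if len(pts) < 3:
        return pts
    hull = []
    current = pts[0]  # leftmost (lowest on ties) point is always on the hull
    while True:
        hull.append(current)
        candidate = pts[0] if pts[0] != current else pts[1]
        for p in pts:
            if p == current:
                continue
            turn = cross(current, candidate, p)
            # Take p if it lies to the right of current->candidate, or if it
            # is collinear but farther away (skips points along an edge).
            if turn < 0 or (turn == 0 and dist2(current, p) > dist2(current, candidate)):
                candidate = p
        current = candidate
        if current == pts[0]:
            break  # wrapped all the way around
    return hull


cluster = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
outline = gift_wrap(cluster)
```

The returned vertices can be connected in order to draw a polygon around the peaks of a cluster. Interior peaks and peaks lying along an edge are left out of the outline.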

Figure 7: Cluster Dialog
  • Figure 8 shows a Supelco standard image, which
    has 27 peaks. These peaks are clustered with
    k-means (Figure 9) and with complete linkage with
    PCA (Figure 10).

Figure 8: Peaks selected
Figure 9: k-means clustering results
Figure 10: Complete linkage with PCA clustering

Application: Automated Template Construction
  • Select the blobs of interest in the image.
  • For k-means, input the desired number of
    clusters. For the other clustering algorithms,
    the user can experiment with different numbers of
    clusters and then set the best one as the desired
    number of clusters.
  • The clustering algorithm computes the clusters,
    and the convex hulls of the clusters are drawn as
    graphics.
  • Quantification can be done on a cluster-by-cluster
    basis.
  • Marker peaks are selected from each cluster and
    used to create the peaks in the template.

Application: Automated Template Construction
Figure 11: Peaks selected for template
Figure 12: Blob Set Table
Figure 13: Template construction

Challenges
  • Number of clusters: Choosing it is one of the
    classic problems the clustering community faces.
  • To ease this dilemma, the project has a preview
    option. For hierarchical algorithms, the user can
    try different numbers of clusters with a slider
    and see the result before finally applying the
    desired number of clusters.
Figure 14: Initial Clusters Dialog
Figure 15: 4 clusters
  • Subjectivity: In clustering algorithms, all the
    features are treated the same.
  • For example, in k-means clustering, the squared
    Euclidean distance is calculated from both
    features (primary and secondary column retention
    time).
  • This equation does not have any weighting factor.
    In the GCxGC domain, primary and secondary column
    retention times might have different
    significance.
  • To counter this problem, the project applies a
    weighting factor. In Figure 16, the primary
    column scaling is 2.0.
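
One way to picture the weighting factor is to scale each retention time difference before squaring. The sketch below is an assumption about how such scaling could enter the distance, not the project's exact formula. The function name and the parameters `first_scale` and `second_scale` are hypothetical.

```python
def scaled_sq_distance(peak_a, peak_b, first_scale=1.0, second_scale=1.0):
    """Squared Euclidean distance with a scaling factor on each retention time.

    Each peak is a (primary retention time, secondary retention time) pair.
    """
    d1 = first_scale * (peak_a[0] - peak_b[0])
    d2 = second_scale * (peak_a[1] - peak_b[1])
    return d1 * d1 + d2 * d2


a, b = (10.0, 2.0), (13.0, 6.0)
unscaled = scaled_sq_distance(a, b)                  # 3^2 + 4^2
scaled = scaled_sq_distance(a, b, first_scale=2.0)   # (2 * 3)^2 + 4^2
```

With a primary column scaling of 2.0, differences along the first column count four times as much in the squared distance, so peaks that are far apart in primary retention time are less likely to end up in the same cluster.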

Figure 16: First column scaling 2.0
  • Choice of clustering algorithm: Numerous
    clustering algorithms are available, categorized
    as partitional, hierarchical, density-based, and
    grid-based.
  • Different types of clustering algorithms can
    extract different clusters from the same data.
  • The user has to understand the different outputs
    these clustering algorithms produce before
    selecting one.
  • In this project, several clustering algorithms
    are implemented, as shown in Figure 17. The user
    can also undo a clustering and cluster the same
    set of data again with a different algorithm.

Figure 17: Different clustering algorithms

Conclusion
  • Retention time based peak clustering in
    comprehensive two-dimensional gas chromatography
    is a new stream of research. Clustering in one-
    and two-dimensional gas chromatography has
    usually been done in the mass spec domain [8].
  • Clustering approaches in one- or two-dimensional
    gas chromatography usually use principal
    component analysis to reduce the size of the data
    and then apply a clustering algorithm in the
    reduced space.
  • This project used an algorithm that performs
    clustering with principal component analysis to
    obtain natural clusters.
  • This project proposes to use these clusters to
    construct templates automatically.
  • The choice of clustering algorithm and the user
    parameters play an important role in extracting
    clusters, which can be validated by experts or by
    other means such as a library search of the mass
    spec image.

References
  • [1] Retention-time, http//www.chromatography-onl retention/time.html
  • [2] R. B. Gaines, G. S. Frysinger, M. S.
    Hendrick-Smith, and J. D. Stuart, "Oil spill
    source identification by comprehensive
    two-dimensional gas chromatography," Environ.
    Sci. Technol., vol. 33, pp. 2106-2112, 1999.
  • [3] GC Image, http//
  • [4] J. Han and M. Kamber, Data Mining: Concepts
    and Techniques. San Francisco: Morgan Kaufmann
    Publishers, 2001.
  • [5] A. W. Moore, K-means and Hierarchical
    Clustering Slides, http//
    orials, Carnegie Mellon University.
  • [6] P. Berkhin, Survey of Clustering Data Mining
    Techniques. Accrue Software, Inc.
  • [7] L. I. Smith, A Tutorial on Principal
    Components Analysis, 2002.
  • [8] K. Pierce, J. Hope, K. Johnson, B. Wright,
    and R. Synovec, "Classification of gasoline data
    obtained by gas chromatography using a piecewise
    alignment algorithm combined with feature
    selection and principal component analysis,"
    Journal of Chromatography A, vol. 1096, pp.
    101-110, 2005.
  • [9] S. Theodoridis and K. Koutroumbas, Pattern
    Recognition. Academic Press, 1998.

Extra Slides - Results
Figure 18: Single linkage clustering
Figure 19: Complete linkage with PCA clustering
Euclidean Distance