Clustering of Gene Expression Time Series with Conditional Random Fields


Title: Clustering of Gene Expression Time Series with Conditional Random Fields


1
Clustering of Gene Expression Time Series with
Conditional Random Fields
  • Yinyin Yuan and Chang-Tsun Li
  • Computer Science Department

2
Microarray and Gene Expression
  • A microarray is a high-throughput technique that
    can assay the expression levels of a large
    number of genes in a tissue.
  • The gene expression level is the relative amount
    of mRNA produced at a specific time point and
    under certain experimental conditions.
  • Microarrays thus provide a means to decipher the
    logic of gene regulation by monitoring the
    expression of all genes in a tissue.

3
Gene Expression
  • Gene expression data are obtained from
    microarrays and organized into a gene expression
    matrix, which is then analyzed with various
    methods for medical and biological purposes.

4
Gene Expression Time Series
  • A sequence of gene expression levels measured at
    successive time points, at either uniform or
    uneven time intervals.
  • Reveals more information than static data
    because successive points in a time series are
    strongly correlated.

Time Series Clustering
  • Assumption: co-expression indicates
    co-regulation, so clustering identifies genes
    that share similar functions.

5
Probabilistic models
  • A key challenge of gene expression time series
    research is the development of efficient and
    reliable probabilistic models
  • Allow measurements of uncertainty
  • Give analytical measurement of the confidence of
    the clustering result
  • Indicate the significance of a data point
  • Reflect temporal dependencies in the data points

6
Goal
  • Identify highly informative genes
  • Cluster genes in the dataset
  • GO (Gene Ontology) analysis of biological
    function for each cluster.

7
HMMs and CRFs
  • HMMs versus CRFs
  • HMMs are trained to maximize the joint
    probability of a set of observed data and their
    corresponding labels.
  • Independence assumptions are needed in order to
    be computationally tractable.
  • Representing long-range dependencies between
    genes and gene interactions is computationally
    intractable.

8
Conditional Random Fields
  • CRFs are undirected graphical models that define
    a probability distribution over the label
    sequences, globally conditioned on a set of
    observed features.
  • X = (x_1, x_2, ..., x_n): variable over the
    observations
  • Y = (y_1, y_2, ..., y_n): variable over the
    corresponding labels
  • Observed data x_j and class labels y_j for all j
    in a voting pool N_i for sample x_i

9
CRFs Model
  • The CRFs model can be formulated as follows
  • The CRFs model can be expressed in a Gibbs form
    in terms of cost functions
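A Gibbs form of this kind is typically written as below. This is a generic sketch rather than the presentation's own equation: the cost function symbol $U_i$ and the normalising constant $Z_i$ are assumed notation, while $N_i$ is the voting pool of sample $x_i$ as defined earlier.

```latex
p(y_i \mid x_i, x_{N_i}, y_{N_i})
  = \frac{1}{Z_i} \exp\bigl(-U_i(y_i \mid x_i, x_{N_i}, y_{N_i})\bigr),
\qquad
Z_i = \sum_{y'} \exp\bigl(-U_i(y' \mid x_i, x_{N_i}, y_{N_i})\bigr)
```

Under this form a lower cost corresponds to a higher conditional probability for the label.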

10
Cost function
  • The conditional random field model can also be
    expressed in a Gibbs form in terms of cost
    functions
  • Cost function

11
Potential function
  • Real-valued potential functions are obtained and
    used to form the cost function
  • D: the estimated threshold dividing the set of
    Euclidean distances into intra- and inter-class
    distances
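One way to picture the role of the threshold D is the minimal sketch below. The function names and the rule "potential = D minus distance" are illustrative assumptions, not the formula from the slide: a pair closer than D gets a positive value (it looks intra-class), and a pair farther apart than D gets a negative value (it looks inter-class).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def potential(a, b, threshold_d):
    """Hypothetical real-valued potential: positive when the pair is closer
    than the threshold D, negative when it is farther apart."""
    return threshold_d - euclidean(a, b)

x_i = [0.0, 0.0]
near = [0.3, 0.4]   # Euclidean distance 0.5 from x_i
far = [3.0, 4.0]    # Euclidean distance 5.0 from x_i
D = 2.0             # assumed threshold, picked by hand for this example

print(potential(x_i, near, D))  # positive: treated as an intra-class pair
print(potential(x_i, far, D))   # negative: treated as an inter-class pair
```

How D is estimated from the data is not covered by this sketch.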

12
Finding the optimal labels
  • We adopt deterministic label selection; the
    optimal label is determined by
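Deterministic selection amounts to picking the label with the lowest cost instead of sampling from the distribution. A minimal sketch follows, assuming a cost that sums hypothetical pairwise potentials over the voting pool; `label_cost` and `select_label` are illustrative names, and the cost actually used in the presentation is not reproduced here.

```python
def label_cost(label, pool_labels, pool_potentials):
    """Hypothetical cost of assigning `label` to a sample.

    pool_labels[j] is the current label of pool member j, and
    pool_potentials[j] is a real-valued potential that is positive for
    similar pairs and negative for dissimilar ones. Agreeing with a similar
    member, or disagreeing with a dissimilar one, lowers the cost.
    """
    cost = 0.0
    for other_label, pot in zip(pool_labels, pool_potentials):
        cost += -pot if other_label == label else pot
    return cost

def select_label(candidate_labels, pool_labels, pool_potentials):
    """Deterministic selection: return the candidate with minimum cost."""
    return min(candidate_labels,
               key=lambda lab: label_cost(lab, pool_labels, pool_potentials))

pool_labels = [0, 1, 1]
pool_potentials = [1.5, -0.5, -2.0]  # one similar member, two dissimilar ones
print(select_label([0, 1], pool_labels, pool_potentials))  # prints 0
```

Here label 0 wins because the sample agrees with its one similar pool member and disagrees with the two dissimilar ones.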

13
Pre-processing
  • Linear warping for data alignment
  • Data with t time points are transformed into a
    (t-1)-dimensional feature space
  • Differences between consecutive time points,
    scaled inversely to the time intervals, are used
    as features because they reflect the temporal
    structure of the time series.
  • Each voting pool keeps the most similar sample,
    the most different sample, and k-2 randomly
    selected samples.
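The feature transform and the voting pool rule can be sketched as follows. This is a minimal illustration under stated assumptions: "inversely proportional to time intervals" is read as dividing each difference by its time interval, similarity is measured by Euclidean distance, and the function names are hypothetical. Linear warping is not shown.

```python
import math
import random

def difference_features(values, times):
    """Map t time points to t-1 features: consecutive differences divided
    by the time interval between the two measurements."""
    return [(values[k + 1] - values[k]) / (times[k + 1] - times[k])
            for k in range(len(values) - 1)]

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def build_voting_pool(i, features, k, rng):
    """Pool for sample i: the most similar sample, the most different
    sample, and k-2 others drawn at random (needs 2 <= k <= n-1)."""
    def dist(j):
        return euclidean(features[i], features[j])
    others = [j for j in range(len(features)) if j != i]
    nearest = min(others, key=dist)
    farthest = max((j for j in others if j != nearest), key=dist)
    rest = [j for j in others if j not in (nearest, farthest)]
    return [nearest, farthest] + rng.sample(rest, k - 2)

# Uneven sampling: the last interval is twice as long as the others.
print(difference_features([1.0, 2.0, 4.0, 4.0], [0, 1, 2, 4]))  # [1.0, 2.0, 0.0]

features = [[0.0], [0.1], [5.0], [1.0], [2.0]]
pool = build_voting_pool(0, features, k=4, rng=random.Random(0))
print(pool)  # starts with 1 (nearest) and 2 (farthest), then two random others
```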

14
Process
  • Initialization
  • Each sample is assigned a random label
  • Voting pools are formed randomly
  • Samples interact with each other progressively
    via their voting pools
  • Update labels
  • Update voting pools
  • Repeat until the labels are steady
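The loop above can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' implementation: the cost uses a hypothetical potential (threshold D minus Euclidean distance), labels are updated in place one sample at a time, each pool refresh keeps its most similar and most different members and redraws the rest, and "steady" is taken to mean that a full sweep changes no label.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def refresh_pool(i, pool, features, k, rng):
    """Keep the most similar and most different members, redraw the other k-2."""
    def dist(j):
        return euclidean(features[i], features[j])
    nearest = min(pool, key=dist)
    farthest = max((j for j in pool if j != nearest), key=dist)
    candidates = [j for j in range(len(features))
                  if j not in (i, nearest, farthest)]
    return [nearest, farthest] + rng.sample(candidates, k - 2)

def cluster(features, n_labels, k, threshold_d, max_sweeps=50, seed=0):
    """Iterative labelling sketch; requires 2 <= k <= len(features) - 1."""
    rng = random.Random(seed)
    n = len(features)
    # Initialization: random labels and random voting pools.
    labels = [rng.randrange(n_labels) for _ in range(n)]
    pools = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            def cost(label):
                # Hypothetical cost: agreeing with close pool members and
                # disagreeing with distant ones lowers the total.
                total = 0.0
                for j in pools[i]:
                    pot = threshold_d - euclidean(features[i], features[j])
                    total += -pot if labels[j] == label else pot
                return total
            best = min(range(n_labels), key=cost)  # update label
            if best != labels[i]:
                labels[i] = best
                changed = True
            pools[i] = refresh_pool(i, pools[i], features, k, rng)  # update pool
        if not changed:  # steady: no label changed during a full sweep
            break
    return labels

features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
            [5.0, 5.1], [5.1, 5.0], [5.2, 5.1]]
labels = cluster(features, n_labels=2, k=3, threshold_d=2.0)
print(labels)
```

The fixed seed makes a run reproducible; different seeds give different random initial labels and pools.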

15
Experimental Validation
  • Both a biological dataset and a simulated
    dataset are used
  • Adjusted Rand index: a similarity measure between
    two partitions
  • Yeast galactose dataset
  • Gene expression measurements of galactose
    utilization in Saccharomyces cerevisiae
  • Subset of measurements of 205 genes whose
    expression patterns reflect four functional
    categories in the Gene Ontology (GO) listings
  • 4 repeated measurements across 20 time points
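The adjusted Rand index compares two partitions by counting pairs of items that are grouped together in both, corrected for chance agreement. A minimal sketch using only the standard library follows; the function name is illustrative, and returning 1.0 in the degenerate case is a convention chosen for this sketch.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs grouped together in both partitions (contingency table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # e.g. both partitions are a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Same grouping with label names swapped: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Groupings that cut across each other score below zero.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

A value of 1 means identical partitions, and values near 0 mean agreement no better than chance.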

16
Results for Yeast galactose dataset
  • The four functional categories of the yeast
    galactose dataset

Experimental results on Yeast galactose dataset
We obtained an average Rand index value of 0.943
over 10 experiments, higher than the 0.7 reported
by Tjaden et al. (2006).

17
Simulated Dataset
  • Data are generated for 400 genes across 20 time
    points from six artificial patterns that model
    periodic, up-regulated and down-regulated gene
    expression profiles.
  • High Gaussian noise is added.
  • Perfect partitions are obtained with 10 iterations
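A dataset of this shape can be generated along the following lines. The six pattern functions and the noise standard deviation are hypothetical stand-ins, since the presentation does not list the actual patterns or the noise level; only the sizes (400 genes, 20 time points, six patterns) come from the slide.

```python
import math
import random

# Six hypothetical base patterns on t in [0, 1]: two periodic,
# two up-regulated, two down-regulated.
PATTERNS = [
    lambda t: math.sin(2 * math.pi * t),
    lambda t: math.cos(2 * math.pi * t),
    lambda t: 2 * t - 1,
    lambda t: 2 * math.sqrt(t) - 1,
    lambda t: 1 - 2 * t,
    lambda t: 1 - 2 * math.sqrt(t),
]

def simulate(n_genes=400, n_points=20, noise_sd=0.5, seed=0):
    """Return (data, truth): noisy profiles and the pattern index of each gene."""
    rng = random.Random(seed)
    ts = [k / (n_points - 1) for k in range(n_points)]
    data, truth = [], []
    for g in range(n_genes):
        c = g % len(PATTERNS)  # cycle through the six patterns
        profile = [PATTERNS[c](t) + rng.gauss(0.0, noise_sd) for t in ts]
        data.append(profile)
        truth.append(c)
    return data, truth

data, truth = simulate()
print(len(data), len(data[0]), sorted(set(truth)))  # 400 20 [0, 1, 2, 3, 4, 5]
```

The `truth` list gives the reference partition against which a clustering result can be scored.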

18
Conclusions
  • A novel unsupervised Conditional Random Fields
    model for efficient and accurate gene expression
    time series clustering
  • All data points are randomly initialized
  • The randomness of the voting pool facilitates
    global interactions

19
Future work
  • Various similarity measures
  • Taking advantage of information from repeated
    measurements
  • Training and testing procedures