Clustering of Gene Expression Time Series with Conditional Random Fields presentation

1
Clustering of Gene Expression Time Series with
Conditional Random Fields

Yinyin Yuan and Chang-Tsun Li
Computer Science Department

2
Microarray and Gene Expression

A microarray is a high-throughput technique that
can assay the expression levels of a large
number of genes in a tissue.
The gene expression level is the relative amount
of mRNA produced at a specific time point under
certain experimental conditions.
Microarrays thus provide a means to decipher the
logic of gene regulation by monitoring the
expression of all genes in a tissue.

3
Gene Expression

Gene expression data are obtained from
microarrays and organized into a gene expression
matrix, which is then analyzed with various
methodologies for medical and biological purposes.

4
Gene Expression Time Series

A sequence of gene expression levels measured at
successive time points, at either uniform or
uneven time intervals.
Reveals more information than static data, because
time series data have strong correlations between
successive points.

Time Series Clustering

Assumption: co-expression indicates
co-regulation, so clustering identifies genes
that share similar functions.

5
Probabilistic models

A key challenge of gene expression time series
research is the development of efficient and
reliable probabilistic models. Such models:
Allow measurement of uncertainty
Give an analytical measure of the confidence in
the clustering result
Indicate the significance of a data point
Reflect temporal dependencies between the data points

6
Goal

Identify highly informative genes
Cluster genes in the dataset
GO (Gene Ontology) analysis of biological
function for each cluster.

7
HMMs and CRFs

HMMs
CRFs
HMMs are trained to maximize the joint
probability of a set of observed data and their
corresponding labels.
Independence assumptions are needed for the model
to be computationally tractable.
Representing long-range dependencies between
genes and gene interactions is computationally
infeasible.

8
Conditional Random Fields

CRFs are undirected graphical models that define
a probability distribution over the label
sequences, globally conditioned on a set of
observed features.

X = (x1, x2, ..., xn): variable over the
observations
Y = (y1, y2, ..., yn): variable over the
corresponding labels
Observed data xj and class labels yj for all j
in a voting pool Ni for sample xi

9
CRFs Model

The CRF model can be formulated as follows

The CRF model can be expressed in a Gibbs form
in terms of cost functions
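
For orientation, a Gibbs-form conditional distribution over a single label is conventionally written as below. This is the generic textbook form and not necessarily the exact expression shown on the slide; the symbols U_i (cost function) and Z_i (normalizing constant) are assumed names, while y_i, X and the voting pool N_i follow the previous slide.

```latex
p\left(y_i \mid y_{N_i}, X\right)
  = \frac{1}{Z_i} \exp\left(-U_i\left(y_i \mid y_{N_i}, X\right)\right),
\qquad
Z_i = \sum_{y'} \exp\left(-U_i\left(y' \mid y_{N_i}, X\right)\right)
```

Under this form, the most probable label for sample xi is the one with the lowest cost U_i.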

10
Cost function

The conditional random field model can also be
expressed in a Gibbs form in terms of cost
functions

Cost function

11
Potential function

Real-valued potential functions are obtained and
used to form the cost function

D: the estimated threshold dividing the set of
Euclidean distances into intra- and inter-class
distances
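
One way to read the role of the threshold D is sketched below in Python. The exact potential used on the slide is not in the transcript, so the form here is an assumption: a pair contributes a negative (favourable) cost when two samples closer than D share a label, or when two samples farther apart than D have different labels. The function names `pair_potential` and `local_cost` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_potential(xi, xj, same_label, D):
    """Assumed real-valued potential for one pair of samples.

    Negative when the label relation agrees with the distance:
    same label and d < D, or different labels and d > D.
    """
    d = euclidean(xi, xj)
    return (d - D) if same_label else (D - d)


def local_cost(i, label, features, labels, pool, D):
    """Cost of giving sample i the candidate label, summed over its voting pool."""
    return sum(
        pair_potential(features[i], features[j], labels[j] == label, D)
        for j in pool
    )
```

With this choice, a lower cost means the candidate label is more consistent with the distances to the pool members.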

12
Finding the optimal labels

We adopt deterministic label selection; the
optimal label is determined by

13
Pre-processing

Linear warping for data alignment
Data with t time points are transformed into a
(t-1)-dimensional feature space
Differences between consecutive time points,
scaled inversely by the time intervals, are used
as features because they reflect the temporal
structure of the time series.
The voting pool keeps the most similar sample, the
most different sample and k-2 randomly selected
samples.
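
The feature transform and the voting pool described above can be sketched as follows. This is a minimal illustration under stated assumptions: similarity is taken to be Euclidean distance in feature space, the most similar and most different samples are searched over all other samples, and the names `slope_features` and `build_voting_pool` are hypothetical. Linear warping is not covered here.

```python
import math
import random


def slope_features(values, times):
    """Map t measurements to t-1 features.

    Each feature is the difference between consecutive measurements
    divided by the length of the time interval between them.
    """
    return [
        (values[k + 1] - values[k]) / (times[k + 1] - times[k])
        for k in range(len(values) - 1)
    ]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_voting_pool(i, features, k, rng):
    """Pool of size k for sample i: nearest, farthest and k-2 random others.

    Assumes 2 <= k <= len(features) - 1.
    """
    others = [j for j in range(len(features)) if j != i]
    others.sort(key=lambda j: euclidean(features[i], features[j]))
    nearest, farthest = others[0], others[-1]
    middle = others[1:-1]
    return [nearest, farthest] + rng.sample(middle, k - 2)


# Example: four measurements at uneven time points give three features.
example = slope_features([1.0, 3.0, 3.0, 0.0], [0, 1, 3, 4])
```

Passing a `random.Random` instance as `rng` keeps the random part of the pool reproducible.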

14
Process

Initialization
Each sample is assigned a random label
Voting pools are formed randomly
Samples interact with each other via their voting
pools progressively
Update labels
Update voting pools
Repeat until the labels are steady
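
The loop above can be sketched in Python as follows. This is an illustration, not the authors' implementation: the cost function is an assumed distance-versus-threshold form (pairs closer than D are rewarded for sharing a label, pairs farther apart for differing), "steady" is taken to mean a full sweep with no label change, and the names `cluster`, `local_cost` and `refresh_pool` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_cost(i, label, features, labels, pool, D):
    """Assumed cost of giving sample i the candidate label."""
    total = 0.0
    for j in pool:
        d = euclidean(features[i], features[j])
        total += (d - D) if labels[j] == label else (D - d)
    return total


def refresh_pool(i, pool, features, k, rng):
    """Keep the nearest and farthest pool members, redraw the other k-2."""
    by_dist = sorted(pool, key=lambda j: euclidean(features[i], features[j]))
    nearest, farthest = by_dist[0], by_dist[-1]
    candidates = [
        j for j in range(len(features)) if j not in (i, nearest, farthest)
    ]
    return [nearest, farthest] + rng.sample(candidates, k - 2)


def cluster(features, n_labels, k, D, max_iters=50, seed=0):
    """Iterative label update; assumes 2 <= k <= len(features) - 1."""
    rng = random.Random(seed)
    n = len(features)
    # Initialization: random labels and random voting pools.
    labels = [rng.randrange(n_labels) for _ in range(n)]
    pools = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            # Deterministic selection: take the label with the lowest cost.
            best = min(
                range(n_labels),
                key=lambda lab: local_cost(i, lab, features, labels, pools[i], D),
            )
            if best != labels[i]:
                labels[i] = best
                changed = True
            pools[i] = refresh_pool(i, pools[i], features, k, rng)
        if not changed:  # steady: a full sweep without any label change
            break
    return labels
```

For example, `cluster(features, n_labels=2, k=7, D=5.0)` on eight one-dimensional points grouped around 0 and 10 separates the two groups, since every pool then contains all other samples.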

15
Experimental Validation

Both a biological dataset and a simulated dataset
are used
Adjusted Rand index: a similarity measure between
two partitions
Yeast galactose dataset
Gene expression measurements of galactose
utilization in Saccharomyces cerevisiae
Subset of measurements of 205 genes whose
expression patterns reflect four functional
categories in the Gene Ontology (GO) listings
4 repeated measurements across 20 time points
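
For reference, the adjusted Rand index between two partitions can be computed from the pair counts of their contingency table. The sketch below uses the standard definition with only the Python standard library; the function name `adjusted_rand_index` is our own, and returning 1.0 for the degenerate case where the denominator is zero is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same samples.

    Requires at least two samples.
    """
    n = len(labels_a)
    # Pairs of samples placed together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 for identical partitions regardless of how the clusters are named, and close to 0 for partitions that agree no more than expected by chance.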

16
Results for Yeast galactose dataset

The four functional categories of
Yeast galactose dataset

Experimental results on Yeast galactose dataset
We obtained an average Rand index value of 0.943
over 10 experiments, higher than the 0.7 reported
in Tjaden et al. 2006.

17
Simulated Dataset

Data are generated for 400 genes across 20 time
points from six artificial patterns to model
periodic, up-regulated and down-regulated gene
expression profiles.
High Gaussian noise is added.
Perfect partitions are obtained with 10 iterations
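
A dataset of this shape could be generated along the following lines. The six pattern functions, the noise level and the way genes are assigned to patterns are all assumptions made for illustration; the slide does not specify them.

```python
import math
import random

N_GENES, N_TIMES = 400, 20

# Six hypothetical patterns: two periodic, two up-regulated, two down-regulated.
PATTERNS = [
    lambda t: math.sin(2 * math.pi * t / 10),      # periodic
    lambda t: math.cos(2 * math.pi * t / 10),      # periodic, phase-shifted
    lambda t: t / (N_TIMES - 1),                   # linear up-regulation
    lambda t: (t / (N_TIMES - 1)) ** 2,            # accelerating up-regulation
    lambda t: 1 - t / (N_TIMES - 1),               # linear down-regulation
    lambda t: math.exp(-t / 5),                    # decaying down-regulation
]


def simulate(noise_sd=0.5, seed=0):
    """Return (data, truth): noisy profiles and the pattern index of each gene."""
    rng = random.Random(seed)
    data, truth = [], []
    for g in range(N_GENES):
        p = g % len(PATTERNS)  # genes are assigned to patterns cyclically
        profile = [PATTERNS[p](t) + rng.gauss(0, noise_sd) for t in range(N_TIMES)]
        data.append(profile)
        truth.append(p)
    return data, truth
```

The returned `truth` list gives the reference partition against which a clustering result can be scored.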

18
Conclusions

A novel unsupervised Conditional Random Fields
model for efficient and accurate gene expression
time series clustering
All data points are randomly initialized
The randomness of the voting pool facilitates
global interactions
