Analysis of Temporal Patterns in Gene Expression using Mutual Information Approach

Sergei Chumakov1,2, Yuan Sun1,4, Tong-Bin Li1,3, Jorge E. Sánchez Rodríguez2, Arturo Chavez Chavez2, B. Montgomery Pettitt1,4,5, and Yuriy Fofanov1

1Department of Computer Science, University of Houston; 2Physics Department, University of Guadalajara; 3W. M. Keck Center for Computational and Structural Biology; 4Institute for Molecular Design; 5Department of Chemistry, University of Houston
Note that there are two types of outliers here. A few genes (1770, 183, 462, 1771, and 2305) show very high expression levels (>5), many genes show moderate expression (2-4), while most genes are expressed at a lower level (<2). Excluding the five highly expressed genes does not change the figures qualitatively. However, the numerous outliers of the second group destroy the revival of the correlation between time point t=1 and the rest of the points at the cell-cycle period. This does not affect MI, which is determined mainly by the genes expressed at a lower level.
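As a toy illustration of this sensitivity (our own example, with invented gene counts and expression values, not taken from the dataset), the short NumPy sketch below shows how a handful of highly expressed genes can dominate the correlation coefficient between two otherwise weakly related expression vectors; MI, being driven by the bulk of low-level genes, is much less affected.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical expression values for ~2400 weakly related genes at two time points.
    base = rng.normal(1.0, 0.3, 2400)
    x = base + rng.normal(0.0, 0.3, 2400)
    y = 0.1 * base + rng.normal(1.0, 0.3, 2400)   # little genuine dependence on x

    print("CC without outliers:", np.corrcoef(x, y)[0, 1])

    # Add a few highly expressed genes (expression > 5 at both time points).
    x_out = np.concatenate([x, [6.1, 5.8, 7.0, 6.5, 5.5]])
    y_out = np.concatenate([y, [6.3, 5.6, 6.8, 6.1, 5.9]])

    # The correlation coefficient typically rises sharply once the outliers are included.
    print("CC with 5 outliers:", np.corrcoef(x_out, y_out)[0, 1])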
Abstract

An important problem in data analysis is to find relations among the data by determining groups of variables that depend on each other. Dependencies are often nonlinear and may be masked by strong noise. Researchers need to uncover, quantify, and measure these relations. Shannon mutual information (MI) provides one possible way to quantify such relations. In contrast to the correlation coefficient, mutual information is able to recognize nonlinear dependencies and is not sensitive to outliers. However, an ambiguity arises in applying MI to continuous variables when we try to recover the distribution of a continuous variable from a finite number of experimental points. We propose a method to find the optimal discrete approximation of a continuous distribution. As an example, we apply our approach to a publicly available dataset containing a gene expression time series for Saccharomyces cerevisiae [1]. The MI analysis allows us to identify three different time patterns: periodic behavior related to the cell division cycle, descending behavior due to the response to the initial synchronization (such as α-factor arrest), and stationary behavior due to genes not involved directly in the cell division cycle or in the response to the initial synchronization. The difference between MI and correlation analysis is discussed.
MI for Continuous Variables

Consider two random variables x, y with a joint distribution P(x,y) and with marginal distributions

P_x(x) = \int P(x,y) \, dy, \qquad P_y(y) = \int P(x,y) \, dx.

The joint entropy is

S_{xy} = -\int P(x,y) \ln P(x,y) \, dx \, dy.
The marginal entropies are

S_x = -\int P_x(x) \ln P_x(x) \, dx,

and the same for S_y. Mutual information is defined as

M = S_x + S_y - S_{xy}.

This is the information that one can deduce about the value of x by measuring the value of y.

Properties
- 0 \le M \le \min(S_x, S_y).
- M = 0 if the variables x, y are independent.
- If there exists a one-to-one map y = f(x), then S_x = S_y = S_{xy} = M.
- If y = f(x) and there exist values x_1 \ne x_2 for which f(x_1) = f(x_2) = y, then M = S_y < S_x.

To express MI in terms of experimental data, we divide the range of values of x and y into boxes of size \Delta_x \times \Delta_y:

j \Delta_x \le x < (j+1) \Delta_x, \qquad k \Delta_y \le y < (k+1) \Delta_y.

Let N_{kj} be the number of experimental points in box (j,k). Then P_{kj} \approx N_{kj}/N and

S_{xy} \approx -\sum_{k,j} \frac{N_{kj}}{N} \ln \frac{N_{kj}}{N},

with analogous sums for the marginal entropies.
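To make the discretized estimator concrete, the following minimal Python/NumPy sketch (our own illustration, not part of the original poster; the function name and argument names are ours) bins two samples into boxes of size Δx by Δy and evaluates M = S_x + S_y - S_xy.

    import numpy as np

    def mutual_information(x, y, dx, dy):
        """Estimate M = S_x + S_y - S_xy from samples x, y using boxes of size dx x dy."""
        # Bin edges spanning the observed ranges with the chosen box sizes.
        x_edges = np.arange(x.min(), x.max() + dx, dx)
        y_edges = np.arange(y.min(), y.max() + dy, dy)
        # N_kj: number of experimental points in box (j, k).
        N_kj, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
        P_xy = N_kj / N_kj.sum()          # joint distribution estimate
        P_x = P_xy.sum(axis=1)            # marginal over y
        P_y = P_xy.sum(axis=0)            # marginal over x

        def entropy(p):
            p = p[p > 0]                  # convention: 0 ln 0 = 0
            return -np.sum(p * np.log(p))

        return entropy(P_x) + entropy(P_y) - entropy(P_xy)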
Fig. 1. Time points t=17 and t=18 for the α-factor arrest experiment from the yeast dataset: (a) scatter plot of gene expression values (t=17 against t=18), (b) the entropy variation, (c) the joint entropy, and (d) -M as functions of the box sizes.
Application to Yeast Data

Yeast time-series microarray experiments were performed [1] using several methods of synchronization: α-factor arrest, elutriation, arrest of a cdc15 temperature-sensitive mutant, and cdc28. The synchronization experiments produced cell cycle synchrony through one cell cycle in elutriation, two cycles in α-factor arrest, and three cell cycles in cdc15. A common dataset contains 2467 genes. Here we restrict ourselves to the results for α-factor arrest only. In this experiment α factor was added to an asynchronous yeast culture for 120 min. The α factor was then removed, and the arrested cells were placed in a fresh medium. Samples were taken every 7 min for 119 min. We calculated MI for all pairs of time points (i.e., for all pairs of different microarrays); see Fig. 2. One would expect high dependence (correlation) between one time point and its immediate neighbor, Δt = 1. Dependence between a time point and another point two time intervals away, Δt = 2, would be expected to be lower. Plotting MI versus Δt would then produce a typical relaxation curve. However, since the α-factor experiment spans approximately two cell cycles, we see a second, lower hump around Δt = 9. This time interval (9 x 7 min = 63 min) indicates the period of one cell cycle, which is approximately half of the duration of the experiment (119 min). This reaffirms the authors' claim that the α-factor experiment spanned two cell cycles. The correlation coefficient for the same data is shown on the right-hand graph of Fig. 2. Both the MI and CC graphs reveal the periodic behavior due to the cell cycle. It is interesting to note that the correlation graphs for time points t=1 and t=2 drop dramatically with Δt and show virtually no correlation after Δt = 3, whereas the rest of the time points show a periodic trend. On the other hand, the MI graph for point t=1 with the rest of the time points, t=2..18, does reveal the periodic behavior, whereas the MI for the second time point t=2 with the rest of the time points, t=3..18, shows no revival at the cell-cycle period. The analysis of scatter plots of one time point against another (e.g., Fig. 1a) shows that nonlinear dependences cannot play a significant role for these data. Thus, the difference between the MI and CC graphs is caused by outliers.
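For readers who want to reproduce curves of this kind, the sketch below (our own illustration, reusing the mutual_information function from the earlier sketch; the array layout, starting index, and box sizes are assumptions, not values from the poster) computes MI and CC between one starting time point and every later one as a function of the lag Δt.

    import numpy as np

    def mi_curve_from_start(expr, start, dx, dy):
        """MI and CC between time point `start` (column index) and every later
        time point, as a function of the lag dt. Columns of expr are time points."""
        n_t = expr.shape[1]
        lags = list(range(1, n_t - start))
        mi = [mutual_information(expr[:, start], expr[:, start + dt], dx, dy) for dt in lags]
        cc = [np.corrcoef(expr[:, start], expr[:, start + dt])[0, 1] for dt in lags]
        return lags, mi, cc

    # Example usage (hypothetical values): expr is a (2467, 18) array for the
    # alpha-factor experiment; column 0 corresponds to t=1.
    # lags, mi, cc = mi_curve_from_start(expr, start=0, dx=0.3, dy=0.3)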
Fig. 2. Mutual information and correlation coefficient as functions of the time difference between time points.
Introduction

A common way to describe relations among experimental data is the correlation coefficient (CC). The shortcomings of this approach are well known: (1) it is designed to reveal linear dependencies and fails to identify nonlinear ones; (2) it is sensitive to outliers, i.e., exceptional experimental points located apart from the rest of the points. An alternative to CC is the mutual information [2], defined in terms of the entropies of the joint and marginal distributions of experimental values. MI does not favor linear dependencies and is not sensitive to outliers. MI is widely used in bioinformatics; however, it is primarily applied to discrete variables (e.g., statistics of DNA sequences). Applications to continuous variables are few [3] due to an ambiguity which arises here. Indeed, the number of experimental points, N, is always finite. One has to divide the range of values of the continuous variable into boxes and introduce the auxiliary discrete variable N_k, the number of points in a box. The resulting distribution depends on the parameters N, Δx, and Δy (the box sizes for a two-dimensional distribution). This dependence affects the entropy and MI. The problem is how, for a given value of N, to find the optimal box sizes Δx(N), Δy(N). We propose a solution to this problem and apply it to the analysis of gene expression data, considering a publicly available RNA expression dataset from Stanford [1]. It contains microarray data at 18 time points from yeast Saccharomyces cerevisiae cultures synchronized by three different methods: α-factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant. This dataset has been widely investigated, and clusters of genes with similar behavior have been found using CC to measure similarity in gene expression patterns. Clustering using the MI similarity measure has also been performed [3].
Conclusion

We propose a method for the calculation of the mutual information for continuous variables. The method is applied to a gene expression time series for Saccharomyces cerevisiae. It allows us to identify genes involved in the cell division cycle and in the response to the initial synchronization. The difference with the results of common correlation analysis is discussed.
For finite N, Δx, and Δy one can find the discrete estimates of S_x, S_y, S_xy, and M defined above. How does one choose the best Δ for a given N? We assume that halving the box sizes does not significantly affect P(x,y) and S_xy. We choose the box sizes that minimize the corresponding variation of the entropy,

\delta S = S_{xy}(\Delta_x/2, \Delta_y/2) - S_{xy}(\Delta_x, \Delta_y).
References
1. Spellman, P.T., G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell, 9, 3273-3297.
2. Shannon, C.E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379-423; ibid. 623-656.
3. Butte, A.J. and I.S. Kohane (2000). Mutual Information Relevance Networks: Functional Genomics Clustering Using Pairwise Entropy Measurements. Proc. Pacific Symposium on Biocomputing, 5, 415-426.
Upon halving the box sizes, the N_{kj} experimental points in box (k,j) are distributed among the four resulting subboxes as

N_{i,kj} = N_{kj}/4 + d_i(k,j), \qquad i = 1, 2, 3, 4,

and the entropy variation \delta S is evaluated from these subbox counts. In Fig. 1 we show MI between the time points t=17 and t=18 as a function of the box sizes Δx, Δy for the α-factor arrest experiment from the yeast dataset. The upper left graph shows the scatter plot of the N = 2467 gene expression values (t=17 against t=18). The lower left graph shows the entropy variation as a function of the box sizes. The upper and lower right graphs present the joint entropy, S_xy, and -M as functions of the box sizes. The minimum entropy variation point is marked by an asterisk.
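A minimal sketch of this box-size selection rule follows (our own illustration, not the poster's code; the function names, the candidate grid of box sizes, and the choice to minimize |δS| rather than δS itself are our assumptions).

    import numpy as np

    def joint_entropy(x, y, dx, dy):
        """Plug-in estimate of S_xy from boxes of size dx x dy."""
        x_edges = np.arange(x.min(), x.max() + dx, dx)
        y_edges = np.arange(y.min(), y.max() + dy, dy)
        counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log(p))

    def entropy_variation(x, y, dx, dy):
        """delta S: joint entropy at half the box size minus joint entropy at the original size."""
        return joint_entropy(x, y, dx / 2, dy / 2) - joint_entropy(x, y, dx, dy)

    def best_box_size(x, y, candidate_sizes):
        """Pick the (square) box size with the smallest entropy variation, as in Fig. 1(b)."""
        variations = [abs(entropy_variation(x, y, d, d)) for d in candidate_sizes]
        return candidate_sizes[int(np.argmin(variations))]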
Acknowledgements The authors thank NIH, Baylor
College of Medicine SMART Program, TLC2, and the
Keck Center for Computational Biology for support.