Gaussian Mixture Density Estimation applied to Microarray Data

Provided by: Wince

C. Steinhoff, T. Müller, U.A. Nuber and M. Vingron
Max Planck Institute for Molecular Genetics, Dept. Computational Molecular Biology
Ihnestr. 73, D-14195 Berlin, Germany

ABSTRACT
Microarray experiments can be used to determine gene expression profiles of specific cell types. In this way, genes of a given cell type might be categorized as active or present. Typically, one would like to infer, for each gene, a probability that it is expressed in a given tissue or experiment. This has mainly been addressed by selecting an arbitrary threshold and defining spot intensities above this threshold as "high signal" and those below it as "low signal".
However, this approach does not yield a probabilistic measure. As a probabilistic model, we propose to fit a mixture of normal distributions to the data and thus to obtain not only an overall description of the data but also a probabilistic framework.

In the literature, many methods are described in which various kinds of distributions are fitted to ratios of microarray data (Ghosh et al., Li et al.). Fitting distributions to the entire dataset of a single sample, rather than to the ratios of two experiment profiles, has not been studied very well. Hoyle et al. propose to fit a log-normal distribution with a tail that is close to a power law. This is, to our knowledge, the only publication in which single-sample intensities are approximated. The use of one specific parameterized distribution, however, poses problems when overall intensities occur in various shapes and, in particular, show non-uniformly shaped tails. These effects can occur for different reasons and cannot be captured by normalization in all cases.
Hoyle et al. used only the most highly expressed genes for the fitting procedure. This reduces the probability of observing a mixture of several densities, but it focuses only on the most highly expressed genes and provides no overall model.

Results
Here, we present examples of microarray data which are unlikely to be properly fitted by a single log-normal distribution as described by Hoyle et al. The data rather appear to be a mixture of different distributions, which can lead to multi-modal shapes of various kinds. Different reasons may account for this:
(1) When relatively small microarrays are used, genes might be over-represented in a specific intensity range, or some kind of truncated data might occur.
(2) Saturation effects could be another reason.
(3) Specific effects which cannot be localized and captured by normalization might lead to varying shapes.
(4) If a microarray design is based on different oligo selection procedures, the resulting intensity distribution could show more than one mode.
Thus we do not assume that there is a single unimodal overall distribution which can explain microarray experiments in general.
One biological motivation for our approach is the selection of a set of genes which are highly expressed in a specific tissue. Furthermore, it might be desirable to give a probability for a gene to be highly expressed. The main intention of this work is to infer a robust probabilistic framework that captures inhomogeneous datasets.
The goal of this study is to model the entirety of data points which do not follow a specific distribution but rather show shapes which might represent several distributions. In fact, what we observe by looking at the whole dataset is a mixture of a number of distributions. In our case, we assume a mixture of normal distributions.
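For reference, the assumed model can be written out as below. This is the standard form of a finite normal mixture; the symbols K (number of components), \pi_k (mixing weights) and \varphi (normal density) are notation chosen here for illustration and are not taken from the poster.

```latex
\begin{align*}
f(x) &= \sum_{k=1}^{K} \pi_k \, \varphi(x \mid \mu_k, \sigma_k),
  \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1, \\
\varphi(x \mid \mu, \sigma) &= \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \\
P(k \mid x) &= \frac{\pi_k \, \varphi(x \mid \mu_k, \sigma_k)}{f(x)} .
\end{align*}
```

The last line is the posterior probability that an observed intensity x was generated by component k, which is the kind of per-gene probability a fixed threshold cannot provide.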
The maximum-likelihood parameters of this probabilistic model are estimated by applying the EM algorithm. The optimal number of densities is estimated using the Bayesian information criterion (BIC).

Software and Data
We used MATLAB (version 6.1.0.450 (R12.1)) to implement the EM algorithm, including the BIC calculation. We consider three different types of datasets in order to examine the performance of our fitting procedure: one dye-swap experiment, a series of Latin square experiments using quantifiable spike-in samples, and simulated datasets.
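The fitting procedure described above (EM for a normal mixture, with the number of components chosen by BIC) can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB implementation: the function names, the quantile-based initialisation, the stopping rule and the BIC convention (-2 log L + p log n, smaller is better) are all assumptions made for this sketch.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def log_likelihood(x, weights, means, sigmas):
    """Log-likelihood of the data under the mixture."""
    dens = weights * normal_pdf(x[:, None], means, sigmas)  # shape (n, k)
    return np.log(np.maximum(dens.sum(axis=1), 1e-300)).sum()

def fit_gmm_em(x, k, n_iter=500, tol=1e-8):
    """Fit a k-component 1-D normal mixture by EM.

    Returns (weights, means, sigmas, loglik)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Initialisation (assumption): means at evenly spaced quantiles,
    # a common standard deviation, equal weights.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigmas = np.full(k, x.std())
    weights = np.full(k, 1.0 / k)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        dens = weights * normal_pdf(x[:, None], means, sigmas)
        total = np.maximum(dens.sum(axis=1), 1e-300)
        resp = dens / total[:, None]
        # M-step: weighted maximum-likelihood updates.
        nk = np.maximum(resp.sum(axis=0), 1e-12)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        sigmas = np.sqrt(np.maximum(var, 1e-8))  # floor avoids collapse to zero
        loglik = log_likelihood(x, weights, means, sigmas)
        if loglik - prev < tol:  # stop when the likelihood no longer improves
            break
        prev = loglik
    return weights, means, sigmas, loglik

def bic(loglik, k, n):
    """BIC with k means, k standard deviations and k - 1 free weights."""
    return -2.0 * loglik + (3 * k - 1) * np.log(n)

def select_k(x, k_max=5):
    """Fit mixtures with 1..k_max components and return the BIC-optimal one."""
    x = np.asarray(x, dtype=float)
    fits = {k: fit_gmm_em(x, k) for k in range(1, k_max + 1)}
    scores = {k: bic(fits[k][3], k, x.size) for k in fits}
    best = min(scores, key=scores.get)
    return best, fits[best], scores
```

Calling `select_k(intensities)` on a vector of normalized intensities returns the chosen number of components, the fitted parameters, and the BIC score for every candidate. EM only finds a local optimum, so in practice several initialisations may be worth comparing.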
Mixture Model
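Once a mixture has been fitted, a per-spot probability of being "highly expressed" can be read off as a posterior component probability. The sketch below is an illustration under stated assumptions: the function name is hypothetical, "highly expressed" is identified with the component that has the largest mean, and the example parameters are those of the simulated three-component mixture, with the second parameter of each normal read as a standard deviation.

```python
import numpy as np

def posterior_high(x, weights, means, sigmas):
    """Posterior probability that each value in x belongs to the
    component with the largest mean (an assumed definition of 'high')."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    w = np.asarray(weights, dtype=float)
    m = np.asarray(means, dtype=float)
    s = np.asarray(sigmas, dtype=float)
    # Weighted component densities, shape (n_points, n_components).
    dens = w * np.exp(-0.5 * ((x[:, None] - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)
    return resp[:, np.argmax(m)]

# Parameters of the simulated mixture (sigma read as standard deviation).
weights = [0.6, 0.3, 0.1]
means = [0.5, 1.5, 3.2]
sigmas = [0.6, 0.4, 0.5]
probabilities = posterior_high([0.0, 3.2, 5.0], weights, means, sigmas)
```

Unlike a fixed threshold, this gives a graded value between 0 and 1 for every spot, and the cut-off can be chosen afterwards on the probability scale.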
Dataset 1: Dye-swap Experiment
We first analyzed data derived from a human X chromosome-specific cDNA array with 4799 spotted cDNA sequences. Labeled cDNAs were co-hybridized on the array and a repeat experiment was performed with the fluorescent dyes swapped. Fluorescence intensities of Cy3 and Cy5 were measured separately at 532 and 635 nm with a laser scanner (428 Array Scanner, Affymetrix). Image processing was carried out using Microarray Suite 2.0 (IPLab Spectrum Software, Scanalytics, Fairfax, VA, USA). Raw spot intensities were locally background subtracted. The intensity values of each co-hybridization experiment were then subjected to variance stabilizing normalization as described in Huber et al.

Dataset 2: Spike-in Data Series
We used the Affymetrix Latin Square dataset, which consists of a series of microarray experiments with genes spiked in at known concentrations and arrayed in a Latin square format (http://www.affymetrix.com/analysis/download_center2.affx). A total of 59 experiments describe three cycles of dilutions ranging from 0 pmol up to 1024 pmol for each of the 14 spiked-in genes. Affymetrix reported problems with two of the 14 genes, so we used only the remaining 12 for our analysis. All 59 experiments were normalized simultaneously using variance stabilization as described in Huber et al.

Dataset 3: Simulated Data
We drew 10^4 samples from a mixture of normal densities. A necessary condition for the proposed estimation procedure is that the underlying model parameters can at least be re-estimated from such samples. We show that our implementation of the EM algorithm can in fact properly infer the underlying model. As an example, we show the performance when sampling from the following mixture model: 0.6 N(0.5, σ = 0.6) + 0.3 N(1.5, σ = 0.4) + 0.1 N(3.2, σ = 0.5). Since there have been discussions that log-microarray data could follow a normal distribution at low intensities and show a gamma-distribution-like tail at high intensities, we also showed that a mixture of a normal and a gamma distribution can be approximated quite well by a mixture of normal distributions. Furthermore, the normal distribution of the underlying model can be properly re-estimated. As an example, we again drew 10^4 samples from the following model: α N(µ, σ) + β F(r, a), where α = 0.6, β = 0.4, µ = 0.5, σ = 1.2, r = 2, a = 3 and F is a gamma distribution with parameters r and a.

Quantile plots: Overview of fitting results
In column A, the logarithm of the normalized intensities (y-axis) of each dataset is plotted against the log-rank of the top 500 spots (x-axis). The solid line marks the fit to Zipf's law. In column B, the logarithm of the normalized intensities (y-axis) of each dataset is plotted against the log-rank of all spots (x-axis). In column C, the estimated quantiles resulting from the optimal fit to a mixture of normal distributions (x-axis) are plotted against the empirical quantiles of each dataset.
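The simulated three-component mixture of Dataset 3 and the quantile comparison of column C can be reproduced in outline as follows. This is a sketch under assumptions: the second parameter of each normal is read as a standard deviation, the sample size is taken as 10^4, the true parameters stand in for the fitted ones, and all function names are hypothetical. Mixture quantiles have no closed form, so they are obtained here by bisection on the mixture CDF.

```python
import numpy as np
from statistics import NormalDist

# Simulated mixture: 0.6 N(0.5, 0.6) + 0.3 N(1.5, 0.4) + 0.1 N(3.2, 0.5),
# with the second parameter read as a standard deviation (assumption).
WEIGHTS = np.array([0.6, 0.3, 0.1])
MEANS = np.array([0.5, 1.5, 3.2])
SIGMAS = np.array([0.6, 0.4, 0.5])

def sample_mixture(n, weights, means, sigmas, seed=0):
    """Draw n values: pick a component per draw, then sample from it."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], sigmas[comp])

def mixture_cdf(x, weights, means, sigmas):
    """CDF of the normal mixture at a scalar x."""
    return sum(float(w) * NormalDist(float(m), float(s)).cdf(x)
               for w, m, s in zip(weights, means, sigmas))

def mixture_quantile(p, weights, means, sigmas, lo=-50.0, hi=50.0):
    """Quantile of the mixture by bisection on its (monotone) CDF."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, means, sigmas) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def qq_points(x, weights, means, sigmas, n_points=99):
    """Model quantiles (x-axis) and empirical quantiles (y-axis)."""
    probs = np.arange(1, n_points + 1) / (n_points + 1)
    empirical = np.quantile(x, probs)
    model = np.array([mixture_quantile(p, weights, means, sigmas) for p in probs])
    return model, empirical

samples = sample_mixture(10_000, WEIGHTS, MEANS, SIGMAS)
model_q, empirical_q = qq_points(samples, WEIGHTS, MEANS, SIGMAS)
```

If the model describes the data well, the points (model_q, empirical_q) lie close to the diagonal. For real data, the true parameters would be replaced by the estimates from the BIC-optimal fit.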
[Figure: Fitting a mixture of normal distributions to dataset 2. Panels: Dataset 2, Dataset 3a, Dataset 3b. Axis labels: empirical relative frequencies; normalised signal intensities; -2 BIC; µ.]

Literature
Ghosh, D. and Chinnaiyan, A.M. (2002) Mixture modelling of gene expression data from microarray experiments. Bioinformatics, 18, 275-286.
Hoyle, D.C., Rattray, M., Jupp, R. and Brass, A. (2002) Making sense of microarray data distributions. Bioinformatics, 18, 576-584.
Li, W. and Yang, Y. (2002) Zipf's law in importance of genes for cancer classification using microarray data. J. Theor. Biol., 219, 539-551.
Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A. and Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18 Suppl. 1, S96-S104.
[Figure: Fitting a mixture of normal distributions to dataset 3b]
