Title: Linear Modeling of Genetic Networks from Experimental Data
1. Linear Modeling of Genetic Networks from Experimental Data
- E.P. van Someren, L.F.A. Wessels and M.J.T. Reinders
- ISMB 2000
- Talk by Kyu-Baek Hwang
2. Abstract
- Topic
- Modeling regulatory interactions between genes
- Linear genetic networks
- Gene expression data
- The dimensionality problem (contribution of this paper)
- The number of genes >> the number of measured time points → many solutions that fit the training data
- Prototypical genes (by clustering): biological genetic networks are sparse and redundant.
- Experiments
- An artificial dataset
- S. cerevisiae yeast cell-cycle dataset
3. Exploitation of DNA Microarray Datasets
- DNA microarray → simultaneous measurements of the expression levels of thousands of genes
- Infer the functionality of genes from these new massive datasets.
- Clustering and pattern recognition techniques (NNs and SVMs)
- The regulatory interactions between genes
- Boolean networks, Bayesian networks, linear networks, neural networks, and differential equations
- Data sparseness problem inherent in the analysis of microarray data → as few parameters as possible
4. Linear Networks
- The basic linear model
- $x_j(t+1) = \sum_{i=1}^{N} r_{i,j}\, x_i(t)$ (1)
- where $x_j(t)$ represents the activity level of gene j at time point t, $r_{i,j}$ represents how strongly gene i controls gene j, and N is the total number of genes under consideration.
- Prototypical genes → hierarchical clustering
- Tackling the dimensionality problem
- Input and output sharing among genes involved within a gene family or pathway
- Genes are estimated to interact with four to eight other genes.
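The basic linear model above can be sketched in a few lines. This is a minimal illustration, assuming an expression matrix laid out as genes x time points; the function name `simulate` and the 3-gene weight matrix are hypothetical and not taken from the paper.

```python
import numpy as np

def simulate(R, x0, n_points):
    """Return an (N genes x n_points) array with x_j(t+1) = sum_i R[i, j] * x_i(t)."""
    R = np.asarray(R, dtype=float)
    X = np.empty((R.shape[0], n_points))
    X[:, 0] = x0
    for t in range(n_points - 1):
        # R.T @ x sums over the regulating gene i for each target gene j.
        X[:, t + 1] = R.T @ X[:, t]
    return X

# Hypothetical 3-gene network (illustrative values only):
# gene 0 activates gene 1, gene 1 represses gene 2.
R = np.array([[0.9, 0.5, 0.0],
              [0.0, 0.8, -0.4],
              [0.0, 0.0, 0.7]])
X = simulate(R, x0=[1.0, 0.0, 0.0], n_points=5)
print(X)
```

Each row of `R` lists the influence of one gene on all others, so a sparse `R` corresponds to the few-interactions assumption stated on this slide.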
5. The Modeling Approach
6. Preprocessing Step I: Thresholding
- Eliminate insignificant signals (genes).
- Due to experimental noise
- Gene expression levels in different cultures under similar conditions
- Vary by up to a factor of two
- Gene expression levels in different cultures under different conditions
- Vary by a factor of two to five
- Genes whose profiles remain below an absolute value of two → do not participate in regulation
- Reduce the dimensionality problem.
- Avoid learning erroneous relationships.
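The thresholding rule can be sketched as a simple row filter. This is an illustration under assumptions: the slides do not state the scale of the expression values, so the limit of 2.0 is applied directly to whatever values the matrix holds, and the function name `threshold_genes` and the sample data are hypothetical.

```python
import numpy as np

def threshold_genes(X, limit=2.0):
    """Keep rows (genes) whose profile reaches |value| >= limit at least once."""
    X = np.asarray(X, dtype=float)
    keep = np.abs(X).max(axis=1) >= limit
    return X[keep], keep

# Hypothetical genes x time points matrix.
X = np.array([[0.1, -0.3, 0.5, 0.2],      # stays inside (-2, 2): dropped
              [0.4, 2.5, 1.0, -0.2],      # exceeds +2 once: kept
              [-0.5, -1.0, -2.2, -0.1]])  # falls below -2 once: kept
filtered, keep = threshold_genes(X)
print(keep.tolist())    # [False, True, True]
print(filtered.shape)   # (2, 4)
```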
7. Preprocessing Step II: Normalization
- If two signals share essentially the same characteristics, they should be very similar after normalization. (Euclidean distance vs. Pearson correlation)
Used in the experiments
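The relation between Euclidean distance and Pearson correlation can be illustrated with one common normalization. The sketch below assumes z-score normalization (zero mean, unit variance per signal), which is not necessarily the scheme used in the paper. Under that assumption, the squared Euclidean distance between two normalized signals of length T equals 2T(1 - r), where r is their Pearson correlation.

```python
import numpy as np

def zscore(X):
    """Normalize each row (signal) to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Hypothetical signals: b has the same shape as a but a different scale and offset.
a = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
b = 10.0 * a + 3.0
c = np.array([5.0, 1.0, 4.0, 2.0, 3.0])

raw = np.vstack([a, b, c])
Z = zscore(raw)
T = raw.shape[1]

d_ab_raw = np.linalg.norm(raw[0] - raw[1])   # large: dominated by scale and offset
d_ab_norm = np.linalg.norm(Z[0] - Z[1])      # essentially zero after normalization
r_ac = np.corrcoef(a, c)[0, 1]
d_ac_sq = np.sum((Z[0] - Z[2]) ** 2)         # matches 2 * T * (1 - r_ac)

print(d_ab_raw, d_ab_norm)
print(d_ac_sq, 2 * T * (1 - r_ac))
```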
8. The Linear Model Calls for Clustering
- A set of measurements of gene expression levels at consecutive time points.
- The linear model is learned by Gaussian elimination.
- P: a particular solution
- H: a basis of homogeneous solutions
- F: a set of free variables
More details on the whiteboard.
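The structure of the solution set (a particular solution plus any combination of homogeneous solutions) can be sketched numerically. Note the substitution: instead of Gaussian elimination, this sketch uses NumPy's least-squares solver for the particular solution and an SVD for the null-space basis. The data are random stand-ins, and the names `A`, `B`, and `C` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 6, 4                       # more genes than time points
X = rng.normal(size=(N, T))       # hypothetical expression matrix, genes x time

A = X[:, :-1].T                   # (T-1) x N: states at time t
B = X[:, 1:].T                    # (T-1) x N: states at time t+1

# Particular solution P: the minimum-norm solution of A @ R = B.
P, *_ = np.linalg.lstsq(A, B, rcond=None)

# Homogeneous basis H: right singular vectors spanning the null space of A.
_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))
H = Vt[rank:].T                   # N x (N - rank)

# Any choice of the free coefficients C yields another exact fit.
C = rng.normal(size=(H.shape[1], N))
R_other = P + H @ C

print(np.allclose(A @ P, B), np.allclose(A @ R_other, B))
```

Both weight matrices reproduce the training data exactly, which is the dimensionality problem in miniature: the data alone cannot choose between them.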
9. Clustering → Prototypes
- Find groups (clusters) of signals based on their similarity.
- Conceptualize the data by representing each cluster with a proper prototype.
- The choice of distance metric is very important.
- Clustering method: complete-linkage hierarchical clustering based on the Euclidean distance measure.
- Prototype
- Reduction of noise in the gene expression levels
- The mean value of all the signals in one cluster
(RMS)
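Complete-linkage clustering followed by mean prototypes can be sketched as below. This is a deliberately naive implementation written for readability, not the authors' code; the function names `complete_linkage` and `prototypes` and the test data are hypothetical. For real datasets a library routine for hierarchical clustering would be the more practical choice.

```python
import numpy as np

def complete_linkage(X, n_clusters):
    """Naive agglomerative clustering of the rows of X with complete linkage."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between signals.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest member distance.
                d = D[np.ix_(clusters[a], clusters[b])].max()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def prototypes(X, clusters):
    """Represent each cluster by the mean of its member signals."""
    X = np.asarray(X, dtype=float)
    return np.array([X[members].mean(axis=0) for members in clusters])

# Hypothetical data: two groups of three noisy copies of a base profile.
rng = np.random.default_rng(1)
base = np.array([[0.0, 1.0, 2.0, 1.0], [3.0, 2.0, 0.0, -1.0]])
X = np.repeat(base, 3, axis=0) + 0.05 * rng.normal(size=(6, 4))

clusters = complete_linkage(X, n_clusters=2)
Y = prototypes(X, clusters)
print(sorted(sorted(c) for c in clusters))   # [[0, 1, 2], [3, 4, 5]]
```

Averaging the members also reduces noise, since independent noise in the individual signals partly cancels in the mean.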
10. Prototypes
- Transforming signals to prototypes
- The inverse
11. Experiments
12. An Artificial Linear System
- An artificial linear system with five genes.
- R5 matrix in graphical representation.
13. Expansion of the System
- Replication of R5 to R25 matrix.
- The (i, j)-th 5 × 5 sub-matrix in R25 is constructed by placing the (i, j)-th entry of R5 on the diagonal, with all other positions in the sub-matrix occupied by zeros.
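The block construction described here matches a Kronecker product of the 5 × 5 matrix with a 5 × 5 identity. A short sketch, using a random matrix as a stand-in for R5 because the actual values are not given in the slides:

```python
import numpy as np

# Hypothetical 5 x 5 weight matrix standing in for R5 (illustrative values only).
rng = np.random.default_rng(2)
R5 = rng.normal(size=(5, 5))

# Each entry R5[i, j] becomes a 5 x 5 block with R5[i, j] on its diagonal.
R25 = np.kron(R5, np.eye(5))

i, j = 1, 3
block = R25[5 * i:5 * (i + 1), 5 * j:5 * (j + 1)]
print(R25.shape)                                   # (25, 25)
print(np.allclose(block, R5[i, j] * np.eye(5)))    # True
```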
14. Time Response for the 25 × 25 System
- Initial values of genes in the same cluster were set to be more similar to each other than to the values of genes in other clusters. (20 time points)
15. Estimation of the Model (1/2)
- Experimental steps (k = 1, ..., 25)
- 1. The set of prototypes, Yk, associated with the clusters in Ck was determined.
- 2. The weight matrix corresponding to each clustering was determined from Yk.
- 3. Given the complete model and the initial state, approximations to the original signals can be computed as follows:
- One-step approximation
- Free-run approximation
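The two approximations can be sketched as follows, assuming the usual meaning of the terms: the one-step approximation predicts each time point from the measured previous time point, while the free-run approximation iterates the model from the initial state alone. The formulas on the original slide are not recoverable, so the names `one_step`, `free_run`, `W`, and `Y` are hypothetical.

```python
import numpy as np

def one_step(W, Y):
    """Predict every time point from the measured previous time point."""
    Y = np.asarray(Y, dtype=float)
    Y_hat = Y.copy()                       # the first column has no predecessor
    Y_hat[:, 1:] = W.T @ Y[:, :-1]
    return Y_hat

def free_run(W, Y):
    """Predict the whole trajectory from the initial state only."""
    Y = np.asarray(Y, dtype=float)
    Y_hat = np.empty_like(Y)
    Y_hat[:, 0] = Y[:, 0]
    for t in range(Y.shape[1] - 1):
        Y_hat[:, t + 1] = W.T @ Y_hat[:, t]
    return Y_hat

# Hypothetical check: noise-free data generated by W is reproduced by both.
W = np.array([[0.9, 0.2], [-0.3, 0.8]])
Y = np.empty((2, 6))
Y[:, 0] = [1.0, 0.5]
for t in range(5):
    Y[:, t + 1] = W.T @ Y[:, t]

print(np.allclose(one_step(W, Y), Y), np.allclose(free_run(W, Y), Y))   # True True
```

On noisy data the two differ: one-step errors do not accumulate because every prediction restarts from a measurement, whereas free-run errors propagate through all later time points.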
16. Estimation of the Model (2/2)
- Experimental steps (continued)
- 4. The mean squared errors (MSE) were computed. (Eos,k for the one-step and Efr,k for the free-run approximation)
- 5. The weighted prototype MSE, Ewp,k, was computed.
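The error measures can be sketched as below. The plain MSE is standard. The weighting used for the weighted prototype MSE is not given in the slides, so weighting each prototype's error by its cluster size is an assumption made here purely for illustration, and both function names are hypothetical.

```python
import numpy as np

def mse(X, X_hat):
    """Mean squared error over all signals and time points."""
    X, X_hat = np.asarray(X, dtype=float), np.asarray(X_hat, dtype=float)
    return float(np.mean((X - X_hat) ** 2))

def weighted_prototype_mse(Y, Y_hat, cluster_sizes):
    """Per-prototype MSE averaged with weights proportional to cluster size.

    The cluster-size weighting is an assumption, not taken from the source.
    """
    Y, Y_hat = np.asarray(Y, dtype=float), np.asarray(Y_hat, dtype=float)
    per_prototype = np.mean((Y - Y_hat) ** 2, axis=1)
    return float(np.average(per_prototype, weights=cluster_sizes))

# Hypothetical prototypes (rows) and their approximations.
Y = np.array([[1.0, 2.0], [0.0, 0.0]])
Y_hat = np.array([[1.0, 0.0], [0.0, 0.0]])
print(mse(Y, Y_hat))                               # 1.0
print(weighted_prototype_mse(Y, Y_hat, [3, 1]))    # 1.5
```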
17. Error Curves
- Error curves as a function of the number of
clusters
18. The Resulting Model
- The resulting model (analysis of multiple causes
is not easy.)
19. Real Experiments: Yeast Data Set
- Gene expression profiles extracted from the 2467 genes in the budding yeast S. cerevisiae by Eisen et al.
- Considered conditions
- Mitotic cell division cycle, sporulation, and temperature and reducing shocks
- Thresholding
- For the ALPHA subset: 18 time points with 45 genes.
- For the CDC15 subset: 15 time points with 113 genes.
20. The Effect of Normalization
- Normalization
- If the number of clusters is equal to or greater than the number of time points minus one, the prototype free-run error is zero. (the limit of the over-constrained condition)
The one-step MSE of all datasets and the four
kinds of normalization
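The zero-error limit can be checked numerically. The sketch assumes the model is fitted by least squares on the T - 1 observed transitions and uses random stand-ins for the prototype signals; `fit_one_step_mse` is a hypothetical helper. Once the number of prototypes K reaches T - 1, the linear system has at least as many unknowns per target as equations and can be fitted exactly.

```python
import numpy as np

def fit_one_step_mse(Y):
    """Least-squares fit of Y[:, t+1] = W.T @ Y[:, t]; returns the one-step MSE."""
    A, B = Y[:, :-1].T, Y[:, 1:].T
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    return float(np.mean((A @ W - B) ** 2))

rng = np.random.default_rng(3)
T = 8                                   # hypothetical number of time points
for K in (3, 5, 7, 9):                  # number of prototypes
    Y = rng.normal(size=(K, T))         # random stand-in for prototype signals
    print(K, fit_one_step_mse(Y))
```

When every one-step prediction is exact, a free run started from the true initial state reproduces the training trajectory as well, so the free-run error on the prototypes also vanishes. A zero training error in this regime therefore says nothing about how well the model generalizes.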
21. Error Curves on the CDC15 Dataset
- Fitting the linear model on the CDC15 subset.
- The error curve
22. The Resulting Model on CDC15
23. Error Curves on the ALPHA Dataset
- Fitting the linear model on the ALPHA subset.
- The error curve
24. The Resulting Model on ALPHA
25. Resulting Model without Normalization
- The resulting model without normalization (more reasonable?)
Mating
26. Summary
- The dimensionality problem was tackled by clustering.
- Biologically sound.
- Linear models were used to represent the relationships between the resulting gene-prototypes.
- Good balance between the model complexity and the accuracy.
- No intrinsic semantics was found.