1
Gene Expression Analysis and Modeling
  • Guillaume Bourque
  • Centre de Recherches Mathématiques
  • Université de Montréal
  • August 2003

2
DNA Microarrays
  • Experiment design
  • Noise reduction
  • Normalization
  • Data analysis

3
Outline
  • Microarray data analysis techniques
  • Clustering: hierarchical and k-means
  • SVD and PCA
  • SVM: Support Vector Machines
  • Gene network modeling
  • Boolean networks
  • Bayesian models
  • Differential equations

5
Gene Expression Data
6
Gene Expression Matrix
Given an experiment with m genes and n assays, we produce a matrix X where xij is the expression level of the ith gene in the jth assay.
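As a concrete illustration, here is a minimal sketch (assuming Python with NumPy; the values are invented) of such a matrix as a 2-D array with genes as rows and assays as columns:

```python
import numpy as np

# Toy expression matrix X: rows = genes, columns = assays.
# X[i, j] is the expression level of gene i in assay j.
X = np.array([
    [2.1, 1.8, 0.3, 0.2],   # gene 1
    [0.1, 0.2, 1.9, 2.2],   # gene 2
    [1.0, 1.1, 0.9, 1.2],   # gene 3
])

m, n = X.shape              # m genes, n assays
print(f"{m} genes x {n} assays")
```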
7
Goals of Clustering
  • Clustering genes
  • Classify genes by their transcriptional response
    and get an idea of how groups of genes are
    regulated.
  • Potentially infer the function of unknown genes.
  • Clustering assays
  • Classify diseased versus normal samples by their
    expression profile.
  • Track the expression levels at different stages
    in the cell.
  • Study the impact of external stimuli.

8
Clustering Genes
[Figure: the expression matrix X, with m genes as rows and n assays as columns; clustering genes compares the rows]
9
Clustering Steps
  • Choose a similarity metric to compare the transcriptional responses or the expression profiles:
  • Pearson correlation
  • Spearman correlation
  • Euclidean distance
  • Choose a clustering algorithm:
  • Hierarchical
  • K-means

10
Similarity Metric
  • Choice of the best metric depends on the
    normalization procedure.
  • Must be cautious of potential pitfalls.
  • Correlations: correlation coefficients range from -1 to 1, with 1 indicating similar behavior, -1 indicating opposite behavior, and 0 indicating no direct relation.
  • Euclidean distance
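As a rough sketch of these three metrics (assuming Python with NumPy and SciPy; the two profiles are invented):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Two toy transcriptional responses (one row of X each).
g1 = np.array([2.1, 1.8, 0.3, 0.2])
g2 = np.array([0.1, 0.2, 1.9, 2.2])

r, _ = pearsonr(g1, g2)      # linear similarity, in [-1, 1]
rho, _ = spearmanr(g1, g2)   # rank-based similarity, in [-1, 1]
d = np.linalg.norm(g1 - g2)  # Euclidean distance, >= 0

print(f"Pearson: {r:.2f}, Spearman: {rho:.2f}, Euclidean: {d:.2f}")
```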

11
Hierarchical Clustering
       g1     g2     g3     g4     g5
g1      -    0.23   0.00   0.95  -0.63
g2             -    0.91   0.56   0.56
g3                    -    0.32   0.77
g4                           -   -0.36
g5                                  -

  • Find the largest value in the similarity matrix.
  • Join the corresponding clusters together.
  • Recompute the similarity matrix and iterate.

12
Hierarchical Clustering
        g1,g4    g2     g3     g5
g1,g4     -     0.37   0.16  -0.52
g2               -     0.91   0.56
g3                      -     0.77
g5                             -

  • Find the largest value in the similarity matrix.
  • Join the corresponding clusters together.
  • Recompute the similarity matrix and iterate.

13
Hierarchical Clustering
        g1,g4  g2,g3    g5
g1,g4     -     0.27  -0.52
g2,g3            -     0.68
g5                      -

  • Find the largest value in the similarity matrix.
  • Join the corresponding clusters together.
  • Recompute the similarity matrix and iterate.

14
Cluster Joining
  • One of the issues with hierarchical clustering is how to recompute the similarity matrix after joining clusters. Here are three common solutions that define different types of hierarchical clustering (see the sketch after this list):
  • Single-link: minimum distance between any member of one cluster and any member of the other cluster.
  • Complete-link: maximum distance between any member of one cluster and any member of the other cluster.
  • Average-link: average distance between members of one cluster and members of the other cluster.
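A sketch of how these linkage choices look in code (assuming Python with SciPy; the expression matrix is synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy expression matrix: 5 genes x 4 assays (made-up values).
X = np.random.default_rng(0).normal(size=(5, 4))

# Condensed pairwise distance matrix (here: correlation distance,
# i.e. 1 - Pearson correlation, so similar genes are close).
D = pdist(X, metric="correlation")

# 'single', 'complete', and 'average' correspond to the three
# cluster-joining rules above.
Z = linkage(D, method="average")

# Cut the tree into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```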

15
Interpreting the Results
16
Clustering Example
Eisen et al. (1998), PNAS, 95(25):14863-14868
17
K-means Clustering
  • Expression profiles are displayed in n-dimensional space.

k = 3
  • The first cluster center is picked at random among all data points.
  • The other cluster centers are picked as far as possible from the previous cluster centers.

18
K-means Clustering
  • Associate each data point to the closest cluster
    center.

k = 3
  • Recompute cluster centers based on new clusters.

19
K-means Clustering
  • Associate each data point to the closest cluster
    center.

k = 3
  • Recompute cluster centers based on new clusters.
  • Iterate until the clusters remain unchanged.
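A compact sketch of the full loop (assuming Python with scikit-learn; the 2-D points are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic expression profiles: 30 points in 2-D around 3 centers.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in (0, 2, 4)])

# 'k-means++' seeding mirrors the "pick centers far apart" idea;
# fitting then alternates assignment and center recomputation
# until the clusters remain unchanged.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(pts)
print(labels, km.cluster_centers_, sep="\n")
```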

22
Singular Value Decomposition
X(m x n) = U(m x n) S(n x n) V^T(n x n),   with n ≤ m

where X is the gene expression matrix.
23
Singular Value Matrix
S(n x n) is diagonal, with the singular values ordered from largest to smallest:

s1 ≥ s2 ≥ … ≥ sk ≥ … ≥ sn
24
Why SVD?
  • SVD extracts from the gene expression matrix:
  • n eigenassays (the columns of U)
  • n eigengenes (the rows of V^T)
  • n singular values
  • We can represent the transcriptional response of
    each gene as a linear combination of the
    eigengenes.
  • We can represent the expression profile of each
    assay as a linear combination of the eigenassays.
  • Allows for dimensionality reduction and for the
    identification of important components.
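A minimal sketch of extracting these quantities (assuming Python with NumPy; X is a synthetic matrix with genes as rows):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # 6 genes x 4 assays (toy data)

# Thin SVD: U is m x n, s holds the n singular values, Vt is n x n.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

eigenassays = U.T                    # n vectors of length m
eigengenes = Vt                      # n vectors of length n

# Each gene's transcriptional response (row of X) is a linear
# combination of the eigengenes with weights U[i, :] * s.
gene0 = (U[0, :] * s) @ Vt
assert np.allclose(gene0, X[0, :])

# Fraction of variance captured by each component.
print(s**2 / np.sum(s**2))
```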

25
SVD and PCA
  • There is a direct correspondence between SVD and PCA (Principal Component Analysis) when PCA is calculated on covariance matrices.
  • If we normalize X so that its columns have zero mean, the eigengenes are the principal components of the transcriptional responses.
  • If we normalize X so that its rows have zero mean, the eigenassays are the principal components of the expression profiles.
  • In both cases, the squares of the singular values are proportional to the variances of the principal components.

26
SVD Special Property
X(r) = U S(r) V^T, where S(r) keeps only the r largest singular values (the non-null rows of S) and sets the rest to zero. X(r) is the closest rank-r approximation of X.
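A small sketch of this truncation (NumPy, synthetic matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the r largest singular values, zero out the rest.
r = 2
s_r = np.concatenate([s[:r], np.zeros(len(s) - r)])
X_r = U @ np.diag(s_r) @ Vt

# X_r is the best rank-r approximation of X in the least-squares
# (Frobenius) sense; the residual shrinks as r grows.
print(np.linalg.norm(X - X_r))
```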
27
Applications of SVD
  • Detects redundancies and allows for the representation of the data with a minimal set of essential features (components). These features can themselves represent signals (e.g. the cell-cycle).
  • Data visualization: SVD can identify subspaces that capture most of the variance in the data, which allows for the visualization of high-dimensional data in a 1-, 2-, or 3-dimensional subspace.
  • Signal extraction in noisy data.

28
Essential Features
Alter et al. (2000), PNAS, 97(18):10101-10106
29
Data Visualization
Yeung and Ruzzo (2001), Bioinformatics, 17(9):763-774
30
Support Vector Machines (SVM)
  • Instead of trying to identify clusters directly in the data, we assume the genes are already pre-clustered into different classes. The goal is to find a model that best predicts these classes.
  • We need to find the hyperplane that best divides the data points.
  • We must do so while minimizing the prediction error rate.
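A minimal sketch of fitting such a hyperplane (assuming Python with scikit-learn; the labeled profiles are synthetic):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-class problem: expression profiles with known labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(2, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)

# A linear kernel looks for the separating hyperplane with the
# largest margin; C trades margin width against training errors.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.score(X, y))               # training accuracy
```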

31
References
  • Clustering
  • DeRisi et al. (1997), Science, 278(5338):680-686.
  • Eisen et al. (1998), PNAS, 95(25):14863-14868.
  • SVD and PCA
  • Alter et al. (2000), PNAS, 97(18):10101-10106.
  • Holter et al. (2000), PNAS, 97(15):8409-8414.
  • Yeung and Ruzzo (2001), Bioinformatics, 17(9):763-774.
  • Wall et al. (2003), A Practical Approach to Microarray Data Analysis, Chapter 5.
  • SVM
  • Brown et al. (2000), PNAS, 97(1):262-267.

32
Outline
  • Microarray data analysis techniques
  • Clustering: hierarchical and k-means
  • SVD and PCA
  • SVM: Support Vector Machines
  • Gene network modeling
  • Boolean networks
  • Bayesian models
  • Differential equations

33
Problem
Time series
34
Boolean Networks
  • Genes are assumed to be ON or OFF.
  • At any given time, combining the gene states
    gives a gene activity pattern (GAP).
  • Given a GAP at time t, a deterministic function (a set of logical rules) provides the GAP at time t+1.
  • GAPs can be classified into attractor and
    transient states.

35
Boolean Network Example
t    0  1  2  3  4
x1   1  1  0  1  1
x2   1  0  0  0  0
x3   1  0  1  1  0

[Diagram: the GAP at time t is mapped to the GAP at time t+1 through a wiring of OR, NOR, and NAND gates]
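The exact wiring is not recoverable from the transcript, but the following minimal sketch (plain Python) uses one OR, one NOR, and one NAND rule that reproduce the table above; the original slide's gate wiring may differ:

```python
def step(x1, x2, x3):
    """One synchronous update of the Boolean network."""
    nx1 = 1 if (x2 or x3) else 0       # OR gate   -> x1(t+1)
    nx2 = 0 if (x1 or x3) else 1       # NOR gate  -> x2(t+1)
    nx3 = 0 if (x1 and x3) else 1      # NAND gate -> x3(t+1)
    return nx1, nx2, nx3

state = (1, 1, 1)                      # GAP at t = 0
for t in range(5):
    print(t, state)
    state = step(*state)               # GAP at t + 1
```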
36
State Space
Picture generated using the program DDLab. Wuensche, A. (1998), Proceedings of Complex Systems '98.
37
Boolean Network Example
Shmulevich et al. (2002), Bioinformatics, 18(2):261-274
38
Issues with Boolean Networks
  • Gene trajectories are continuous, and modeling them as ON/OFF might be inadequate.
  • A deterministic set of logical rules forces a very stringent model.
  • It doesn't allow for external input.
  • Very susceptible to noise.
  • Probabilistic Boolean Networks aim to fix some of these issues by combining multiple sets of rules (related to Bayesian networks).

39
Threshold(s)
[Figure: a continuous expression trajectory is discretized to ON/OFF using one or more thresholds]
40
Bayesian Networks
  • A gene regulatory network is represented by a directed acyclic graph:
  • Vertices correspond to genes.
  • Edges correspond to direct influence or interaction.
  • For each gene xi, a conditional distribution p(xi | parents(xi)) is defined.
  • Together, the graph and the conditional distributions uniquely specify the joint probability distribution.

41
Bayesian Network Example
Conditional distributions: p(x1), p(x2), p(x3 | x2), p(x4 | x1, x2), p(x5 | x4)

By the chain rule,
p(X) = p(x1) p(x2 | x1) p(x3 | x1, x2) p(x4 | x1, x2, x3) p(x5 | x1, x2, x3, x4),
which the network structure simplifies to
p(X) = p(x1) p(x2) p(x3 | x2) p(x4 | x1, x2) p(x5 | x4)
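A tiny sketch of evaluating this factorization numerically (plain Python; the conditional probability tables are invented for illustration):

```python
# Invented CPTs for binary genes; p_x3[a] = P(x3=1 | x2=a), etc.
p_x1 = 0.6                            # P(x1 = 1)
p_x2 = 0.3                            # P(x2 = 1)
p_x3 = {0: 0.2, 1: 0.9}               # P(x3 = 1 | x2)
p_x4 = {(0, 0): 0.1, (0, 1): 0.5,     # P(x4 = 1 | x1, x2)
        (1, 0): 0.7, (1, 1): 0.95}
p_x5 = {0: 0.4, 1: 0.8}               # P(x5 = 1 | x4)

def bern(p, v):
    """Probability of outcome v under a Bernoulli(p)."""
    return p if v == 1 else 1 - p

def joint(x1, x2, x3, x4, x5):
    # p(X) = p(x1) p(x2) p(x3|x2) p(x4|x1,x2) p(x5|x4)
    return (bern(p_x1, x1) * bern(p_x2, x2) * bern(p_x3[x2], x3)
            * bern(p_x4[(x1, x2)], x4) * bern(p_x5[x4], x5))

print(joint(1, 0, 0, 1, 1))
```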
42
Learning Bayesian Models
  • Using gene expression data, the goal is to find the Bayesian network that best matches the data.
  • Recovering the optimal conditional probability distributions when the graph is known is easy.
  • Recovering the structure of the graph is NP-hard.
  • But good statistics are available:
  • What is the likelihood of a specific assignment?
  • What is the distribution of xi given xj?

43
Issues with Bayesian Models
  • Computationally intensive.
  • Requires lots of data.
  • Does not allow for feedback loops, which are known to play an important role in biological networks.
  • Does not make use of the temporal aspect of the data.
  • Dynamic Bayesian Networks aim to solve some of these issues, but they require even more data.

44
Differential Equations
  • Typically uses linear differential equations to model the gene trajectories:

dxi(t)/dt = ai,0 + ai,1 x1(t) + ai,2 x2(t) + … + ai,n xn(t)

  • Several reasons for that choice:
  • A lower number of parameters implies that we are less likely to overfit the data.
  • Sufficient to model complex interactions between the genes.

45
Small Network Example
dx1(t)/dt = 0.491 - 0.248 x1(t)
dx2(t)/dt = -0.473 x3(t) + 0.374 x4(t)
dx3(t)/dt = -0.427 + 0.376 x1(t) - 0.241 x3(t)
dx4(t)/dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)
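A sketch that integrates this system numerically (assuming Python with SciPy; the initial expression levels are invented):

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, x):
    """Right-hand side of the four-gene linear ODE system above."""
    x1, x2, x3, x4 = x
    return [0.491 - 0.248 * x1,
            -0.473 * x3 + 0.374 * x4,
            -0.427 + 0.376 * x1 - 0.241 * x3,
            0.435 * x1 - 0.315 * x3 - 0.437 * x4]

sol = solve_ivp(rhs, t_span=(0, 20), y0=[1.0, 0.5, 0.2, 0.8],
                t_eval=np.linspace(0, 20, 50))
print(sol.y.shape)        # 4 gene trajectories x 50 time points
```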
46
Small Network Example
[Diagram: the four-gene network; each arrow corresponds to one interaction coefficient]

dx1(t)/dt = 0.491 - 0.248 x1(t)
dx2(t)/dt = -0.473 x3(t) + 0.374 x4(t)
dx3(t)/dt = -0.427 + 0.376 x1(t) - 0.241 x3(t)
dx4(t)/dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)
47
Small Network Example
The constant coefficients are 0.491 and -0.427:

dx1(t)/dt = 0.491 - 0.248 x1(t)
dx2(t)/dt = -0.473 x3(t) + 0.374 x4(t)
dx3(t)/dt = -0.427 + 0.376 x1(t) - 0.241 x3(t)
dx4(t)/dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)
48
Problem Revisited
       ai,0    ai,1    ai,2    ai,3    ai,4
x1     .431   -.248     0       0       0
x2      0       0       0     -.473    .374
x3    -.427    .376     0     -.241     0
x4      0      .435     0     -.315   -.437

Given the time-series data, can we find the interaction coefficients?
49
Issues with Differential Equations
  • Even under the simplest linear model, there are m(m+1) unknown parameters to estimate (e.g., 110 parameters for m = 10 genes):
  • m(m-1) directional effects
  • m self effects
  • m constant effects
  • The number of data points is mn, and we typically have n << m (few time points).
  • To avoid overfitting, extra constraints must be incorporated into the model, such as:
  • Smoothness of the equations
  • Sparseness of the network (few non-null interaction coefficients)

50
Algorithm for Network Inference
  • To recover the interaction coefficients, we use stepwise multiple linear regression.
  • Why?
  • The procedure only keeps coefficients that significantly improve the fit of the regression. Hence it limits the number of non-zero coefficients (i.e., it finds sparse networks), a feature we were seeking.
  • It is highly flexible and provides p-value scores which can be interpreted easily.

51
Partial F Test
  • The procedure finds the interaction coefficients iteratively for each gene xi.
  • A partial F test compares the total squared error of the predicted gene trajectory with a specific subset of coefficients added or removed.
  • If the p-value obtained from the test falls below a certain cutoff, the subset of coefficients is significant and will be added or removed.
  • The procedure iterates until no more subsets of coefficients are added or removed.
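A simplified sketch of the idea (assuming Python with NumPy/SciPy): a single forward pass that adds one coefficient at a time when its partial F test is significant, rather than the full add/remove procedure. For each gene xi, y would hold finite-difference estimates of dxi(t)/dt, and A would have a column of ones plus one column per gene trajectory x1, …, xm.

```python
import numpy as np
from scipy.stats import f as f_dist

def rss(A, y):
    """Residual sum of squares of the least-squares fit y ~ A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

def forward_stepwise(A, y, alpha=0.05):
    """Greedily add columns of A that significantly reduce the RSS."""
    n, p = A.shape
    chosen = [0]                       # always keep the constant term
    while True:
        rss_cur = rss(A[:, chosen], y)
        best = None
        for j in (j for j in range(p) if j not in chosen):
            cols = chosen + [j]
            rss_new = rss(A[:, cols], y)
            # Partial F statistic for adding a single coefficient.
            df2 = n - len(cols)
            F = (rss_cur - rss_new) / (rss_new / df2)
            pval = f_dist.sf(F, 1, df2)
            if pval < alpha and (best is None or pval < best[1]):
                best = (j, pval)
        if best is None:
            return chosen              # no significant addition left
        chosen.append(best[0])
```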

52
Simulations
  • It is difficult to find coefficients that produce realistic gene trajectories.
  • We select coefficients such that the resulting trajectories satisfy 3 conditions:
  • They are bounded.
  • The correlation of any pair is not too high.
  • They are not too stable.
  • We add Gaussian noise to model errors.

53
Noise
54
Network Inference
       ai,0    ai,1    ai,2    ai,3    ai,4
x1     .431   -.248     0       0       0
x2      0       0       0     -.473    .374
x3    -.427    .376     0     -.241     0
x4      0      .435     0     -.315   -.437
55
10 Genes
The procedure also perfectly recovers this network with 10 genes and 22 interaction coefficients.
56
Multiple Networks
57
Multiple Network Problem
  • Multiple networks related by a graph or a tree can arise in various situations:
  • Different species
  • Different developmental stages
  • Different tissues
  • The goal is now not only to maximize the fit (with as few interactions as possible) but also to minimize an evolutionary score on the graph of the networks.

58
Multiple Network Inference
  • The stepwise regression algorithm is modified to act directly on the edges of the graph and to take the evolutionary score into account.
  • The inference is done concurrently in all the networks.
  • Results: the comparative framework actually simplifies the inference process, especially when more genes or more noise are involved.

59
Simulation Tests
60
Simple?
61
References
  • Boolean Networks
  • Kauffman (1993), The Origins of Order.
  • Liang et al. (1998), PSB, 3:18-29.
  • Bayesian Networks
  • Friedman et al. (2000), RECOMB 2000.
  • Hartemink et al. (2001), PSB, 6:422-433.
  • Differential Equations
  • Chen et al. (1999), PSB, 4:29-40.
  • D'haeseleer et al. (1999), PSB, 4:41-52.
  • Yeung et al. (2002), PNAS, 99(9):6163-6168.
  • Literature Review
  • De Jong (2002), JCB, 9(1):67-103.