Beyond Co-expression: Gene Network Inference

Beyond Co-expression: Gene Network Inference, by Patrik D'haeseleer, Harvard University, http://genetics.med.harvard.edu/~patrik

Slides: 57

1
Beyond Co-expression: Gene Network Inference

Patrik D'haeseleer
Harvard University
http://genetics.med.harvard.edu/~patrik

2
Beyond Co-expression

Clustering approaches rely on co-expression of
genes under different conditions
Assumes co-expression is caused by co-regulation
We would like to do better than that
Causal inference
What is regulating what?

3
Gene Network Inference
4
Overview

Modeling Issues
Level of biochemical detail
Boolean or continuous?
Deterministic or stochastic?
Spatial or non-spatial?
Data Requirements
Linear Models
Nonlinear models
Conclusions

5
Level of Biochemical Detail

Detailed models require lots of data!
Highly detailed biochemical models are only
feasible for very small systems which are
extensively studied
Example: Arkin et al. (1998), Genetics
149(4):1633-48
lysis-lysogeny switch in Lambda
5 genes, 67 parameters based on 50 years of
research, stochastic simulation required
supercomputer

6
Example: Lysis-Lysogeny
Arkin et al. (1998), Genetics 149(4):1633-48
7
Level of Biochemical Detail

In-depth biochemical simulation of e.g. a whole
cell is infeasible (so far)
Less detailed network models are useful when data
is scarce and/or network structure is unknown
Once network structure has been determined, we
can refine the model

8
Boolean or Continuous?

Boolean Networks (Kauffman (1993), The Origins of
Order) assume ON/OFF gene states.
Allows analysis at the network-level
Provides useful insights in network dynamics
Algorithms for network inference from binary data

9
Boolean or Continuous?

Boolean abstraction is poor fit to real data
Cannot model important concepts
amplification of a signal
subtraction and addition of signals
compensating for smoothly varying environmental
parameter (e.g. temperature, nutrients)
varying dynamical behavior (e.g. cell cycle
period)
Feedback control
negative feedback is used to stabilize expression
→ causes oscillation in Boolean model
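
The feedback point can be illustrated with a toy case. The following is a minimal sketch, assuming synchronous Boolean updates and a single hypothetical gene that represses itself; the function names and the continuous rate law are illustrative assumptions, not taken from the slides. The Boolean version flips state forever, while the continuous version settles to a steady level.

```python
def boolean_self_repressor(state, steps):
    """Synchronous Boolean update x(t+1) = NOT x(t): the gene represses itself."""
    trajectory = [state]
    for _ in range(steps):
        state = not state
        trajectory.append(state)
    return trajectory


def continuous_self_repressor(x, steps, dt=0.1, decay=1.0):
    """Euler integration of dx/dt = 1/(1 + x) - decay * x (hypothetical rate law)."""
    trajectory = [x]
    for _ in range(steps):
        x = x + dt * (1.0 / (1.0 + x) - decay * x)
        trajectory.append(x)
    return trajectory


bool_traj = boolean_self_repressor(True, 6)
cont_traj = continuous_self_repressor(0.0, 200)
print([int(s) for s in bool_traj])   # alternates between 1 and 0
print(round(cont_traj[-1], 3))       # settles near a fixed expression level
```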

10
Deterministic or Stochastic?

Use of concentrations assumes individual
molecules can be ignored
Known examples (in prokaryotes) where stochastic
fluctuations play an essential role (e.g.
lysis-lysogeny in lambda)
Requires stochastic simulation (Arkin et al.
(1998), Genetics 149(4):1633-48), or modeling
molecule counts (e.g. Petri nets, Goss and
Peccoud (1998), PNAS 95(12):6750-5)
Significantly increases model complexity
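
To give a feel for what modeling molecule counts involves, here is a minimal sketch of Gillespie's direct method for a single mRNA species with constant production and first-order decay. This is a generic textbook scheme, not the lambda model of Arkin et al.; the rate constants and function name are hypothetical.

```python
import random


def gillespie_birth_death(k_make, k_decay, t_end, seed=0):
    """Simulate an mRNA copy number with constant production and first-order decay."""
    rng = random.Random(seed)
    t, n = 0.0, 0
    times, counts = [t], [n]
    while True:
        rate_make = k_make           # propensity of producing one molecule
        rate_decay = k_decay * n     # propensity of degrading one molecule
        total = rate_make + rate_decay
        t += rng.expovariate(total)  # waiting time to the next reaction
        if t > t_end:
            break
        if rng.random() * total < rate_make:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


times, counts = gillespie_birth_death(k_make=2.0, k_decay=1.0, t_end=50.0)
print(len(counts), max(counts))
```

Every reaction event is simulated individually, which is why this kind of model becomes expensive as the number of species and reactions grows.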

11
Deterministic or Stochastic?

Eukaryotes: larger cell volume, typically longer
half-lives. Few known stochastic effects.
Yeast: 80% of the transcriptome is expressed at
0.1-2 mRNA copies/cell
Holstege et al. (1998), Cell 95:717-728.
Human: 95% of the transcriptome is expressed at
<5 copies/cell
Velculescu et al. (1997), Cell 88:243-251.

12
Spatial or Non-Spatial

Spatiality introduces additional complexity
intercellular interactions
spatial differentiation
cell compartments
cell types
Spatial patterns also provide more data
e.g. stripe formation in Drosophila
Mjolsness et al. (1991), J. Theor. Biol.
152:429-454.
Few (no?) large-scale spatial gene expression
data sets available so far.

13
Overview

Modeling Issues
Level of biochemical detail
Boolean or continuous?
Deterministic or stochastic?
Spatial or non-spatial?
Data Requirements
Linear Models
Nonlinear models
Conclusions

14
Overview

Modeling Issues
Data Requirements
Lower bounds from information theory
Effect of limited connectivity
Comparison with clustering
Variety of data points needed
Linear Models
Nonlinear models
Conclusions

15
Lower Bounds from Information Theory

How many bits of information are needed just to
specify the connection pattern of a network?
$N^2$ possible connections between $N$ nodes
→ $N^2$ bits needed to specify which connections
are present or absent
$O(N)$ bits of information per data point
→ $O(N)$ data points needed

16
Effect of Limited Connectivity

Assume only $K$ inputs per gene (on average)
→ $NK$ connections out of $N^2$ possible
$\binom{N^2}{NK}$ possible connection patterns
Number of bits needed to fully specify the
connection pattern: $\log_2 \binom{N^2}{NK} = O(NK \log(N/K))$
→ $O(K \log(N/K))$ data points needed

17
Comparison with clustering

Use pairwise correlation comparisons as a
stand-in for clustering
As number of genes increases, number of false
positives will increase as well → need to use
more stringent correlation test
If we want to use the same correlation cutoff
value r, we need to increase the number of data
points as N increases
→ O(log(N)) data points needed
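
The logarithmic growth can be checked numerically. The sketch below is one possible way to do it, assuming the Fisher z approximation for the null distribution of a correlation coefficient and a target of about one expected false positive among all gene pairs; both choices, and the cutoff r = 0.8, are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def samples_needed(n_genes, r_cutoff, expected_false_positives=1.0):
    """Data points M so that about `expected_false_positives` of the N(N-1)/2
    uncorrelated gene pairs exceed |r| >= r_cutoff by chance."""
    n_pairs = n_genes * (n_genes - 1) / 2
    p_per_pair = expected_false_positives / n_pairs     # two-sided tail probability
    z = NormalDist().inv_cdf(1.0 - p_per_pair / 2.0)    # required z-score
    # Under the null, atanh(r) * sqrt(M - 3) is approximately standard normal.
    return math.ceil(3 + (z / math.atanh(r_cutoff)) ** 2)


for n in (100, 1000, 10000):
    print(n, samples_needed(n, r_cutoff=0.8))
```

Each tenfold increase in the number of genes adds a roughly constant number of required data points, which is the O(log(N)) behavior stated above.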

18
Summary

Fully connected: N data points (thousands)
Connectivity K: K log(N/K) data points (hundreds?)
Clustering: log(N) data points (tens)
Additional constraints reduce data requirements
limited connectivity
choice of regulatory functions
Network inference is feasible, but does require
much more data than clustering
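
The three scaling laws can be compared with a few lines of arithmetic. This is a rough sketch that ignores all constant factors, so only the relative ordering is meaningful; the values N = 6000 and K = 3 are hypothetical choices, and the exact bit count is included only as a sanity check on the K log(N/K) estimate.

```python
import math


def data_point_estimates(n_genes, k_inputs):
    """Order-of-magnitude estimates of required data points (constants ignored)."""
    return {
        "fully connected": n_genes,
        "limited connectivity": k_inputs * math.log2(n_genes / k_inputs),
        "clustering": math.log2(n_genes),
    }


def bits_for_connection_pattern(n_genes, k_inputs):
    """Exact log2 of the number of ways to choose N*K connections out of N^2."""
    return math.log2(math.comb(n_genes ** 2, n_genes * k_inputs))


for name, value in data_point_estimates(6000, 3).items():
    print(f"{name}: {value:.1f}")

# Bits per gene for a small hypothetical network (N = 100, K = 3).
print(bits_for_connection_pattern(100, 3) / 100)
```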

19
Variety of Data Points Needed

To unravel regulation of a gene, need to sample
many different combinations of its regulatory
inputs (different environmental conditions and
perturbations)
Time series data yields dynamics, but all data
points are related
Steady-state data yields attractors, can sample
state space more efficiently
Both types of data will be needed, and multiple
data sets of each

20
Overview

Modeling Issues
Data Requirements
Linear Models
Formulation
Underdetermined problem!
Solution 1: reduce N
Solution 2
Nonlinear models
Conclusions

21
Linear Models

Basic model: weighted sum of inputs
Simple network representation
Only first-order approximation
Parameters of the model:
weight matrix containing N x N interaction
weights
Fitting the model: find the parameters w_ji, b_i
such that the model best fits the available data
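
As a concrete reading of the weighted-sum model, the sketch below assumes a discrete-time form in which the next expression level of gene i is the sum over j of w_ji times the current level of gene j, plus b_i. The slide's own equation is not in the transcript, so this form, the two-gene network, and all numbers are assumptions for illustration.

```python
def predict(weights, bias, x):
    """Weighted-sum model: output_i = sum_j weights[j][i] * x[j] + bias[i]."""
    n = len(x)
    return [sum(weights[j][i] * x[j] for j in range(n)) + bias[i] for i in range(n)]


def squared_error(weights, bias, series):
    """Sum of squared one-step prediction errors over a time series."""
    total = 0.0
    for x_now, x_next in zip(series, series[1:]):
        pred = predict(weights, bias, x_now)
        total += sum((p - y) ** 2 for p, y in zip(pred, x_next))
    return total


# Hypothetical 2-gene network: gene 0 activates gene 1, gene 1 represses gene 0.
W = [[0.5, 0.4],
     [-0.3, 0.5]]
b = [0.1, 0.0]
series = [[1.0, 0.0]]
for _ in range(5):
    series.append(predict(W, b, series[-1]))

W_wrong = [[0.5, 0.0],
           [0.0, 0.5]]
print(squared_error(W, b, series), squared_error(W_wrong, b, series))
```

Fitting the model then amounts to searching for the weights and biases that minimize this error on the measured data.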

22
Underdetermined problem!

Assumes fully connected network: need at least as
many data points (arrays, conditions) as
variables (genes)!
Underdetermined (underconstrained, ill-posed)
model: we have many more parameters than data
values to fit
No single solution, rather infinite number of
parameter settings that will all fit the data
equally well
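
A tiny example makes the non-uniqueness concrete. In this sketch (all numbers hypothetical) a single target gene is modeled as a weighted sum of three candidate regulators, but only two data points are available, so quite different weight vectors fit the data exactly.

```python
def fit_error(w, inputs, targets):
    """Sum of squared errors of a single-gene linear model target = w . x."""
    return sum(
        (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
        for x, y in zip(inputs, targets)
    )


# Two arrays (data points), three candidate regulator genes.
inputs = [[1.0, 0.0, 1.0],
          [0.0, 1.0, 1.0]]
targets = [1.0, 1.0]

w_a = [1.0, 1.0, 0.0]   # genes 0 and 1 regulate the target
w_b = [0.0, 0.0, 1.0]   # gene 2 alone regulates the target
w_mix = [0.5 * a + 0.5 * c for a, c in zip(w_a, w_b)]  # any blend also fits

for w in (w_a, w_b, w_mix):
    print(w, fit_error(w, inputs, targets))
```

The three weight vectors imply different regulatory wiring, yet the data cannot distinguish between them.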

23
Solution 1: reduce N

Rather than trying to model all genes, we can
reduce the dimensionality of the problem
Network of clusters: construct a linear model
based on the cluster centroids
rat CNS data (4 clusters): Wahde and Hertz
(2000), Biosystems 55(1-3):129-136.
yeast cell cycle (15-18 clusters): Mjolsness et
al. (2000), Advances in Neural Information
Processing Systems 12; van Someren et al. (2000),
ISMB 2000, 355-366.
Network of Principal Components: linear model
between characteristic modes of the data
Holter et al. (2001), PNAS 98(4):1693-1698.
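
The first step of the network-of-clusters approach, replacing N genes by a handful of cluster centroids, could look like the following sketch. The cluster labels are assumed to come from some earlier clustering step, and the profiles are made-up numbers.

```python
def cluster_centroids(profiles, labels):
    """Average the expression profiles of the genes assigned to each cluster."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], profile)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


# Four hypothetical genes measured at three time points, grouped in two clusters.
profiles = [[1.0, 2.0, 3.0],
            [3.0, 4.0, 5.0],
            [9.0, 7.0, 5.0],
            [7.0, 5.0, 3.0]]
labels = ["up", "up", "down", "down"]
centroids = cluster_centroids(profiles, labels)
print(centroids)
```

A linear model is then fitted between the centroids, so the number of weights drops from N x N to (number of clusters) squared.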

24
Solution 2

Take advantage of additional information
replicates
accuracy of measurements
smoothness of time series
Most likely, the network will still be poorly
constrained.
→ Need a method to identify and extract those
parts of the model that are well-determined and
robust

25
What's next?

Regulatory motifs
once we have identified the corresponding DNA
binding proteins (transcription factors), we can
start building the gene network from there
Integration with other data
transcription factors
functional annotation
known interactions in the literature
protein-protein interactions
protein expression levels
genetic data
...

26
Linking Regulatory Motifs to Expression Data

Patrik D'haeseleer
Harvard University
http://genetics.med.harvard.edu/~patrik

27
Introduction

Gene expression is regulated by Transcription
Factors (TFs), that bind to specific regulatory
motifs in the promoter region of the gene.
Synonyms: regulatory element, regulatory
sequence, promoter elements, promoter motifs,
(TF) binding site, operator (in prokaryotes), ...
Question: Do genes with similar expression
patterns share regulatory motifs?

28
1. Systematic Determination of Genetic Network
Architecture
Tavazoie et al., Nature Genetics 22, 281-285
(1999)
[Figure: plots of normalized expression versus time point; axis labels Time-point 1, Time-point 2, Time-point 3]
29
Search for Motifs in Promoter Regions
5'- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTC
ATGAGAAAAGAGTCAGACATCGAAACATACAT
5'- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTG
CATTTTGTACGTTACTGCGAAATGACTCAACG
5'- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTC
GCATCGCCGAAGTGCCATAAAAAATATTTTTT
5'- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAA
AATTTTCGACAAAATGTATAGTCATTTCTATC
5'- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAAT
TGTCATGCATATGACTCATCCCGAACATGAAA
5'- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTA
GAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5'- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTC
TTTTTTGGAAAGTGTGGCATGTGCTTCACACA
Genes shown: HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3
300-600 bp of upstream sequence per gene are
searched in Saccharomyces cerevisiae.
30
Best Motif Found by AlignACE
31
N = 182
32
Systematic Determination of Genetic Network
Architecture

Tavazoie et al., Nature Genetics 22, 281-285
(1999)
Most motifs found are highly selective for the
cluster they were found in.
Can find many known binding sites for
transcription factors.
Also finds many novel regulatory motifs,
associated with specific functional categories.
1) cluster
2) identify regulatory motifs in clustered genes
3) identify TFs that bind to those motifs
→ Gene regulation network

33
2. Regulatory Element Detection Using Correlation
with Expression
Bussemaker et al., Nature Genetics 27, 167-174
(2001)

What is the contribution of each regulatory motif
(or the TF that binds to that motif) to the
expression level of the genes containing the
motif?
Given a set of known or putative regulatory
motifs, identify all genes that contain the motif
in their promoter region.
For a single expression experiment (e.g. single
point in a time series), is the presence of the
motif correlated with the expression level of the
genes?
Perform multiple regression of (log) expression
level on the presence/absence of the motifs.
Plot contribution of motif throughout time series.

34
Contribution of Motifs to Expression Levels
35
Linear Combination of Motif Contributions

Find the most highly correlated motif.
Determine its contribution Fi to expression level
by linear regression.
Subtract its contribution from the expression
levels.
Find the next highest correlated motif.
Repeat until no more significantly correlated
motifs.
Repeat this entire analysis for each time point
of a time series → weights Fi for the individual
motifs will change throughout the time course.
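
The stepwise procedure above can be sketched for a single time point as follows. This is an illustrative reading of the steps, not the published implementation: a fixed correlation threshold stands in for the significance test, and the motif names, presence vectors, and expression values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def correlation(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def slope(x, y):
    """Least-squares slope of y on x: the contribution F of one motif."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)


def greedy_motif_fit(motif_presence, expression, min_abs_r=0.3):
    """Repeatedly pick the motif most correlated with the residual expression."""
    residual = [e - mean(expression) for e in expression]
    remaining = dict(motif_presence)
    contributions = {}
    while remaining:
        best = max(remaining, key=lambda m: abs(correlation(remaining[m], residual)))
        if abs(correlation(remaining[best], residual)) < min_abs_r:
            break  # stand-in for "no more significantly correlated motifs"
        x = remaining.pop(best)
        f = slope(x, residual)
        mx = mean(x)
        residual = [r - f * (xi - mx) for r, xi in zip(residual, x)]
        contributions[best] = f
    return contributions


# Presence (1) or absence (0) of three hypothetical motifs in eight genes.
motifs = {
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "B": [1, 1, 0, 0, 1, 1, 0, 0],
    "C": [1, 0, 1, 0, 1, 0, 1, 0],
}
# Hypothetical log expression built as 2 * A - 1 * B.
log_expression = [1.0, 1.0, 2.0, 2.0, -1.0, -1.0, 0.0, 0.0]
contributions = greedy_motif_fit(motifs, log_expression)
print(contributions)
```

Running the same function on each time point in turn would give the time course of each motif's contribution.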

36
Time Courses of Regulatory Signals

We can think of the time-varying contributions Fi
of each motif as the Regulatory Signals of the
transcription factors that bind to these motifs

37
Regulatory Element Detection Using Correlation
with Expression

Bussemaker et al., Nature Genetics 27, 167-174
(2001)
Can be used with known regulatory motifs, sets of
putative motifs, and even exhaustively on the set
of all motifs up to a certain length (n ≤ 7).
Known motifs generally have high statistical
significance.
Allows us to infer regulatory inputs of (possibly
unknown) transcription factors.
Accounts for only 30% of total signal present in
genome-wide expression patterns.
Purely linear model: no synergistic effects
between TFs, cooperative binding, etc.

38
3. Identifying Regulatory Networks by
Combinatorial Analysis of Promoter Elements
Pilpel et al., Nature Genetics 29, 153-159 (2001)

Most transcription factors are thought to work in
concert with other TFs.
→ Synergistic effects
Clustering
a motif may occur in more than one cluster,
because it may give rise to different expression
patterns depending on its interaction partners.
several motifs may occur in the same cluster.
Correlation with expression pattern
by itself, a motif may not show a clear
expression pattern.
contributions of multiple motifs may not be
simply additive.

39
Synergy between Mcm1 and SFF in Cell Cycle Data
Set
Mcm1 and SFF were not detected in Tavazoie et
al. Yet TFs that bind these motifs are known to
interact in control of G2 genes (Nature 2000;
406:90-4). Bussemaker et al. found that these
motifs are antagonistic.
40
Expression Coherence and Synergy

Expression Coherence (EC) score indicates how
tightly clustered the expression profiles of a
set of genes are.
For every combination of N = 2, 3 motifs:

1) Calculate the expression coherence score of
the genes that have the N motifs
2) Calculate the expression coherence score of
genes that have every possible subset of N-1
motifs
3) Test (statistically) the hypothesis that the
score of the ORFs with N motifs is significantly
higher than that of ORFs that have any subset of
N-1 motifs
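
One simple way to score how tightly clustered a set of profiles is can be sketched as follows: normalize each profile and report the fraction of gene pairs that lie within a distance threshold. The threshold value and the toy profiles are assumptions for illustration, and the statistical test of step 3 is not included.

```python
import math
from itertools import combinations


def normalize(profile):
    """Shift to zero mean and scale to unit variance (assumes a non-constant profile)."""
    m = sum(profile) / len(profile)
    sd = math.sqrt(sum((v - m) ** 2 for v in profile) / len(profile))
    return [(v - m) / sd for v in profile]


def expression_coherence(profiles, max_distance):
    """Fraction of gene pairs whose normalized profiles lie within max_distance."""
    normed = [normalize(p) for p in profiles]
    pairs = list(combinations(normed, 2))
    if not pairs:
        return 0.0
    close = sum(1 for a, b in pairs if math.dist(a, b) <= max_distance)
    return close / len(pairs)


# Hypothetical genes carrying both motifs: similar profiles.
with_both = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.1, 2.0, 3.1, 4.0]]
# Hypothetical genes carrying only one of the motifs: mixed profiles.
with_one = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0],
            [1.0, 3.0, 1.0, 3.0], [2.0, 4.0, 6.0, 8.0]]

ec_both = expression_coherence(with_both, max_distance=1.0)
ec_one = expression_coherence(with_one, max_distance=1.0)
print(ec_both, ec_one)
```

A motif pair would be flagged as potentially synergistic when the score for genes with both motifs is clearly higher than the score for genes with either motif alone.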
41
The Combinogram
Highly synergistic interaction between MCB and
SFF
Previously unknown
Subsequently predicted via chromatin
immunoprecipitation (ChIP)
(cell cycle data)
42
Identifying Regulatory Networks by Combinatorial
Analysis of Promoter Elements

Pilpel et al., Nature Genetics 29, 153-159 (2001)
Found several known and novel interactions
between regulatory elements active in cell cycle,
sporulation and stress response.
Doesn't assume a specific (e.g. linear) model of
TF interactions.
Combined with TF expression patterns, may allow
us to infer a model of interaction.

43
Protein Networks

Patrik D'haeseleer
Harvard University
http://genetics.med.harvard.edu/~patrik

44
Yeast 2-Hybrid Assays
[Figure: yeast two-hybrid schematic. Labels: Transcription Factor (e.g. Gal4); bait fusion; prey fusion; MATa and MATα strains; reconstituted active TF]
Fields and Song, Nature 340:245-246 (1989)
45
Large-Scale 2-Hybrid Data Sets

Uetz et al., Nature 403:623-627 (2000)
6000 x 192 protein pairs screened using protein
array
nearly all 6000 x 6000 pairs, using pooled prey
libraries
total of 957 putative interactions between 1004
proteins
Ito et al., PNAS 98:4569-4574 (2001)
nearly all 6000 x 6000 pairs, using bait and prey
pools
total of 4549 putative interactions between 3278
proteins
core set of 841 interactions between 797 proteins
Surprisingly little overlap between the data
sets, possibly indicating a large number of
missed interactions (false negatives).
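
Measuring the overlap between two interaction lists is a matter of set intersection once each interaction is stored as an unordered pair. The sketch below uses two small made-up lists; they are not excerpts from the Uetz or Ito data sets.

```python
def as_pairs(interactions):
    """Store each interaction as an unordered pair so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in interactions}


# Toy interaction lists for illustration only.
set_one = as_pairs([("STE4", "STE5"), ("FAR1", "CDC24"), ("AKR1", "STE3")])
set_two = as_pairs([("STE5", "STE4"), ("FUS3", "DIG2"),
                    ("STE3", "AKR1"), ("NET1", "SAS10")])

shared = set_one & set_two
overlap_fraction = len(shared) / len(set_one | set_two)
print(len(shared), round(overlap_fraction, 2))
```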

46
Intersections between Protein Interaction Data
Sets
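[Figure: Venn diagrams of the overlap between protein interaction data sets. Set sizes: MIPS 1546, Ito full 4475, Ito core 806, Uetz 947. Region counts as extracted, in order: 1415, 1436, 49, 28, 54, 28, 21, 61, 648, 109, 156, 709, 4242, 756.]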
47
Causes of False Positives

Bait acts as activator
Bait interacts with endogenous activator
Prey binds to DNA
Prey interacts with endogenous transcription
factors
Bait interacts with Activation Domain
Prey interacts with DNA Binding Domain
Sticky proteins (nonspecific binding)
Changes in plasmid copy number
Various other artifacts
. . .

48
Yeast Protein-Protein Interaction Map
Each node is a protein
Each line is an interaction
5560 putative interactions
3725 different proteins
about 3 interactions per protein
Uetz, Schwikowski, Fields and co-workers
Ito and co-workers
49
Membrane Proteins
Transcription Factors
50
[Figure legend: membrane protein; DNA-binding protein; all other yeast proteins; physical interaction between two proteins]
51
Problem: How to Rank Possible Pathways?
Possible paths from Ste3: 1045 different paths
to 143 transcription factors
[Figure labels: Ste2/3, Ste12]
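
Candidate paths like these can be enumerated with a depth-first search over the interaction graph. The following is a minimal sketch; the edge list is a toy graph chosen for illustration, not measured interaction data, and the length limit is a hypothetical parameter.

```python
def build_graph(edges):
    """Undirected adjacency lists from a list of interacting protein pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def simple_paths(graph, start, targets, max_len):
    """Depth-first enumeration of loop-free paths from start to any target,
    using at most max_len interactions."""
    paths = []

    def extend(path):
        node = path[-1]
        if node in targets and len(path) > 1:
            paths.append(list(path))
        if len(path) - 1 >= max_len:
            return
        for neighbor in graph.get(node, ()):
            if neighbor not in path:
                path.append(neighbor)
                extend(path)
                path.pop()

    extend([start])
    return paths


toy_edges = [("STE3", "AKR1"), ("AKR1", "STE5"), ("AKR1", "STE4"),
             ("STE5", "STE4"), ("STE4", "FAR1"), ("FAR1", "FUS3"),
             ("FUS3", "DIG2"), ("DIG2", "STE12")]
graph = build_graph(toy_edges)
paths = simple_paths(graph, "STE3", {"STE12"}, max_len=7)
for p in paths:
    print(" -> ".join(p))
```

Even this small graph yields more than one route to the same transcription factor, which is why a ranking criterion is needed.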
52
Rank Predicted Paths by Degree of Expression
Correlation from Microarray Experiments

Known pathways often show correlated expression
Known interacting proteins often show correlated
expression

Average Pairwise Correlation Coefficient Among
Pathway Members
STE3→AKR1→STE5→STE4→FAR1→CDC24→SOH1: 0.190
STE3→AKR1→IQG1→CDC42→BEM4→RHO1→SKN7: 0.059
STE3→AKR1→STE4→FAR1→FUS3→DIG2→STE12: 0.281
STE3→AKR1→GCS1→YGL198W→SAS10→NET1: -0.106
Microarray data downloaded from Rosetta
Inpharmatics
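
The ranking score, the average pairwise correlation coefficient among pathway members, can be computed as in this sketch. The gene names and expression profiles are hypothetical placeholders, not the Rosetta data.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_pairwise_correlation(path, expression):
    """Mean Pearson correlation over all pairs of proteins on a path."""
    pairs = list(combinations(path, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


# Hypothetical expression profiles (four conditions) for made-up genes.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [1.0, 1.5, 3.0, 3.5],
    "geneD": [4.0, 1.0, 3.0, 2.0],
}
paths = [["geneA", "geneB", "geneD"], ["geneA", "geneB", "geneC"]]
ranked = sorted(paths, key=lambda p: average_pairwise_correlation(p, expression),
                reverse=True)
for p in ranked:
    print(" -> ".join(p), round(average_pairwise_correlation(p, expression), 3))
```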
53
Classical View of MAPK Pathways
adapted from C. Roberts et al., Science 287, 873
(2000)
54
The Protein Network View

Highly interconnected, not just a linear pathway!
Some proteins are missing from the protein
interaction data sets (Cdc42, Ste20).
Includes several additional proteins (especially
Akr1, Kss1).

55
Conclusions

Protein interaction data and expression data are
both noisy. Combining them increases the
accuracy.
Can estimate protein interaction error rates by
looking at consistency between data sets →
probabilistic interaction model (work in
progress).
Pathways are far more interconnected than often
portrayed.
Can integrate various other forms of data
co-localization of proteins
homology with known interacting proteins
Rosetta Stone method

56
Acknowledgments
Roland Somogyi, Stefanie Fuhrman, Xiling Wen
NCGR: Jason Stewart, Pedro Mendes
Harvard: Tzachi Pilpel, Martin Steffen, Allegra
Petti, John Aach, George Church
UNM: Stephanie Forrest, Andreas Wagner, David
Peabody, Barak Pearlmutter
The Santa Fe Institute
