Title: Beyond Co-expression: Gene Network Inference
1Beyond Co-expression: Gene Network Inference
- Patrik D'haeseleer
- Harvard University
- http://genetics.med.harvard.edu/patrik
2Beyond Co-expression
- Clustering approaches rely on co-expression of genes under different conditions
- Assumes co-expression is caused by co-regulation
- We would like to do better than that
- Causal inference
- What is regulating what?
3Gene Network Inference
4Overview
- Modeling Issues
- Level of biochemical detail
- Boolean or continuous?
- Deterministic or stochastic?
- Spatial or non-spatial?
- Data Requirements
- Linear Models
- Nonlinear models
- Conclusions
5Level of Biochemical Detail
- Detailed models require lots of data!
- Highly detailed biochemical models are only feasible for very small systems which are extensively studied
- Example: Arkin et al. (1998), Genetics 149(4):1633-48
- lysis-lysogeny switch in Lambda
- 5 genes, 67 parameters based on 50 years of research, stochastic simulation required supercomputer
6Example: Lysis-Lysogeny
Arkin et al. (1998), Genetics 149(4):1633-48
7Level of Biochemical Detail
- In-depth biochemical simulation of e.g. a whole cell is infeasible (so far)
- Less detailed network models are useful when data is scarce and/or network structure is unknown
- Once network structure has been determined, we can refine the model
8Boolean or Continuous?
- Boolean Networks (Kauffman (1993), The Origins of Order) assume ON/OFF gene states.
- Allows analysis at the network level
- Provides useful insights into network dynamics
- Algorithms for network inference from binary data
9Boolean or Continuous?
- Boolean abstraction is a poor fit to real data
- Cannot model important concepts
- amplification of a signal
- subtraction and addition of signals
- compensating for smoothly varying environmental parameter (e.g. temperature, nutrients)
- varying dynamical behavior (e.g. cell cycle period)
- Feedback control
- negative feedback is used to stabilize expression
- → causes oscillation in a Boolean model
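The feedback point can be illustrated with a minimal sketch. The update rules, rate constants and function names below are illustrative assumptions, not taken from the slides: a single self-repressing gene is run once under a synchronous Boolean update and once under a simple continuous model.

```python
def boolean_self_repression(state, steps):
    """Synchronous Boolean update for a self-repressing gene: x(t+1) = NOT x(t)."""
    trajectory = [state]
    for _ in range(steps):
        state = not state
        trajectory.append(state)
    return trajectory


def continuous_self_repression(x, steps, dt=0.1, k=1.0, decay=1.0):
    """Euler integration of an assumed model dx/dt = k / (1 + x) - decay * x."""
    trajectory = [x]
    for _ in range(steps):
        x += dt * (k / (1.0 + x) - decay * x)
        trajectory.append(x)
    return trajectory


bool_traj = boolean_self_repression(True, 6)
cont_traj = continuous_self_repression(0.0, 200)
print(bool_traj)                # the Boolean gene flips ON/OFF at every step
print(round(cont_traj[-1], 3))  # the continuous gene settles at a steady level
```

In the continuous version the same negative feedback damps the expression level toward a fixed point, which is the stabilizing behavior a two-state model cannot represent.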
10Deterministic or Stochastic?
- Use of concentrations assumes individual molecules can be ignored
- Known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis-lysogeny in lambda)
- Requires stochastic simulation (Arkin et al. (1998), Genetics 149(4):1633-48), or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5)
- Significantly increases model complexity
11Deterministic or Stochastic?
- Eukaryotes: larger cell volume, typically longer half-lives. Few known stochastic effects.
- Yeast: 80% of the transcriptome is expressed at 0.1-2 mRNA copies/cell (Holstege et al. (1998), Cell 95:717-728).
- Human: 95% of the transcriptome is expressed at <5 copies/cell (Velculescu et al. (1997), Cell 88:243-251).
12Spatial or Non-Spatial
- Spatiality introduces additional complexity
- intercellular interactions
- spatial differentiation
- cell compartments
- cell types
- Spatial patterns also provide more data
- e.g. stripe formation in Drosophila
- Mjolsness et al. (1991), J. Theor. Biol. 152:429-454.
- Few (no?) large-scale spatial gene expression data sets available so far.
14Overview
- Modeling Issues
- Data Requirements
- Lower bounds from information theory
- Effect of limited connectivity
- Comparison with clustering
- Variety of data points needed
- Linear Models
- Nonlinear models
- Conclusions
15Lower Bounds from Information Theory
- How many bits of information are needed just to specify the connection pattern of a network?
- N^2 possible connections between N nodes
- → N^2 bits needed to specify which connections are present or absent
- O(N) bits of information per data point
- → O(N) data points needed
16Effect of Limited Connectivity
- Assume only K inputs per gene (on average)
- → NK connections out of N^2 possible
- C(N^2, NK) possible connection patterns (binomial coefficient)
- Number of bits needed to fully specify the connection pattern: log2 C(N^2, NK) = O(NK log(N/K))
- → O(K log(N/K)) data points needed
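The two counting arguments can be checked numerically. This is a rough sketch under the slides' own assumption that each data point carries on the order of N bits; the values N = 1000 and K = 3 and all function names are illustrative choices.

```python
import math


def log2_binomial(n, k):
    """log2 of the binomial coefficient C(n, k), via log-gamma to avoid huge integers."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)


def bits_fully_connected(num_genes):
    """One bit per possible connection: N^2 bits."""
    return num_genes ** 2


def bits_limited_connectivity(num_genes, inputs_per_gene):
    """Bits needed to pick N*K connections out of N^2 possible ones."""
    return log2_binomial(num_genes ** 2, num_genes * inputs_per_gene)


def data_points_needed(bits, num_genes):
    """Assumes each data point (one array) supplies on the order of N bits."""
    return bits / num_genes


N, K = 1000, 3  # illustrative values only
full = data_points_needed(bits_fully_connected(N), N)
sparse = data_points_needed(bits_limited_connectivity(N, K), N)
approx = K * math.log2(N / K)
print(f"fully connected:   ~{full:.0f} data points")
print(f"K inputs per gene: ~{sparse:.1f} data points (K*log2(N/K) = {approx:.1f})")
```

These are order-of-magnitude lower bounds, not estimates of how much data a particular inference algorithm would need in practice.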
17Comparison with clustering
- Use pairwise correlation comparisons as a stand-in for clustering
- As number of genes increases, number of false positives will increase as well → need to use more stringent correlation test
- If we want to use the same correlation cutoff value r, we need to increase the number of data points as N increases
- → O(log(N)) data points needed
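The logarithmic growth can be sketched with a standard approximation that is an assumption of this example, not something stated on the slide: a Fisher z-transform test for each pairwise correlation, with a Bonferroni correction over all N(N-1)/2 gene pairs. The cutoff r = 0.8 and alpha = 0.05 are illustrative.

```python
import math
from statistics import NormalDist


def data_points_for_cutoff(num_genes, r_cutoff=0.8, alpha=0.05):
    """Smallest number of data points M at which a correlation of r_cutoff
    remains significant after Bonferroni correction over all gene pairs.

    Uses the Fisher z-transform: atanh(r) * sqrt(M - 3) is approximately
    standard normal when the true correlation is zero.
    """
    num_pairs = num_genes * (num_genes - 1) // 2
    per_test_alpha = alpha / num_pairs
    z_crit = -NormalDist().inv_cdf(per_test_alpha / 2)  # two-sided test
    return math.ceil(3 + (z_crit / math.atanh(r_cutoff)) ** 2)


for n in (100, 1000, 10000):
    print(n, data_points_for_cutoff(n))
```

Because the squared critical value grows roughly like the logarithm of the number of pairs, a hundredfold increase in the number of genes raises the required number of data points by much less than a factor of two.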
18Summary
- Fully connected: N (thousands)
- Connectivity K: K log(N/K) (hundreds?)
- Clustering: log(N) (tens)
- Additional constraints reduce data requirements
- limited connectivity
- choice of regulatory functions
- Network inference is feasible, but does require
much more data than clustering
19Variety of Data Points Needed
- To unravel regulation of a gene, need to sample many different combinations of its regulatory inputs (different environmental conditions and perturbations)
- Time series data yields dynamics, but all data points are related
- Steady-state data yields attractors, can sample state space more efficiently
- Both types of data will be needed, and multiple data sets of each
20Overview
- Modeling Issues
- Data Requirements
- Linear Models
- Formulation
- Underdetermined problem!
- Solution 1: reduce N
- Solution 2
- Nonlinear models
- Conclusions
21Linear Models
- Basic model: weighted sum of inputs
- Simple network representation
- Only first-order approximation
- Parameters of the model: weight matrix containing N x N interaction weights
- Fitting the model: find the parameters w_ji, b_i such that the model best fits the available data
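The slide's equation did not survive extraction. As a sketch of a "weighted sum of inputs" model, the code below assumes a discrete-time form x_i(t+1) = sum_j w_ji x_j(t) + b_i, with w_ji read as the influence of gene j on gene i. That form, the 3-gene network and all numbers are illustrative assumptions.

```python
def linear_step(x, weights, bias):
    """One update of an assumed discrete-time linear model:
    x_i(t+1) = sum_j weights[j][i] * x_j(t) + bias[i]
    where weights[j][i] is the influence of gene j on gene i.
    """
    n = len(x)
    return [sum(weights[j][i] * x[j] for j in range(n)) + bias[i] for i in range(n)]


def simulate(x0, weights, bias, steps):
    """Iterate the linear update and return the list of visited states."""
    trajectory = [x0]
    for _ in range(steps):
        trajectory.append(linear_step(trajectory[-1], weights, bias))
    return trajectory


# Hypothetical 3-gene network: gene 0 activates gene 1, gene 1 represses gene 2.
weights = [
    [0.5, 0.8, 0.0],   # outputs of gene 0
    [0.0, 0.5, -0.6],  # outputs of gene 1
    [0.0, 0.0, 0.5],   # outputs of gene 2
]
bias = [0.5, 0.0, 1.0]
trajectory = simulate([1.0, 0.0, 0.0], weights, bias, 3)
for state in trajectory:
    print([round(v, 3) for v in state])
```

Fitting the model means choosing `weights` and `bias` so that simulated states match measured expression levels, for example by least squares.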
22Underdetermined problem!
- Assumes fully connected network: need at least as many data points (arrays, conditions) as variables (genes)!
- Underdetermined (underconstrained, ill-posed) model: we have many more parameters than data values to fit
- No single solution, rather an infinite number of parameter settings that will all fit the data equally well
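A toy illustration of the underdetermined case, with invented numbers: two data points for a gene with three candidate inputs. Any multiple of a direction orthogonal to both data points can be added to the weights without changing a single prediction.

```python
def predict(weights, sample):
    """Weighted sum of the inputs for one data point."""
    return sum(w * x for w, x in zip(weights, sample))


def cross(a, b):
    """Cross product: a vector orthogonal to both 3-vectors a and b."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


# Two data points (arrays) for three input genes: fewer equations than unknowns.
samples = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
w_a = [0.5, -1.0, 2.0]                     # one candidate weight vector
null_dir = cross(samples[0], samples[1])   # changes no prediction on these samples
w_b = [w + 3.0 * d for w, d in zip(w_a, null_dir)]

for s in samples:
    print(predict(w_a, s), predict(w_b, s))  # same fit from different weights
```

With N genes and M < N data points the same thing happens in N - M independent directions, which is why extra constraints are needed.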
23Solution 1: reduce N
- Rather than trying to model all genes, we can reduce the dimensionality of the problem
- Network of clusters: construct a linear model based on the cluster centroids
- rat CNS data (4 clusters): Wahde and Hertz (2000), Biosystems 55(1-3):129-136.
- yeast cell cycle (15-18 clusters): Mjolsness et al. (2000), Advances in Neural Information Processing Systems 12; van Someren et al. (2000), ISMB 2000, 355-366.
- Network of Principal Components: linear model between characteristic modes of the data
- Holter et al. (2001), PNAS 98(4):1693-1698.
24Solution 2
- Take advantage of additional information
- replicates
- accuracy of measurements
- smoothness of time series
- ...
- Most likely, the network will still be poorly constrained.
- → Need a method to identify and extract those parts of the model that are well-determined and robust
25What's next?
- Regulatory motifs
- once we have identified the corresponding DNA-binding proteins (transcription factors), we can start building the gene network from there
- Integration with other data
- transcription factors
- functional annotation
- known interactions in the literature
- protein-protein interactions
- protein expression levels
- genetic data
- ...
26Linking Regulatory Motifs to Expression Data
- Patrik D'haeseleer
- Harvard University
- http://genetics.med.harvard.edu/patrik
27Introduction
- Gene expression is regulated by Transcription Factors (TFs) that bind to specific regulatory motifs in the promoter region of the gene.
- Synonyms: regulatory element, regulatory sequence, promoter elements, promoter motifs, (TF) binding site, operator (in prokaryotes), ...
- Question: Do genes with similar expression patterns share regulatory motifs?
28 1. Systematic Determination of Genetic Network Architecture
Tavazoie et al., Nature Genetics 22, 281-285 (1999)
[Figure: genes plotted by expression at Time-point 1, Time-point 2 and Time-point 3; three panels of Normalized Expression vs. Time-point]
29Search for Motifs in Promoter Regions
5'- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT  HIS7
5'- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG  ARO4
5'- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT  ILV6
5'- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC  THR4
5'- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA  ARO1
5'- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA  HOM2
5'- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA  PRO3
300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.
30Best Motif Found by AlignACE
31N182
32Systematic Determination of Genetic Network Architecture
- Tavazoie et al., Nature Genetics 22, 281-285 (1999)
- Most motifs found are highly selective for the cluster they were found in.
- Can find many known binding sites for transcription factors.
- Also finds many novel regulatory motifs, associated with specific functional categories.
- 1) cluster
- 2) identify regulatory motifs in clustered genes
- 3) identify TFs that bind to those motifs
- → Gene regulation network
33 2. Regulatory Element Detection Using Correlation with Expression
Bussemaker et al., Nature Genetics 27, 167-174 (2001)
- What is the contribution of each regulatory motif (or the TF that binds to that motif) to the expression level of the genes containing the motif?
- Given a set of known or putative regulatory motifs, identify all genes that contain the motif in their promoter region.
- For a single expression experiment (e.g. single point in a time series), is the presence of the motif correlated with the expression level of the genes?
- Perform multiple regression of (log) expression level on the presence/absence of the motifs.
- Plot contribution of motif throughout time series.
34Contribution of Motifs to Expression Levels
35Linear Combination of Motif Contributions
- Find the most highly correlated motif.
- Determine its contribution Fi to expression level by linear regression.
- Subtract its contribution from the expression levels.
- Find the next highest correlated motif.
- Repeat until no more significantly correlated motifs.
- Repeat this entire analysis for each time point of a time series → weights Fi for the individual motifs will change throughout the time course.
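The iterative procedure on this slide can be sketched as a greedy fit on residuals. This is a simplified illustration rather than the published implementation: the fixed correlation threshold `min_abs_r` stands in for a proper significance test, and the motif names A, B, C and the toy data are invented.

```python
def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)


def greedy_motif_fit(log_expr, motif_presence, min_abs_r=0.5):
    """motif_presence maps motif name -> list of 0/1 flags, one per gene.
    min_abs_r is a crude, assumed stand-in for a significance test."""
    center = mean(log_expr)
    residual = [e - center for e in log_expr]
    contributions = {}
    remaining = dict(motif_presence)
    while remaining:
        # Pick the motif most strongly correlated with what is still unexplained.
        best = max(remaining, key=lambda m: abs(pearson(remaining[m], residual)))
        if abs(pearson(remaining[best], residual)) < min_abs_r:
            break
        presence = remaining.pop(best)
        f = slope(presence, residual)  # contribution F of this motif
        mp = mean(presence)
        residual = [r - f * (p - mp) for r, p in zip(residual, presence)]
        contributions[best] = f
    return contributions


# Invented toy data for one time point: 8 genes, three hypothetical motifs.
motifs = {
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "B": [1, 1, 0, 0, 1, 1, 0, 0],
    "C": [1, 0, 1, 0, 1, 0, 1, 0],
}
log_expr = [0.5, 0.5, 1.0, 1.0, -0.5, -0.5, 0.0, 0.0]
fit = greedy_motif_fit(log_expr, motifs)
print(fit)
```

Running the same function on each time point of a series would give one contribution per motif per time point, which is the time course the slide refers to.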
36Time Courses of Regulatory Signals
- We can think of the time-varying contributions Fi
of each motif as the Regulatory Signals of the
transcription factors that bind to these motifs
37Regulatory Element Detection Using Correlation with Expression
- Bussemaker et al., Nature Genetics 27, 167-174 (2001)
- Can be used with known regulatory motifs, sets of putative motifs, and even exhaustively on the set of all motifs up to a certain length (n ≤ 7).
- Known motifs generally have high statistical significance.
- Allows us to infer regulatory inputs of (possibly unknown) transcription factors.
- Accounts for only 30% of total signal present in genome-wide expression patterns.
- Purely linear model: no synergistic effects between TFs, cooperative binding, etc.
38 3. Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements
Pilpel et al., Nature Genetics 29, 153-159 (2001)
- Most transcription factors are thought to work in concert with other TFs.
- → Synergistic effects
- Clustering
- a motif may occur in more than one cluster, because it may give rise to different expression patterns depending on its interaction partners.
- several motifs may occur in the same cluster.
- Correlation with expression pattern
- by itself, a motif may not show a clear expression pattern.
- contributions of multiple motifs may not be simply additive.
39Synergy between Mcm1 and SFF in Cell Cycle Data Set
Mcm1 and SFF were not detected in Tavazoie et al. Yet TFs that bind these motifs are known to interact in control of G2 genes (Nature 406:90-4 (2000)). Bussemaker et al. found that these motifs are antagonistic.
40Expression Coherence and Synergy
- Expression Coherence (EC) score indicates how tightly clustered the expression profiles of a set of genes are.
- For every combination of N = 2, 3 motifs:
- 1) Calculate the expression coherence score of the genes that have the N motifs
- 2) Calculate the expression coherence score of genes that have every possible subset of N-1 motifs
- 3) Test (statistically) the hypothesis that the score of the ORFs with N motifs is significantly higher than that of ORFs that have any subset of N-1 motifs
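The slide does not give a formula for the EC score. The sketch below assumes one common formulation: the fraction of gene pairs whose normalized expression profiles lie within a Euclidean distance threshold. The threshold value, the function names and the toy profiles are assumptions, and the statistical test of step 3 is not implemented.

```python
import itertools
import math


def normalize(profile):
    """Shift to zero mean and scale to unit standard deviation.
    Assumes the profile is not constant."""
    m = sum(profile) / len(profile)
    sd = math.sqrt(sum((v - m) ** 2 for v in profile) / len(profile))
    return [(v - m) / sd for v in profile]


def expression_coherence(profiles, max_distance):
    """Fraction of gene pairs whose normalized profiles lie within max_distance.
    Needs at least two profiles."""
    normed = [normalize(p) for p in profiles]
    pairs = list(itertools.combinations(normed, 2))
    close = sum(1 for a, b in pairs if math.dist(a, b) <= max_distance)
    return close / len(pairs)


# Invented profiles: genes carrying both motifs vs. genes carrying only one.
with_both = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 2, 3, 5]]
only_one = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4], [2, 1, 4, 3]]
ec_both = expression_coherence(with_both, max_distance=1.0)
ec_one = expression_coherence(only_one, max_distance=1.0)
print(ec_both, ec_one)
```

A motif pair would be called synergistic when the score for genes with both motifs is significantly higher than the score for genes with either motif alone.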
41The Combinogram
Highly synergistic interaction between MCB and SFF (cell cycle data). Previously unknown. Subsequently predicted via chromatin immunoprecipitation (ChIP).
42Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements
- Pilpel et al., Nature Genetics 29, 153-159 (2001)
- Found several known and novel interactions between regulatory elements active in cell cycle, sporulation and stress response.
- Doesn't assume a specific (e.g. linear) model of TF interactions.
- Combined with TF expression patterns, may allow us to infer a model of interaction.
43Protein Networks
- Patrik D'haeseleer
- Harvard University
- http://genetics.med.harvard.edu/patrik
44Yeast 2-Hybrid Assays
[Figure: yeast two-hybrid assay. Labels: Transcription Factor (e.g. Gal4); bait fusion; prey fusion; MATa; MATα; Reconstituted active TF]
Fields and Song, Nature 340:245-246 (1989)
45Large-Scale 2-Hybrid Data Sets
- Uetz et al., Nature 403:623-627 (2000)
- 6000 x 192 protein pairs screened using protein array
- nearly all 6000 x 6000 pairs, using pooled prey libraries
- total of 957 putative interactions between 1004 proteins
- Ito et al., PNAS 98:4569-4574 (2001)
- nearly all 6000 x 6000 pairs, using bait and prey pools
- total of 4549 putative interactions between 3278 proteins
- core set of 841 interactions between 797 proteins
- Surprisingly little overlap between the data sets, possibly indicating a large number of missed interactions (false negatives).
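46Intersections between Protein Interaction Data Sets
[Figure: two Venn diagrams of overlap between interaction data sets. Data set sizes: MIPS 1546; Ito full 4475; Ito core 806; Uetz 947. Region counts as extracted (assignment to regions not recoverable): 1415, 1436, 49, 28, 54, 28, 21, 61, 648, 109, 156, 709, 4242, 756]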
47Causes of False Positives
- Bait acts as activator
- Bait interacts with endogenous activator
- Prey binds to DNA
- Prey interacts with endogenous transcription factors
- Bait interacts with Activation Domain
- Prey interacts with DNA Binding Domain
- Sticky proteins (nonspecific binding)
- Changes in plasmid copy number
- Various other artifacts
- . . .
48 Yeast Protein-Protein Interaction Map
- Each node is a protein
- Each line is an interaction
- 5560 putative interactions
- 3725 different proteins
- 3 interactions / protein
- Data: Uetz, Schwikowski, Fields and co-workers; Ito and co-workers
49Membrane Proteins
Transcription Factors
50[Figure legend: membrane protein; DNA-binding protein; all other yeast proteins; physical interaction between two proteins]
51Problem: How to Rank Possible Pathways?
Possible paths from Ste3: 1045 different paths to 143 transcription factors
[Figure labels: Ste2/3, Ste12]
52 Rank Predicted Paths by Degree of Expression Correlation from Microarray Experiments
- Known pathways often show correlated expression
- Known interacting proteins often show correlated expression
Average Pairwise Correlation Coefficient Among Pathway Members
STE3→AKR1→STE5→STE4→FAR1→CDC24→SOH1: 0.190
STE3→AKR1→IQG1→CDC42→BEM4→RHO1→SKN7: 0.059
STE3→AKR1→STE4→FAR1→FUS3→DIG2→STE12: 0.281
STE3→AKR1→GCS1→YGL198W→SAS10→NET1: -0.106
Microarray data downloaded from Rosetta Inpharmatics
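A minimal sketch of the ranking idea: score each candidate path by the average pairwise Pearson correlation of its members' expression profiles, then sort. The gene names and expression values here are invented placeholders, not the data behind the numbers on the slide.

```python
import itertools


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_pairwise_correlation(path, expression):
    """Mean correlation over all pairs of proteins on the path."""
    pairs = list(itertools.combinations(path, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


def rank_paths(paths, expression):
    """Return (score, path) tuples, best-correlated path first."""
    scored = [(average_pairwise_correlation(p, expression), p) for p in paths]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical expression profiles across four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
paths = [["geneA", "geneC", "geneD"], ["geneA", "geneB", "geneD"]]
ranked = rank_paths(paths, expression)
for score, path in ranked:
    print("→".join(path), round(score, 3))
```

Averaging over all pairs, rather than only adjacent pairs, rewards paths whose members are co-expressed as a group; other scoring choices are equally possible.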
53Classical View of MAPK Pathways
adapted from C. Roberts et al., Science 287, 873 (2000)
54The Protein Network View
- Highly interconnected, not just a linear pathway!
- Some proteins are missing from the protein interaction data sets (Cdc42, Ste20).
- Includes several additional proteins (especially Akr1, Kss1).
55Conclusions
- Protein interaction data and expression data are both noisy. Combining them increases the accuracy.
- Can estimate protein interaction error rates by looking at consistency between data sets → probabilistic interaction model (work in progress).
- Pathways are far more interconnected than often portrayed.
- Can integrate various other forms of data
- co-localization of proteins
- homology with known interacting proteins
- Rosetta Stone method
56Acknowledgments
Roland Somogyi, Stefanie Fuhrman, Xiling Wen
NCGR: Jason Stewart, Pedro Mendes
Harvard: Tzachi Pilpel, Martin Steffen, Allegra Petti, John Aach, George Church
UNM: Stephanie Forrest, Andreas Wagner, David Peabody, Barak Pearlmutter
The Santa Fe Institute