Title: Learning gene regulatory networks in Arabidopsis thaliana
1 Learning gene regulatory networks in Arabidopsis thaliana
- Chris Needham, Andy Bulpitt
  - School of Computing
- Iain Manfield, Phil Gilmartin
  - Institute of Integrative and Comparative Biology
- David Westhead
  - Institute of Molecular and Cellular Biology
2 Gene Regulatory Networks
- GRNs govern the functional development and biological processes of cells in all organisms.
- GRNs are a representation that encapsulates all information about gene regulation.
  - Incorporating time, conditions and development.
- We aim to learn transcription networks for components of Arabidopsis thaliana from gene expression microarray data.
3 Gene Expression Microarrays
[Figure: DNA → (transcription) → mRNA → (translation) → protein; microarray data shown as genes × experiments.]
4 Arabidopsis thaliana
- Plants are important
- Arabidopsis
  - is the best annotated plant (though poorly annotated relative to yeast)
  - has an excellent, large, uniform microarray dataset
  - has a large genome of 30,000 genes, with many large gene families and duplications
  - has many mutants
    - analysis often not very successful
  - has many transcription factors (TFs)
    - what do they do?
    - even well-characterised TFs are not fully characterised
5 Arabidopsis GATA Factor genes
[Figure labels: Night-phased Clock regulation; Light Up-regulated; Day-phased Clock regulation; Light Down-regulated.]
Inconsistent Clock regulation of GATA2 and GATA4 between experiments
6 Biological approach
- The experimental biological work involved in discovering regulatory networks is hard and expensive
  - mutants in TFs
  - microarray experiments
  - time course experiments
- How do poorly-characterised genes fit into well-characterised networks, such as
  - Light up-regulation, Light down-regulation, Clock, Abiotic stress?
What can we get from the existing data?
7 Informatics approaches
- Ordinary Differential Equations: Dynamical Systems
- Boolean networks: Logical relations between genes
- Bayesian networks: Modelling a stochastic system
- Friedman. Inferring cellular networks using probabilistic graphical models. Science 303(6). 2004. Review article.
- Imoto et al. Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. CSB 2003.
  - Incorporate prior knowledge from protein-protein interactions, protein-DNA interactions, gene networks and literature.
  - Analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes, mainly transcription factors.
- Sachs et al. Causal protein signalling networks derived from multi-parameter single-cell data. Science 308(5721). 2005.
8
- Meaningful gene regulatory networks can be learned from microarray data
  - without interventions
  - but using large datasets
  - publicly available
  - start to design before extra data collection
9 Data: Arabidopsis thaliana
- 2466 Microarrays (NASC), 25,000 genes
- Filtering
  - Genes with low entropy are removed.
  - Can select a subset of genes to consider
- Quantisation
  - Expression signal values are discretised into 2 or 3 classes.
  - Boundaries are chosen to create classes with equal probability masses.
[Figure: quantisation of GATA2 (AT2G45050) expression into three classes of 825, 819 and 822 arrays, with boundaries at 21.9 and 48.6.]
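The filtering and quantisation steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the slides do not say how entropy was computed, so the shared bin edges, the function names and the signal values below are all assumptions made for the example.

```python
import bisect
import math
import statistics
from collections import Counter

def quantise_equal_mass(values, n_classes=3):
    """Split one gene's expression values into classes of roughly equal size."""
    # statistics.quantiles returns the n_classes - 1 interior cut points.
    boundaries = statistics.quantiles(values, n=n_classes)
    # bisect_right gives the class index (0 .. n_classes - 1) for each value.
    labels = [bisect.bisect_right(boundaries, v) for v in values]
    return labels, boundaries

def entropy_bits(labels):
    """Shannon entropy, in bits, of a list of discrete labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def fixed_bin_labels(values, edges):
    """Label values against bin edges shared by all genes (an assumed choice)."""
    return [bisect.bisect_right(edges, v) for v in values]

# Invented signal values for two hypothetical genes across 12 arrays.
variable_gene = [5, 9, 14, 22, 30, 41, 47, 55, 63, 70, 88, 95]
flat_gene = [20, 21, 20, 22, 21, 20, 21, 22, 20, 21, 22, 21]
shared_edges = [25, 50, 75]  # hypothetical common bin edges

labels, boundaries = quantise_equal_mass(variable_gene, n_classes=3)
h_variable = entropy_bits(fixed_bin_labels(variable_gene, shared_edges))
h_flat = entropy_bits(fixed_bin_labels(flat_gene, shared_edges))
```

Note that entropy measured after per-gene equal-mass quantisation would be close to maximal for every gene, which is why this sketch measures it against shared bins before quantising. A gene such as `flat_gene`, whose values all fall in one bin, has zero entropy and would be filtered out.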
10 Bayesian networks
- BNs are a framework for explaining causal relationships, consisting of a set of variables connected by a set of directed edges
- Probability calculus is used to describe the probabilistic relationship of each variable with its parents
- The joint probability distribution over all the variables can be written as a product of conditional probability distributions
  - $p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}_i)$
  - where $\mathrm{pa}_i$ are the parents of $x_i$
$p(x_1, \dots, x_7) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3)\, p(x_6 \mid x_4)\, p(x_7 \mid x_4, x_5)$
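As a worked illustration of the seven-variable factorisation above, the following Python sketch evaluates the joint probability as a product of one conditional term per node. The parent sets are read off the factorisation; the conditional probability rule `p_on` is invented purely so that the example runs, and the genes are assumed to be binary.

```python
from fractions import Fraction
from itertools import product

# Parent sets taken from the factorisation of p(x1, ..., x7).
PARENTS = {
    "x1": (), "x2": (), "x3": (),
    "x4": ("x1", "x2", "x3"),
    "x5": ("x1", "x3"),
    "x6": ("x4",),
    "x7": ("x4", "x5"),
}

def p_on(parent_values):
    """Invented CPT rule: p(x = 1 | parents) grows with the number of 'on' parents."""
    return Fraction(1 + sum(parent_values), 2 + len(parent_values))

def joint(assignment):
    """p(x1, ..., x7) as the product of p(x_i | pa_i) over all nodes."""
    prob = Fraction(1)
    for node, parents in PARENTS.items():
        p1 = p_on([assignment[p] for p in parents])
        prob *= p1 if assignment[node] == 1 else 1 - p1
    return prob

all_off = {node: 0 for node in PARENTS}
p_all_off = joint(all_off)

# Summing the joint over all 2^7 assignments should give exactly 1.
total = sum(
    joint(dict(zip(PARENTS, states)))
    for states in product((0, 1), repeat=len(PARENTS))
)
```

The check that `total` equals 1 confirms that the product of local conditional distributions is a valid joint distribution, whatever numbers are placed in the tables.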
11 Conditional Probability Distributions
$p(x_i \mid \mathrm{pa}_i)$
Conditional probability tables for GATA4
Marginal probabilities for GATA4
12 Structure Learning
- Aim is to find the model (network structure) that has the maximum likelihood for a given set of genes (nodes)
- For a given set of genes, the likelihood $L = P(D \mid S, \hat{\theta}_S)$ is the probability of the data $D$ being generated by the model
- To search for a good model structure, a greedy learning algorithm is used. From an initial network, edges are added, reversed or deleted until an optimum is reached.
Learned structure: $\hat{S} = \arg\max_S \left[ \ln p(D \mid \hat{\theta}_S, S) - \tfrac{1}{2}\, d \ln N \right]$
The BIC score combines a measure of how well the model fits the data with a penalty term that penalises model complexity. $\hat{\theta}_S$ is an estimate of the model parameters for the structure $S$, $d$ is the number of model parameters, and $N$ is the size of the dataset.
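The BIC score described above can be computed directly from counts when the data are discrete and fully observed. The sketch below is one plausible way to do this, not the authors' implementation: the data layout (a list of dicts), the function name and the toy data are assumptions, and the greedy search over edge additions, reversals and deletions is not shown.

```python
import math
from collections import Counter

def bic_score(data, structure, n_states):
    """BIC = ln p(D | theta_hat, S) - (d / 2) ln N for discrete data.

    data:      list of dicts mapping gene name -> discrete state
    structure: dict mapping gene name -> tuple of parent gene names
    n_states:  number of classes each gene was discretised into
    """
    n = len(data)
    log_lik = 0.0
    d = 0
    for gene, parents in structure.items():
        # Counts of (parent configuration, gene state) and of parent configuration.
        joint_counts = Counter(
            (tuple(row[p] for p in parents), row[gene]) for row in data
        )
        parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
        # Maximum-likelihood CPT entries are count ratios.
        for (config, _state), count in joint_counts.items():
            log_lik += count * math.log(count / parent_counts[config])
        # Free parameters: (states - 1) for each parent configuration.
        d += (n_states - 1) * n_states ** len(parents)
    return log_lik - 0.5 * d * math.log(n)

# Toy data: two hypothetical binary genes that always agree.
linked = [{"A": s, "B": s} for s in (0, 1)] * 4
# Toy data: the same genes taking every combination equally often.
unlinked = [{"A": a, "B": b} for a in (0, 1) for b in (0, 1)] * 2

no_edge = {"A": (), "B": ()}
with_edge = {"A": (), "B": ("A",)}
```

On the `linked` data the structure with an edge scores higher, because the gain in fit outweighs the extra parameter. On the `unlinked` data the edge adds a parameter without improving the fit, so the penalty term makes the empty structure win.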
13 Conditional Independence
- The different structures encode the conditional independences between the genes.
- Causality: the directionality of the arrows can be determined when they lead into a v-structure, where the gene at the v depends on all of its parents.
- Otherwise, the direction of the causal relation between genes cannot be discovered from data alone. Interventions can be used.
  - i.e. test using mutants in the respective genes to see which gene is mis-regulated in which mutant (transcript levels).
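The reason a v-structure is detectable from data can be shown with a small exact calculation. In the sketch below, two hypothetical parent genes A and B are independent, and a child gene C is mostly on when either parent is on. The noise level and the noisy-OR rule are invented for the example.

```python
from fractions import Fraction
from itertools import product

def collider_joint(noise=Fraction(1, 10)):
    """Exact joint p(a, b, c) for the v-structure A -> C <- B (binary genes)."""
    joint = {}
    for a, b, c in product((0, 1), repeat=3):
        # A and B are independent, each 'on' with probability 1/2.
        p_ab = Fraction(1, 4)
        # Noisy OR: C is 'on' with high probability if either parent is 'on'.
        p_c_on = 1 - noise if (a or b) else noise
        joint[(a, b, c)] = p_ab * (p_c_on if c == 1 else 1 - p_c_on)
    return joint

def prob(joint, **fixed):
    """Sum the joint over entries matching the fixed values, e.g. prob(j, a=1, c=1)."""
    index = {"a": 0, "b": 1, "c": 2}
    return sum(
        p for states, p in joint.items()
        if all(states[index[name]] == value for name, value in fixed.items())
    )

j = collider_joint()

# Marginally, A and B are independent: p(a, b) = p(a) p(b).
marginal_independent = prob(j, a=1, b=1) == prob(j, a=1) * prob(j, b=1)

# Given C, they become dependent: p(a, b | c) != p(a | c) p(b | c).
p_c = prob(j, c=1)
p_ab_given_c = prob(j, a=1, b=1, c=1) / p_c
p_a_given_c = prob(j, a=1, c=1) / p_c
p_b_given_c = prob(j, b=1, c=1) / p_c
dependent_given_child = p_ab_given_c != p_a_given_c * p_b_given_c
```

This pattern (independent parents that become dependent once the child is known) is specific to the v-structure. A chain or a common-cause arrangement of the same three genes would show the opposite pattern, which is why those alternatives cannot be told apart from each other without interventions.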
14 Method
An initial set of key genes of interest is chosen and a network structure inferred, e.g. circadian clock regulated genes.
To this model a number of genes may be added. Genes are added separately: either all genes, or a selection.
The structure learning algorithm is applied to each set of genes, finding the GRN which is most likely to have generated the data.
The best network structure is chosen, and the gene is added to the model.
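The procedure above can be outlined as a loop over candidate genes. This is a rough sketch under stated assumptions: `learn_structure` is a hypothetical callable standing in for the structure learning algorithm, the scores are invented, and the slide leaves open whether one best gene is kept or every candidate gets its own extended network, so the function returns a result for each candidate.

```python
def learn_candidates(core_genes, candidates, learn_structure):
    """Learn one network per candidate gene, each added separately to the core set.

    learn_structure is a hypothetical callable that takes a list of gene names
    and returns (structure, score), where a higher score means a better fit.
    """
    results = {}
    for gene in candidates:
        structure, score = learn_structure(list(core_genes) + [gene])
        results[gene] = (structure, score)
    return results

# Stand-in learner with invented scores, used only to make the sketch runnable.
TOY_SCORES = {"GATA2": -120.5, "GATA4": -118.0}

def toy_learner(genes):
    candidate = genes[-1]
    structure = {g: () for g in genes}  # placeholder: a network with no edges
    return structure, TOY_SCORES[candidate]

core = ["CCA1", "LHY", "TOC1"]
results = learn_candidates(core, ["GATA2", "GATA4"], toy_learner)

# One possible reading of the final step: keep the best-scoring candidate.
best_gene = max(results, key=lambda g: results[g][1])
```

Passing the learner in as an argument keeps the loop independent of the scoring method, so a BIC-based greedy search could be substituted for `toy_learner` without changing the surrounding code.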
15 Results
- Meaningful gene regulatory networks can be learned from microarray data
  - without interventions
  - but using large datasets
  - publicly available
  - start to design before extra data collection
16 Predictive models
Figure 2. Given information about the state of a gene's expression level (or that of a set of genes), the marginal probability of any other gene (or set of genes) being in a particular state may be calculated. Fixing the value of a gene (in this case by growing a specific mutant) allows predictions about the likely values of other genes to be made and tested experimentally, to verify the predictive model of the GRN. This figure shows the change in marginal likelihood of each gene in Figure 1 (y-axis) when one other gene's value is fixed (x-axis), based on real data and the learned network in Figure 1. Dark values show the greatest expected change in expression levels, whereas white values show little observable change.
Figure 1. Bayesian network of the transcription network for forty genes identified in light/clock regulation of selected GATAs from the literature.
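The kind of calculation behind Figure 2 can be illustrated on a very small network. The sketch below uses a three-gene chain with invented probability tables and computes how the marginal probability of one gene changes when another gene's value is fixed. It conditions on the fixed value by exact enumeration, which is an assumption about how the figure was produced.

```python
from fractions import Fraction
from itertools import product

GENES = ("A", "B", "C")
PARENTS = {"A": (), "B": ("A",), "C": ("B",)}

# p(gene = 1 | parent states); all numbers are invented for illustration.
P_ON = {
    "A": {(): Fraction(1, 2)},
    "B": {(0,): Fraction(1, 5), (1,): Fraction(4, 5)},
    "C": {(0,): Fraction(1, 10), (1,): Fraction(7, 10)},
}

def joint(state):
    """Joint probability of a full assignment, as a product of CPT entries."""
    prob = Fraction(1)
    for gene in GENES:
        p1 = P_ON[gene][tuple(state[p] for p in PARENTS[gene])]
        prob *= p1 if state[gene] == 1 else 1 - p1
    return prob

def marginal_on(target, evidence=None):
    """p(target = 1 | evidence), summing the joint over all consistent states."""
    evidence = evidence or {}
    numerator = denominator = Fraction(0)
    for values in product((0, 1), repeat=len(GENES)):
        state = dict(zip(GENES, values))
        if any(state[g] != v for g, v in evidence.items()):
            continue
        p = joint(state)
        denominator += p
        if state[target] == 1:
            numerator += p
    return numerator / denominator

baseline = marginal_on("C")
with_a_fixed = marginal_on("C", {"A": 1})
change = with_a_fixed - baseline
```

Repeating this for every pair of genes would give a matrix of changes like the one the figure describes. Enumeration is only practical for small networks; a forty-gene network would need a proper inference algorithm.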
17 Future Computation
- New structure learning algorithms
- Strength of connections
- Selecting relevant experiments
- Effect of discretisation
- Sensitivity to noise
18 Future Biology
- We wish to learn GRNs in order to form hypotheses about possible roles of a gene and likely redundant genes.
- Main aim is to reduce the number of related genes to be screened for experimental verification of findings.
- Look for mis-regulation of genes predicted to be downstream of e.g. well-characterised regulators.
- Make mutants of poorly characterised genes and look for mis-regulation of gene expression or other phenotype.
- Carry these predictions from this model organism to a crop plant, e.g. rice, where many of the regulatory components are conserved.
19 Conclusions
20 Acknowledgments
- Paul Devlin, Enrique Lopez (RHUL).
- NASCArrays team.
- People contributing samples for array analysis at NASC.
- BBSRC, University of Leeds.
21 Extra slides and slides pre-empting questions
22 Benchmarks for assessment of network accuracy
23 Generating testable hypotheses
- Can we generate hypotheses using gene expression data?
- Genevestigator Tool
  - OK for small numbers of genes
- ACT: Arabidopsis Co-expression Tool
- co-expressed ? co-regulated
- What is the regulator?
24 Arabidopsis thaliana
- Many transcription factors (TFs)
  - what do they do?
  - even well-characterised TFs are not fully characterised
- Many mutants
  - analysis often not very successful
25 Co-expression and Co-regulation
Promoter motif over-representation
26 GATA factors and abiotic stress
27 Meristem de-etiolation arrays
- Etiolated (dark-grown) seedlings.
- Time course array analysis of meristem and cotyledons following illumination.
- Expression of selected GATA genes.
- Enrique Lopez, Royal Holloway.
28 Co-expression scatter plots
GATA2 and GATA4 are co-expressed with phyA but not with other phy genes. Expression of most genes shows a similar correlation with GATA2 and with GATA4, suggesting conservation of expression pattern following gene duplication.
GATA9 and GATA12 do not show co-expression with any of the well-characterised genes seen with GATA2 and GATA4. Divergence of expression following duplication is indicated by correlation of some genes with GATA9 but not with GATA12, perhaps leading to sub-functionalization.
GATA21 (GNC) and GATA22 are co-expressed with lhy and cca1, suggesting these GATA genes may fit within characterised pathways. Expression of most genes is better correlated with GATA21 than with GATA22. This may reflect a subtle divergence of expression pattern for these duplicated genes.
29 AtGATA gene conservation
30 Expression divergence
31 Phenotypes of GATA mutants
32 Leave one out network learning
Clock genes: CCA1, LHY, TOC1, GI, ELF3, ELF4.
Subsidiary list: CBF1, COL1, PHYA, PIF3, HY5.
33 Networks from expression correlation r-values
- A set of genes from a microarray experiment.
- Find the r-values for correlation of expression between all these genes.
- Connect genes with high r-values.
- Gordon Breen, University of Bristol
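The correlation approach listed above is simple enough to sketch in full. The example below computes Pearson r-values between all pairs of genes and connects those above a cut-off. The gene names, expression values and the 0.8 threshold are invented; the slide does not say what counts as a high r-value or whether strong negative correlations are also connected.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_edges(expression, threshold=0.8):
    """Return (gene1, gene2, r) for every gene pair with r at or above the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson_r(expression[g1], expression[g2])
        if r >= threshold:  # use abs(r) to also connect anti-correlated genes
            edges.append((g1, g2, r))
    return edges

# Invented expression values for three hypothetical genes over four arrays.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 1.0, 3.0, 2.0],
}
edges = correlation_edges(expression)
```

Unlike a Bayesian network, the resulting graph is undirected and cannot distinguish direct from indirect relationships: two genes driven by a shared regulator will be connected even if neither regulates the other.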
34 Results
- Meaningful gene regulatory networks can be learned from microarray data
  - without interventions
  - but using large datasets
  - publicly available
  - start to design before extra data collection
35 Well-characterised networks
- Light up-regulation
- Light down-regulation
- Clock
- Abiotic stress
- How do poorly-characterised genes fit into these networks?