Learning gene regulatory networks in Arabidopsis thaliana


1

Learning gene regulatory networks in Arabidopsis
thaliana
Chris Needham, Andy Bulpitt
School of Computing
Iain Manfield, Phil Gilmartin
Institute of Integrative and Comparative Biology
David Westhead
Institute of Molecular and Cellular Biology

2
Gene Regulatory Networks

GRNs govern the functional development and biological processes of cells in all organisms.
GRNs are a representation that encapsulates all information about gene regulation, incorporating time, conditions and development.
We aim to learn transcription networks for components of Arabidopsis thaliana from gene expression microarray data.

3
Gene Expression Microarrays
Figure: DNA is transcribed to mRNA, and mRNA is translated to protein. Microarrays measure mRNA levels, giving a data matrix of genes by experiments.

4
Arabidopsis thaliana

Plants are important.
Arabidopsis:
is the best-annotated plant (though poorly annotated relative to yeast)
has an excellent, large, uniform microarray dataset
has a large genome of 30,000 genes, with many large gene families and duplications
has many mutants, but their analysis is often not very successful
has many transcription factors (TFs): what do they do?
even well-characterised TFs are not fully characterised

5
Arabidopsis GATA Factor genes
Figure panels: night-phased clock regulation; light up-regulated; day-phased clock regulation; light down-regulated.
Inconsistent clock regulation of GATA2 and GATA4 between experiments.

6
Biological approach

The experimental biological work involved in discovering regulatory networks is hard and expensive:
mutants in TFs
microarray experiments
time course experiments
How do poorly-characterised genes fit into well-characterised networks, such as light up-regulation, light down-regulation, clock and abiotic stress?

What can we get from the existing data?

7
Informatics approaches
Ordinary differential equations: dynamical systems
Boolean networks: logical relations between genes
Bayesian networks: modelling a stochastic system
Friedman. Inferring cellular networks using probabilistic graphical models. Science 303(6), 2004. Review article.
Imoto et al. Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. CSB 2003. Incorporates prior knowledge from protein-protein interactions, protein-DNA interactions, gene networks and literature. Analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes, mainly transcription factors.
Sachs et al. Causal protein signalling networks derived from multi-parameter single-cell data. Science 308(5721), 2005.

8

Meaningful gene regulatory networks can be
learned from microarray data
without interventions
but using large datasets
publicly available
start to design before extra data collection

9
Data: Arabidopsis thaliana

2466 microarrays (NASC), 25,000 genes
Filtering
Genes with low entropy are removed.
Can select a subset of genes to consider
Quantisation
Expression signal values discretised into 2 or 3
classes.
Boundaries chosen to create classes with equal
probability masses.

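Figure: GATA2 (AT2G45050) expression quantised into three classes containing 825, 819 and 822 of the 2466 arrays, with class boundaries at signal values 21.9 and 48.6.
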
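The filtering and quantisation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the fixed-width histogram used to estimate entropy, and the example signal values are all assumptions made for the sketch.

```python
import math
from collections import Counter

def histogram_entropy(values, n_bins=10):
    """Shannon entropy (bits) of a fixed-width histogram of the signal.

    Assumption: the slides do not say how entropy is estimated; a simple
    fixed-width histogram is used here for illustration.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a gene that never changes carries no information
    width = (hi - lo) / n_bins
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def equal_mass_boundaries(values, n_classes=3):
    """Boundaries that split the sorted values into classes of (nearly) equal size."""
    s = sorted(values)
    return [s[len(s) * k // n_classes] for k in range(1, n_classes)]

def quantise(values, boundaries):
    """Class index of each value = number of boundaries it reaches or exceeds."""
    return [sum(v >= b for b in boundaries) for v in values]

# Hypothetical signal values for one gene across six arrays.
signal = [5.0, 12.1, 30.2, 47.0, 60.3, 80.9]
boundaries = equal_mass_boundaries(signal, n_classes=3)
classes = quantise(signal, boundaries)
print(boundaries, classes)
```

Note that entropy has to be estimated before equal-mass quantisation: once every gene is split into equally populated classes, all genes have the same class entropy.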
10
Bayesian networks

BNs are a framework for explaining causal
relationships consisting of a set of variables
connected by a set of directed edges
Probability calculus is used to describe the
probabilistic relationship of each variable with
its parents

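The joint probability distribution over all the variables can be written as a product of conditional probability distributions:
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}_i)$$
where $\mathrm{pa}_i$ are the parents of $x_i$.

For the example network:
$$p(x_1, \ldots, x_7) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4 \mid x_1, x_2, x_3)\,p(x_5 \mid x_1, x_3)\,p(x_6 \mid x_4)\,p(x_7 \mid x_4, x_5)$$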
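As a rough check of the factorisation for the seven-variable example, the sketch below builds the same parent sets, fills the conditional probability tables with random numbers, and confirms that the product of conditionals sums to one over all states. The binary variables and the random table entries are assumptions for illustration only.

```python
import itertools
import random

# Parent sets read off the seven-variable example factorisation.
PARENTS = {1: (), 2: (), 3: (), 4: (1, 2, 3), 5: (1, 3), 6: (4,), 7: (4, 5)}

def random_cpts(parents, seed=0):
    """One entry per parent configuration: p(x_node = 1 | parents = cfg)."""
    rng = random.Random(seed)
    return {
        node: {cfg: rng.uniform(0.1, 0.9)
               for cfg in itertools.product((0, 1), repeat=len(pa))}
        for node, pa in parents.items()
    }

def joint(assignment, parents, cpts):
    """p(x_1, ..., x_n) as the product of p(x_i | pa_i)."""
    p = 1.0
    for node, pa in parents.items():
        p_on = cpts[node][tuple(assignment[q] for q in pa)]
        p *= p_on if assignment[node] == 1 else 1.0 - p_on
    return p

cpts = random_cpts(PARENTS)
nodes = list(PARENTS)
total = sum(
    joint(dict(zip(nodes, values)), PARENTS, cpts)
    for values in itertools.product((0, 1), repeat=len(nodes))
)
n_parameters = sum(len(table) for table in cpts.values())
print(total, n_parameters)
```

With binary variables the factorised model needs 21 free parameters, against 127 for an unrestricted joint table over seven variables.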
11
Conditional Probability Distributions
$p(x_i \mid \mathrm{pa}_i)$
Conditional probability tables for GATA4
Marginal probabilities for GATA4
12
Structure Learning

Aim is to find the model (network structure) that has the maximum likelihood for a given set of genes (nodes).
For a given set of genes, the likelihood $L = p(D \mid \hat{\theta}_S, S)$ is the probability of the data $D$ being generated by the model.

To search for a good model structure, a greedy
learning algorithm is used. From an initial
network, edges are added, reversed or deleted
until an optimum is reached.

Learned structure:
$$S^{*} = \arg\max_{S} \left[ \ln p(D \mid \hat{\theta}_S, S) - \tfrac{1}{2}\, d \ln N \right]$$
The BIC score combines a measure of how well the model fits the data with a penalty term that penalises model complexity. $\hat{\theta}_S$ is an estimate of the model parameters for the structure $S$, $d$ is the number of model parameters, and $N$ is the size of the dataset.
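A minimal sketch of BIC-scored greedy search over discrete data is given below. It is an illustration under stated assumptions rather than the implementation used in the talk: the function names, the empty starting network, the synthetic three-variable data (C is a noisy AND of A and B) and the tie-breaking rule are all invented for the example.

```python
import math
import random
from collections import Counter
from itertools import permutations

def family_bic(data, child, parents, arity=2):
    """One node's BIC term: maximised log-likelihood minus (d / 2) ln N."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / parent_counts[cfg]) for (cfg, _), c in joint.items())
    d = (arity - 1) * arity ** len(parents)  # free parameters in this table
    return loglik - 0.5 * d * math.log(len(data))

def bic(data, structure, arity=2):
    """structure maps each node to the set of its parents."""
    return sum(family_bic(data, v, sorted(pa), arity) for v, pa in structure.items())

def is_acyclic(structure):
    """Repeatedly strip parentless nodes; a cycle leaves nodes that never become roots."""
    remaining = {v: set(pa) for v, pa in structure.items()}
    while remaining:
        roots = [v for v, pa in remaining.items() if not pa]
        if not roots:
            return False
        for r in roots:
            del remaining[r]
        for pa in remaining.values():
            pa.difference_update(roots)
    return True

def neighbours(structure):
    """Yield every acyclic structure one edge addition, deletion or reversal away."""
    for a, b in permutations(structure, 2):
        s = {v: set(pa) for v, pa in structure.items()}
        if a in s[b]:                      # edge a -> b exists
            s[b].remove(a)
            yield s                        # deletion
            r = {v: set(pa) for v, pa in s.items()}
            r[a].add(b)
            if is_acyclic(r):
                yield r                    # reversal to b -> a
        elif b not in s[a]:                # no edge between a and b yet
            s[b].add(a)
            if is_acyclic(s):
                yield s                    # addition of a -> b

def greedy_search(data, variables, arity=2):
    """Hill-climb from the empty network until no single edge change improves BIC."""
    current = {v: set() for v in variables}
    best = bic(data, current, arity)
    while True:
        scored = [(bic(data, s, arity), i, s) for i, s in enumerate(neighbours(current))]
        top, _, candidate = max(scored)
        if top <= best + 1e-9:
            return current, best
        current, best = candidate, top

# Synthetic data (an assumption, not from the slides): C is a noisy AND of A and B.
rng = random.Random(0)
data = []
for _ in range(2000):
    a, b = rng.randint(0, 1), rng.randint(0, 1)
    c = (a & b) if rng.random() > 0.05 else 1 - (a & b)
    data.append({"A": a, "B": b, "C": c})

learned, score = greedy_search(data, ["A", "B", "C"])
print({child: sorted(pa) for child, pa in learned.items()}, round(score, 1))
```

Greedy search only guarantees a local optimum, and structures that encode the same conditional independences receive the same BIC score, so the direction of some learned edges can depend on tie-breaking.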
13
Conditional Independence

The different structures encode the conditional
independences between the genes.
Causality: the directionality of the arrows can be determined when they lead into a v-structure, where the gene at the v depends on all of its parents.
Otherwise, the direction of the causal relation between genes cannot be discovered from data alone. Interventions can be used, i.e. test using mutants in the respective genes to see which gene is mis-regulated (at the level of transcripts) in which mutant.
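The role of the v-structure can be made concrete with three genes $A$, $B$ and $C$. This is the standard Markov-equivalence argument for Bayesian networks, written out here as an illustration rather than taken from the slides:

```latex
\begin{align*}
A \to B \to C:\quad           & p(a,b,c) = p(a)\,p(b \mid a)\,p(c \mid b) \\
A \leftarrow B \leftarrow C:\quad & p(a,b,c) = p(c)\,p(b \mid c)\,p(a \mid b) \\
A \leftarrow B \to C:\quad    & p(a,b,c) = p(b)\,p(a \mid b)\,p(c \mid b) \\
A \to B \leftarrow C:\quad    & p(a,b,c) = p(a)\,p(c)\,p(b \mid a, c)
\end{align*}
```

The first three factorisations all encode the single independence $A \perp C \mid B$ and can represent exactly the same distributions, so observational data cannot choose between them. Only the last one, the v-structure, encodes $A \perp C$ marginally, which is why the arrows into $B$ can be oriented from data.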

14
Method
An initial set of key genes of interest is chosen and a network structure inferred, e.g. for circadian clock regulated genes.
To this model a number of genes may be added: either all genes, or a selection. Genes are added separately.
The structure learning algorithm is applied to each set of genes, finding the GRN which is most likely to have generated the data.
The best network structure is chosen, and the gene is added to the model.
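The loop described above might look roughly like this. It is only a sketch: `rank_candidates`, `toy_learner`, the gene names and the scores are all hypothetical, and the slides do not say how scores are compared across different gene sets, so the sketch simply ranks by whatever score the learner returns.

```python
def rank_candidates(core_genes, candidates, learn_structure):
    """Learn a network for core genes plus each candidate separately; best score first.

    learn_structure is any callable taking a list of gene names and returning
    (structure, score); it stands in for the structure learning algorithm.
    """
    results = []
    for gene in candidates:
        structure, score = learn_structure(list(core_genes) + [gene])
        results.append((score, gene, structure))
    results.sort(key=lambda item: item[0], reverse=True)
    return results

# Toy stand-in for a real structure learner, with made-up scores.
TOY_SCORES = {"geneX": -120.0, "geneY": -95.5, "geneZ": -140.2}

def toy_learner(genes):
    candidate = genes[-1]
    structure = {g: set() for g in genes}  # placeholder: no edges are learned
    return structure, TOY_SCORES[candidate]

ranking = rank_candidates(["clock1", "clock2"], list(TOY_SCORES), toy_learner)
best_score, best_gene, best_structure = ranking[0]
print(best_gene, best_score)
```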
15
Results

Meaningful gene regulatory networks can be
learned from microarray data
without interventions
but using large datasets
publicly available
start to design before extra data collection

16
Predictive models
Figure 1. Bayesian network of the transcription network for forty genes identified in light/clock regulation of selected GATAs from the literature.
Figure 2. Given information about the state of a gene's expression level (or that of a set of genes), the marginal probability of any other gene (or set of genes) being in a particular state may be calculated. Fixing the value of a gene (in this case by growing a specific mutant) allows predictions about the likely values of other genes to be made and tested experimentally, to verify the predictive model of the GRN. This figure shows the change in the marginal probability of each gene in Figure 1 (y-axis) when one other gene's value is fixed (x-axis), based on real data and the learned network in Figure 1. Dark values show the greatest expected change in expression levels, whereas white values show little observable change.
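The calculation behind Figure 2 can be sketched on a toy network by brute-force enumeration. Everything specific here is an assumption for illustration: the three-gene chain, the gene names `g1` to `g3` and the probability values are invented, and the sketch conditions on an observed value, which is the simplest reading of "fixing" a gene.

```python
import itertools

# Hypothetical chain g1 -> g2 -> g3 with binary (off/on) expression states.
PARENTS = {"g1": (), "g2": ("g1",), "g3": ("g2",)}
P_ON = {  # p(gene = 1 | parent configuration); made-up numbers
    "g1": {(): 0.5},
    "g2": {(0,): 0.2, (1,): 0.8},
    "g3": {(0,): 0.1, (1,): 0.7},
}

def joint(x):
    """Joint probability of a full assignment, as a product of conditionals."""
    p = 1.0
    for gene, pa in PARENTS.items():
        on = P_ON[gene][tuple(x[q] for q in pa)]
        p *= on if x[gene] == 1 else 1.0 - on
    return p

def prob_on(target, evidence=None):
    """p(target = 1 | evidence), summing the joint over all consistent states."""
    evidence = evidence or {}
    genes = list(PARENTS)
    num = den = 0.0
    for values in itertools.product((0, 1), repeat=len(genes)):
        x = dict(zip(genes, values))
        if any(x[g] != v for g, v in evidence.items()):
            continue
        p = joint(x)
        den += p
        if x[target] == 1:
            num += p
    return num / den

baseline = prob_on("g3")
fixed = prob_on("g3", {"g1": 1})
print(round(baseline, 3), round(fixed, 3), round(fixed - baseline, 3))
```

Enumeration is exponential in the number of genes, so a forty-gene network would need a proper inference algorithm; the toy version is only meant to show what one cell of the change matrix represents.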
17
Future Computation

New structure learning algorithms
Strength of connections
Selecting relevant experiments
Effect of discretisation
Sensitivity to noise

18
Future Biology

We wish to learn GRNs in order to form hypotheses
about possible roles of a gene and likely
redundant genes.
Main aim is to reduce the number of related genes
to be screened for experimental verification of
findings.
Look for mis-regulation of genes predicted to be
downstream of e.g. well characterised regulators.
Make mutants of poorly characterised genes and
look for mis-regulation of gene expression or
other phenotype.
Carry these predictions from this model organism
to a crop plant, e.g. rice, where many of the
regulatory components are conserved.

19
Conclusions
20
Acknowledgments

Paul Devlin, Enrique Lopez (RHUL).
NASCArrays team.
People contributing samples for array analysis at
NASC.
BBSRC, University of Leeds.

Extra slides, and slides pre-empting questions.

22
Benchmarks for assessment of network accuracy
23
Generating testable hypotheses

Can we generate hypotheses using gene expression
data?
Genevestigator Tool
OK for small numbers of genes
ACT Arabidopsis Co-expression Tool
co-expressed ? co-regulated
What is the regulator?

24
Arabidopsis thaliana

Many transcription factors (TFs)
what do they do?
even well characterised TFs are not fully characterised
Many mutants
analysis often not very successful

25
Co-expression and Co-regulation
Promoter motif over-representation
26
GATA factors and abiotic stress
27
Meristem de-etiolation arrays

Etiolated (dark-grown) seedlings.
Time course array analysis of meristem and
cotyledons following illumination.
Expression of selected GATA genes.
Enrique Lopez, Royal Holloway.

28
Co-expression scatter plots
GATA2 and GATA4 are co-expressed with phyA but
not with other phy genes. Expression of most genes shows similar correlation with GATA2 and GATA4, suggesting conservation of expression pattern following gene duplication.
GATA9 and 12 do not show co-expression with any
of the well-characterised genes seen with GATA2
and 4. Divergence of expression following
duplication is indicated by correlation of some
genes with GATA9 but not with GATA12, perhaps
leading to sub-functionalization.
GATA21 (GNC) and 22 are co-expressed with lhy and
cca1, suggesting these GATA genes may fit within
characterised pathways. Expression of most genes
is better correlated with GATA21 than GATA22.
This may reflect a subtle divergence of
expression pattern for these duplicated genes.
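Co-expression of this kind is usually summarised by a Pearson correlation coefficient r between two genes' expression profiles. A minimal sketch follows; the function name and the expression values are made up for illustration and are not data from the arrays.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up expression values for three hypothetical genes across five arrays.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0]   # rises with gene_a
gene_c = [5.0, 3.0, 4.0, 1.0, 2.0]    # tends to fall as gene_a rises

r_ab = pearson_r(gene_a, gene_b)
r_ac = pearson_r(gene_a, gene_c)
print(round(r_ab, 3), round(r_ac, 3))
```

A high r shows that two genes vary together across arrays; on its own it does not identify which gene, if either, regulates the other.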
29
AtGATA gene conservation
30
Expression divergence
31
Phenotypes of GATA mutants
32
Leave one out network learning
Clock genes: CCA1, LHY, TOC1, GI, ELF3, ELF4.
Subsidiary list: CBF1, COL1, PHYA, PIF3, HY5.
33
Networks from expression correlation r-values