Title: Identifying co-regulation using Probabilistic Relational Models
Slide 1: Identifying co-regulation using Probabilistic Relational Models
- by Christoforos Anagnostopoulos
- BA Mathematics, Cambridge University
- MSc Informatics, Edinburgh University
supervised by Dirk Husmeier
Slide 2: The General Problem
- Bringing together disparate data sources
Promoter sequence data: ...ACGTTAAGCCAT... ...GGCATGAATCCC...
Slide 3: The General Problem
- Bringing together disparate data sources
Promoter sequence data: ...ACGTTAAGCCAT... ...GGCATGAATCCC...
Gene expression data (mRNA): gene 1 overexpressed, gene 2 overexpressed, ...
Slide 4: The General Problem
- Bringing together disparate data sources
Promoter sequence data: ...ACGTTAAGCCAT... ...GGCATGAATCCC...
Gene expression data (mRNA): gene 1 overexpressed, gene 2 overexpressed, ...
Protein interaction data (proteins):
protein 1 | protein 2 | ORF 1   | ORF 2
AAC1      | TIM10     | YMR056C | YHR005CA
AAD6      | YNL201C   | YFL056C | YNL201C
Slide 5: Our Data
Promoter sequence data: ...ACGTTAAGCCAT... ...GGCATGAATCCC...
Gene expression data (mRNA): gene 1 overexpressed, gene 2 overexpressed, ...
Slide 6: Bayesian Modelling Framework
Slide 7: Bayesian Modelling Framework
- Conditional independence assumptions
- Factorisation of the joint probability distribution
- Unified training
Slide 8: Bayesian Modelling Framework
Probabilistic Relational Models
Slide 9: Aims for This Presentation
- Briefly present the Segal model and the main criticisms offered in the thesis
- Briefly introduce PRMs
- Outline directions for future work
Slide 10: The Segal Model
- Cluster genes into transcriptional modules...
[Figure: a gene, with a question mark, to be assigned to Module 1 or Module 2]
Slide 11: The Segal Model
[Figure: the gene is assigned to Module 1 with probability P(M = 1) and to Module 2 with probability P(M = 2)]
Slide 12: The Segal Model
[Figure: the gene is assigned to Module 1 with probability P(M = 1)]
Slide 13: The Segal Model
Motif profile: motif 3 active, motif 4 very active, motif 16 very active, motif 29 slightly active
[Figure: the motif profile attached to Module 1 and the gene]
Slide 14: The Segal Model
Predicted expression levels: array 1 overexpressed, array 2 overexpressed, array 3 underexpressed, ...
Motif profile: motif 3 active, motif 4 very active, motif 16 very active, motif 29 slightly active
[Figure: Module 1 links the gene's motif profile to its predicted expression levels]
Slide 15: The Segal Model
Predicted expression levels: array 1 overexpressed, array 2 overexpressed, array 3 underexpressed, ...
Motif profile: motif 3 active, motif 4 very active, motif 16 very active, motif 29 slightly active
[Figure: the full chain, with the gene assigned to Module 1 with probability P(M = 1)]
Slide 16: The Segal Model
PROMOTER SEQUENCE
Slide 17: The Segal Model
PROMOTER SEQUENCE → MOTIF PRESENCE
Slide 18: The Segal Model
PROMOTER SEQUENCE → [MOTIF MODEL] → MOTIF PRESENCE
Slide 19: The Segal Model
MOTIF PRESENCE → MODULE ASSIGNMENT
Slide 20: The Segal Model
MOTIF PRESENCE → [REGULATION MODEL] → MODULE ASSIGNMENT
Slide 21: The Segal Model
MODULE ASSIGNMENT → EXPRESSION DATA
Slide 22: The Segal Model
MODULE ASSIGNMENT → [EXPRESSION MODEL] → EXPRESSION DATA
Slide 23: Learning via Hard EM
[Figure: the model with its hidden variables highlighted]
Slide 24: Learning via Hard EM
Initialise hidden variables
Slide 25: Learning via Hard EM
Initialise hidden variables
Set parameters to Maximum Likelihood
Slide 26: Learning via Hard EM
Initialise hidden variables
Set parameters to Maximum Likelihood
Set hidden values to their most probable value
given the parameters (hard EM)
Slide 27: Learning via Hard EM
Initialise hidden variables
Set parameters to their maximum-likelihood values
Set hidden values to their most probable value given the parameters (hard EM)
Iterate the last two steps until convergence
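The hard EM loop above can be sketched in a few lines. This is a generic illustration, not the Segal implementation: the module-specific likelihood is stubbed with a toy one-dimensional unit-variance Gaussian per module, so the M-step reduces to a mean and the hard E-step to a nearest-mean assignment.

```python
def hard_em(data, k, iters=20):
    """Generic hard EM: alternate ML parameter fitting with hard
    assignment of the hidden module labels."""
    # Initialise hidden variables (deterministic round-robin assignment).
    z = [i % k for i in range(len(data))]
    means = [0.0] * k
    for _ in range(iters):
        # M-step: set parameters to their maximum-likelihood values
        # (toy model: one unit-variance Gaussian mean per module).
        for m in range(k):
            members = [x for x, zi in zip(data, z) if zi == m]
            if members:
                means[m] = sum(members) / len(members)
        # Hard E-step: set each hidden value to its single most
        # probable setting given the current parameters.
        z = [min(range(k), key=lambda m: (x - means[m]) ** 2) for x in data]
    return means, z

means, z = hard_em([0.1, 0.2, 0.0, 5.1, 4.9, 5.0], k=2)
```

With two well-separated groups the loop converges after a couple of iterations; the hard assignment is what distinguishes this from soft EM, which would instead keep a full posterior over module labels.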
Slide 28: Motif Model
Objective: learn the motif so as to discriminate between genes for which the regulation variable is on (r = 1) and genes for which it is off (r = 0).
Slide 29: Motif Model: Scoring Scheme
...CATTCC... → high score
...TGACAA... → low score
Slide 30: Motif Model: Scoring Scheme
...CATTCC... → high score
...TGACAA... → low score
High-scoring subsequences within the promoter: ...AGTCCATTCCGCCTCAAG...
Slide 31: Motif Model: Scoring Scheme
...CATTCC... → high score
...TGACAA... → low score
High-scoring subsequences: ...AGTCCATTCCGCCTCAAG...
Low-scoring (background) subsequences
Slide 32: Motif Model: Scoring Scheme
...CATTCC... → high score
...TGACAA... → low score
High-scoring subsequences: ...AGTCCATTCCGCCTCAAG...
Low-scoring (background) subsequences
Promoter sequence scoring
Slide 33: Motif Model
Scoring scheme: P(g.r = true | g.S, w)
The parameter set w can be taken to represent motifs.
Slide 34: Motif Model
Scoring scheme: P(g.r = true | g.S, w)
The parameter set w can be taken to represent motifs.
Maximum-likelihood setting → most discriminatory motif
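The scoring idea can be illustrated with a standard PSSM log-odds score against a uniform background. This is a generic sketch, not the exact Segal scoring function, and the PSSM probabilities below are made up for illustration: each column puts most of its mass on one consensus base of CATTCC, so windows containing the motif outscore background windows.

```python
import math

def pssm_score(window, pssm):
    """Log-odds score of one window under the motif model vs a
    uniform (0.25 per base) background."""
    return sum(math.log(pssm[i][base] / 0.25) for i, base in enumerate(window))

def best_window_score(seq, pssm):
    """Scan the promoter sequence and return the best-scoring window."""
    w = len(pssm)
    return max(pssm_score(seq[i:i + w], pssm) for i in range(len(seq) - w + 1))

def column(fav, p=0.85):
    # Hypothetical column: probability p on the favoured base,
    # the remainder spread over the other three.
    other = (1 - p) / 3
    return {b: (p if b == fav else other) for b in "ACGT"}

# Hypothetical PSSM for the consensus CATTCC (one column per position).
pssm = [column(b) for b in "CATTCC"]

hi = best_window_score("AGTCCATTCCGCCTCAAG", pssm)  # contains CATTCC
lo = best_window_score("AGTCTGACAAGCCTCAAG", pssm)  # background-like
```

Maximising the likelihood of such a score with respect to the column probabilities is what pushes the learnt PSSM towards the most discriminatory motif, and also what opens the door to the overfitting discussed on the next slides.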
Slide 35: Motif Model: Overfitting
[Figure: the true PSSM]
Slide 36: Motif Model: Overfitting
Typical motif instance: ...TTT.CATTCC... → high score under the true PSSM
Slide 37: Motif Model: Overfitting
Typical motif instance: ...TTT.CATTCC... → high score under the true PSSM
The inferred PSSM can triple the score!
Slide 38: Regulation Model
For each module m and each motif i, we estimate the association u_mi.
P(g.M = m | g.R) is proportional to exp(Σ_i u_mi · g.R_i)
Slide 39: Regulation Model: Geometrical Interpretation
The vectors (u_mi)_i define separating hyperplanes. The classification criterion is the inner product: each datapoint is given the label of the hyperplane it lies furthest from, on its positive side.
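Under this geometric reading, module assignment is a softmax over inner products with the hyperplane normals. A minimal sketch, with hypothetical association weights u_mi rather than weights learnt from data:

```python
import math

def module_posterior(r, U):
    """P(M = m | r) proportional to exp(<u_m, r>): a softmax over
    the inner products with each module's weight vector u_m."""
    scores = [sum(u_i * r_i for u_i, r_i in zip(u, r)) for u in U]
    z = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - z) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical associations u_mi: 2 modules over 3 motifs.
U = [[2.0, 0.0, -1.0],   # module 0: rewards motif 1, penalises motif 3
     [-1.0, 2.0, 1.0]]   # module 1: rewards motifs 2 and 3

p = module_posterior([1, 0, 0], U)       # gene with only motif 1 present
label = max(range(len(p)), key=p.__getitem__)
```

The datapoint gets the label of the module whose inner product (signed distance from the hyperplane, up to normalisation) is largest, which is exactly the "furthest on the positive side" criterion; when the data are pairwise linearly separable, nothing stops these inner products from growing without bound, producing the overconfident classification of the next slide.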
Slide 40: Regulation Model: Divergence and Overfitting
Pairwise linear separability → overconfident classification
Method A: dampen the parameters (e.g. a Gaussian prior)
Method B: make the dataset linearly inseparable by augmentation
Slide 41: Erroneous Interpretation of the Parameters
Segal et al. claim that:
- when u_mi = 0, motif i is inactive in module m;
- when u_mi > 0 for all i, m, only the presence of motifs is significant, not their absence.
Slide 42: Erroneous Interpretation of the Parameters
Segal et al. claim that:
- when u_mi = 0, motif i is inactive in module m;
- when u_mi > 0 for all i, m, only the presence of motifs is significant, not their absence.
Both claims contradict the normalisation conditions!
Slide 43: Sparsity
[Figure: the inferred process vs. the true process]
Slide 44: Sparsity
Reconceptualise the problem:
- Sparsity can be understood as pruning.
- Pruning can improve generalisation performance: it deals with overfitting both by damping and by decreasing the degrees of freedom.
- Pruning ought not be seen as a combinatorial problem; it can be dealt with via appropriate prior distributions.
Slide 45: Sparsity: the Laplacian
How to prune using a prior:
- Choose a prior with a simple discontinuity at the origin, so that the penalty term does not vanish near the origin.
- Every time a parameter crosses the origin, establish whether it will escape the origin or is trapped in Brownian motion around it.
- If trapped, force both its gradient and value to 0 and freeze it.
- One can actively look for nearby zeros to accelerate the pruning rate.
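An illustrative implementation of this pruning rule, not the thesis code: one gradient step under a Laplace prior (penalty λ|u_i|, whose subgradient has constant magnitude λ near the origin), where a parameter that crosses the origin is tested with the standard escape condition that the data gradient must beat the penalty, and is otherwise clamped to zero and frozen.

```python
def l1_prune_step(u, grad, lam, lr, frozen):
    """One gradient step with a Laplace (L1) penalty and origin-freezing.

    u      : current parameter values
    grad   : gradient of the negative log-likelihood w.r.t. u
    lam    : strength of the Laplace prior (penalty lam * |u_i|)
    lr     : learning rate
    frozen : set of indices already pinned to zero
    """
    new_u = []
    for i, (ui, gi) in enumerate(zip(u, grad)):
        if i in frozen:
            new_u.append(0.0)                    # stay pruned
            continue
        sign = 1.0 if ui > 0 else -1.0
        step = ui - lr * (gi + lam * sign)       # penalty never vanishes near 0
        if step * ui < 0 and abs(gi) <= lam:
            # Crossed the origin and the data gradient cannot escape the
            # penalty: trap the parameter at zero and freeze it.
            frozen.add(i)
            new_u.append(0.0)
        else:
            new_u.append(step)
    return new_u, frozen

u, frozen = [0.05, 1.0], set()
u, frozen = l1_prune_step(u, grad=[0.2, -0.3], lam=0.5, lr=0.2, frozen=frozen)
```

Here the small first parameter crosses zero and is frozen, while the second takes an ordinary damped step; repeating such steps is how the prior produces a sparse weight matrix without any combinatorial search over which weights to delete.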
Slide 46: Results: Generalisation Performance
Synthetic dataset with 49 motifs, 20 modules and 1800 datapoints
Slide 47: Results: Interpretability
[Figure: learnt weights of the default model vs. the true module structure vs. learnt weights of the Laplacian-prior model]
Slide 48: Regrets: Biological Data
Slide 49: Aims for This Presentation
- Briefly present the Segal model and the main criticisms offered in the thesis
- Briefly introduce PRMs
- Outline directions for future work
Slide 50: Probabilistic Relational Models
How do we model context-specific regulation? We need to cluster the experiments...
Slide 51: Probabilistic Relational Models
Variable A can vary with genes but not with
experiments
Slide 52: Probabilistic Relational Models
We now have variability with experiments but also
with genes!
Slide 53: Probabilistic Relational Models
Variability with experiments as required, but too many dependencies
Slide 54: Probabilistic Relational Models
Variability with experiments as required, provided we constrain the parameters of the probability distributions P(E | A) to be equal
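The constraint amounts to parameter tying: every experiment-level expression variable E_j shares the single CPD P(E | A), so the instantiated BN's parameter count stays constant however many experiments are added. A toy sketch with hypothetical binary variables:

```python
# One shared CPD P(E = 1 | A = a): tying it across experiments keeps the
# number of parameters constant as experiments are added. The variables
# and probabilities here are hypothetical, for illustration only.
shared_cpd = {
    0: 0.1,   # P(E = 1 | A = 0)
    1: 0.8,   # P(E = 1 | A = 1)
}

def joint_prob(a, expressions, p_a=0.5):
    """P(A = a, E_1, ..., E_n): every E_j uses the same tied CPD."""
    p = p_a if a == 1 else 1 - p_a
    for e in expressions:
        p_e1 = shared_cpd[a]
        p *= p_e1 if e == 1 else 1 - p_e1
    return p

# Three experiments, but still only the two numbers in shared_cpd.
p = joint_prob(1, [1, 1, 0])
```

Without tying, each experiment would carry its own table P(E_j | A), and the number of parameters would grow linearly with the number of experiments; with tying, the same two entries serve every instantiation, which is exactly what makes the BN on this slide well-behaved.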
Slide 55: Probabilistic Relational Models
The resulting BN is essentially UNIQUE, but the derivation is VAGUE, COMPLICATED and UNSYSTEMATIC
Slide 56: Probabilistic Relational Models
GENES: g.S1, g.S2, ...; g.R1, g.R2, ...; g.M; g.E1, g.E2, ...
This variable cannot be considered an attribute of a gene, because it has attributes of its own that are gene-independent.
Slide 57: Probabilistic Relational Models
GENES: g.S1, g.S2, ...; g.R1, g.R2, ...; g.M; g.E1, g.E2, ...
Slide 58: Probabilistic Relational Models
GENES: g.S1, g.S2, ...; g.R1, g.R2, ...; g.M; g.E1, g.E2, ...
EXPERIMENTS: e.Cycle_Phase, e.Dye_Type
Slide 59: Probabilistic Relational Models
GENES: g.S1, g.S2, ...; g.R1, g.R2, ...; g.M; g.E1, g.E2, ...
EXPERIMENTS: e.Cycle_Phase, e.Dye_Type
An expression measurement is an attribute of both a gene and an experiment.
Slide 60: Probabilistic Relational Models
GENES: g.S1, g.S2, ...; g.R1, g.R2, ...; g.M; g.E1, g.E2, ...
EXPERIMENTS: e.Cycle_Phase, e.Dye_Type
MEASUREMENTS: m(e,g).Level
Slide 61: Examples of PRMs - 1
Segal et al., From Promoter Sequence to Gene Expression
Slide 62: Examples of PRMs - 1
Segal et al., From Promoter Sequence to Gene Expression
Slide 63: Examples of PRMs - 2
Segal et al., Decomposing gene expression into cellular processes
Slide 64: Examples of PRMs - 2
Segal et al., Decomposing gene expression into cellular processes
Slide 65: Probabilistic Relational Models
A PRM gives rise to a family of BNs: BN1, BN2, BN3, ...
Given Dataset1, the PRM yields BN1; given Dataset2, the PRM yields BN2.
Relational schema: a higher-level description of the data.
PRM: a higher-level description of BNs.
Slide 66: Probabilistic Relational Models
- Relational vs flat data structures
- Natural generalisation: knowledge carries over
- Expandability
- Richer semantics: better interpretability
- No loss in coherence
Personal opinion (not tested yet):
- Not entirely natural as a generalisation
- Some loss in interpretability
- Some loss in coherence
Slide 67: Aims for This Presentation
- Briefly present the Segal model and the main criticisms offered in the thesis
- Briefly introduce PRMs
- Outline directions for future work
Slide 68: Future Research
1. Improve the learning algorithm:
- soften it by exploiting sparsity
- systematise dynamic addition / deletion
Slide 69: Future Research
2. Model selection techniques improve interpretability: learn the optimal number of modules in our model.
Slide 70: Future Research
2. Model selection techniques improve interpretability: learn the optimal number of modules in our model.
- Are such methods consistent?
- Do they carry over just as well to PRMs?
Slide 71: Future Research
3. Fine-tune the Laplacian regulariser to fit the skewing of the model.
Slide 72: Future Research
4. The choice of encoding the question into a BN/PRM is only partly determined by the domain. Are there any general rules about how to restrict the choice so as to promote interpretability?
Slide 73: Future Research
5. Explore methods to express structural, non-quantifiable prior beliefs about the biological domain using Bayesian tools.
Slide 74: Summary
- Briefly presented the Segal model and the main observations offered in the thesis
- Briefly introduced PRMs
- Hinted at directions for future work