Title: Rich Probabilistic Methods for Gene Expression
1Rich Probabilistic Methods for Gene Expression
Eran Segal, Ben Taskar, Audrey Gasch, Nir Friedman, Daphne Koller
2Outline
- Motivation for richer models
- PRMs for gene expression
  - Modeling
  - Learning
  - Inference
- Results
  - Synthetic
  - Stress
  - Compendium
3One Sided Clustering
- Non-Parametric Clustering
- Hierarchical Agglomerative
- SVD
- K-means
- Parametric Clustering
- Probabilistic Clustering
[Figure: AutoClass using expression levels; nodes Gene-cluster, Level-1, Level-2, ..., Level-n (one level per experiment)]
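As a rough illustration of one-sided clustering, the sketch below runs plain k-means over gene expression profiles, so genes are grouped while the experiments are left ungrouped. The function name `kmeans_genes` and the toy data are illustrative assumptions, not taken from the slides.

```python
import random

def kmeans_genes(profiles, k, iters=20, seed=0):
    """Cluster gene expression profiles (one row per gene) with plain k-means."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]
    assign = [0] * len(profiles)
    for _ in range(iters):
        # Assignment step: each gene goes to the nearest center (squared distance).
        for g, p in enumerate(profiles):
            assign[g] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])),
            )
        # Update step: each center becomes the mean profile of its member genes.
        for c in range(k):
            members = [profiles[g] for g in range(len(profiles)) if assign[g] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

# Toy data (made up): four genes measured in three experiments.
toy = [[2.0, 2.1, -1.9], [1.9, 2.0, -2.0], [-2.0, -2.1, 2.0], [-1.8, -2.0, 2.2]]
labels, _ = kmeans_genes(toy, k=2)
```

Only the rows are clustered here; every experiment is treated as a separate, unrelated dimension, which is the limitation the following slides address.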
4One Sided Clustering
[Figure: Genes by Experiments matrix split into gene clusters Cluster 1 to Cluster 6; annotations: Undetected Separability, Undetected Similarity]
5Basic Bi-Clustering
[Figure: Genes by Experiments matrix split into a grid of blocks C1 to C18; annotations: Detected Separability, Undetected Similarity]
6Desired Clustering
- Allow for non-grid clusters
- Rows no longer correspond to genes (similarly for columns)
[Figure: Genes by Experiments matrix split into non-grid blocks C1 to C15; annotations: Detected Separability, Detected Similarity]
7Basic Bi-Clustering
[Figure: network with cluster nodes Clust(gene1), Clust(gene2), Clust(exp1), Clust(exp2) and expression nodes G1-E1, G1-E2, G2-E1, G2-E2]
Two-sided clustering (PLSA, Hofmann)
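One way to read the grid model: each expression level is explained by the pair (gene cluster, experiment cluster) it falls into. The sketch below computes the mean level of every such block from fixed cluster labels. The function name `block_means` and the toy matrix are assumptions made for illustration.

```python
from collections import defaultdict

def block_means(levels, gene_cluster, exp_cluster):
    """Mean expression level for every (gene cluster, experiment cluster) block."""
    totals, counts = defaultdict(float), defaultdict(int)
    for g, row in enumerate(levels):
        for e, level in enumerate(row):
            key = (gene_cluster[g], exp_cluster[e])
            totals[key] += level
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Toy matrix (made up): 3 genes by 3 experiments.
levels = [[1.0, 1.0, -1.0],
          [1.0, 1.0, -1.0],
          [0.0, 0.0, 2.0]]
means = block_means(levels, gene_cluster=[0, 0, 1], exp_cluster=[0, 0, 1])
```

With two gene clusters and two experiment clusters, the whole matrix is summarized by four block means, which is the grid structure shown on the bi-clustering slides.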
8Outline
- Motivation for richer models
- PRMs for gene expression
  - Modeling
  - Learning
  - Inference
- Results
  - Synthetic
  - Stress
  - Compendium
9PRMs: Basic Bi-Clustering
[Figure: classes of objects Gene, Experiment, and Expression, with attributes Gene-cluster, Exp. cluster, and Level]
Compact representation of two-sided clustering
10PRMs: Relational Schema
- Describes the types of objects and relations in the database
[Figure: relational schema with classes Gene, Experiment, and Expression; attributes shown: Cluster (for genes and for experiments), Binding Sites, Functional Classes, Mutation, Exp. Attributes, Exp. Level]
11PRM for Compendium Data
- Parameters for nodes
- Structure over gene features
[Figure: PRM with classes Gene and Array/Mutated Gene; nodes GCluster, GCN4, HSF, Lipid, Endoplasmatic, GCluster (of mutated gene), Lipid (of mutated gene), ACluster, and Expression Level]
12Resulting Bayesian Network
- 3 genes, 2 mutation experiments
[Figure: ground Bayesian network; per-experiment nodes Lipid and ACluster, per-gene nodes GCluster and Endoplasmatic, and expression nodes E1,1 through E3,2]
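The step from a class-level PRM to a ground Bayesian network can be sketched as unrolling a template over concrete objects. In the sketch below, the parent sets (GCluster and Endoplasmatic per gene, ACluster and Lipid per experiment) are a simplifying assumption based on the node labels on the slide, and `unroll` is a hypothetical helper name.

```python
from itertools import product

def unroll(genes, experiments, gene_attrs, exp_attrs):
    """Ground a class-level template into per-object parent -> child edges."""
    edges = []
    for g, e in product(genes, experiments):
        child = f"Level[{g},{e}]"
        # Every expression node inherits the same parent pattern from the template.
        edges += [(f"{a}[{g}]", child) for a in gene_attrs]
        edges += [(f"{a}[{e}]", child) for a in exp_attrs]
    return edges

edges = unroll(
    genes=["g1", "g2", "g3"],
    experiments=["e1", "e2"],
    gene_attrs=["GCluster", "Endoplasmatic"],  # assumed gene-side parents
    exp_attrs=["ACluster", "Lipid"],           # assumed experiment-side parents
)
nodes = {n for edge in edges for n in edge}
```

For 3 genes and 2 experiments this yields 6 expression nodes that all share one parameterization, which is why the class-level model stays compact as the data grows.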
13PRM Learning
[Figure: Data and Expert knowledge are inputs to a Learner that produces the PRM (Gene, Experiment, Gene-cluster, Exp. cluster, Expression Level)]
- PRMs can be learned from empirical data
  - parameter estimation
  - structure learning: learning the dependency structure
- Can learn with missing data and hidden variables
14PRM Learning
- Goal: Find PRM structure that explains the data well
- Define scoring function to evaluate models
- Bayesian Score works best: Score(S : D) = log P(D | S) + log P(S), where P(D | S) is the marginal likelihood and P(S) is the prior
- Automatically trades off fit to data (likelihood of data) with model complexity
- Do heuristic search to find high-scoring structure
- Structure found is not necessarily the best one
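The trade-off between fit and complexity can be seen in a small worked case. The sketch below scores a binary child variable with and without a binary candidate parent, using the closed-form marginal likelihood of multinomial counts under a symmetric Dirichlet prior. The uniform structure prior (so log P(S) is a constant and is dropped), the choice alpha = 1, and the toy counts are all assumptions made for illustration.

```python
from math import lgamma

def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of multinomial counts under a symmetric Dirichlet(alpha) prior."""
    n, k = sum(counts), len(counts)
    out = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        out += lgamma(alpha + c) - lgamma(alpha)
    return out

def family_score(tables):
    """Sum the log marginal likelihood over parent configurations (one count table each)."""
    return sum(log_marginal(t) for t in tables)

# Case A: the parent is irrelevant; the child looks the same for both parent values.
no_parent_a = family_score([[10, 10]])
with_parent_a = family_score([[5, 5], [5, 5]])

# Case B: the parent determines the child.
no_parent_b = family_score([[10, 10]])
with_parent_b = family_score([[10, 0], [0, 10]])
```

In case A the simpler structure scores higher even though both fit equally well, and in case B the extra edge pays for itself; no explicit penalty term is needed.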
15Learning PRMs
- Parameter Estimation
  - EM
  - Approximate Inference for E-Step
- Structure Learning, Complete Data
  - Learning Tree splits
  - Avoiding Local Maxima
- Structure Learning, Incomplete Data
  - Iterate until convergence (Hard SEM)
    - EM
    - Hard assignment
    - Structure Learning
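The hard-assignment part of this loop can be sketched for the plain two-sided clustering model: re-estimate block means, then assign each gene (and then each experiment) to its single best cluster. This is only a minimal sketch under a squared-error criterion; it omits the structure-learning step of Hard SEM, and the function names and toy data are assumptions.

```python
def block_mean(levels, gene_cluster, exp_cluster, a, b):
    """Mean level over cells whose gene is in cluster a and experiment in cluster b."""
    cells = [levels[g][e]
             for g in range(len(levels)) if gene_cluster[g] == a
             for e in range(len(levels[0])) if exp_cluster[e] == b]
    return sum(cells) / len(cells) if cells else 0.0

def all_means(levels, gene_cluster, exp_cluster, k_g, k_e):
    return {(a, b): block_mean(levels, gene_cluster, exp_cluster, a, b)
            for a in range(k_g) for b in range(k_e)}

def hard_em(levels, gene_cluster, exp_cluster, k_g, k_e, iters=10):
    n_g, n_e = len(levels), len(levels[0])
    for _ in range(iters):
        # Re-estimate parameters, then hard-assign every gene to its best cluster.
        mean = all_means(levels, gene_cluster, exp_cluster, k_g, k_e)
        gene_cluster = [
            min(range(k_g), key=lambda a: sum(
                (levels[g][e] - mean[a, exp_cluster[e]]) ** 2 for e in range(n_e)))
            for g in range(n_g)]
        # Re-estimate again, then hard-assign every experiment.
        mean = all_means(levels, gene_cluster, exp_cluster, k_g, k_e)
        exp_cluster = [
            min(range(k_e), key=lambda b: sum(
                (levels[g][e] - mean[gene_cluster[g], b]) ** 2 for g in range(n_g)))
            for e in range(n_e)]
    return gene_cluster, exp_cluster

# Toy matrix (made up) with a 2 x 2 block structure and a partly wrong initialization.
levels = [[3, 3, -3, -3], [3, 3, -3, -3], [-3, -3, 3, 3], [-3, -3, 3, 3]]
genes, exps = hard_em(levels, [0, 0, 0, 1], [0, 0, 0, 1], k_g=2, k_e=2)
```

Like any hard-assignment scheme, this can stall in a poor local optimum, for example when the initial labels carry no information about the blocks.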
16Context Specific Dependencies
[Figure: tree-structured dependency for Level with true/false branches; tests shown include GCluster 0 (of gene), GCluster 3 (of mutant), HSF > 2, ACluster 4, and Endoplasmatic; each leaf is labeled Level]
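A context-specific dependency of this kind can be represented as a decision tree whose leaves hold the parameters for Level. The sketch below is hypothetical: the tree shape only loosely follows the tests named on the slide, HSF is read as a count of binding sites, and the leaf values are made-up placeholders rather than learned parameters.

```python
# Hypothetical tree for Level. Internal nodes test the context; leaves hold
# made-up (mean, variance) pairs that stand in for learned parameters.
TREE = {
    "test": lambda ctx: ctx["gcluster"] == 0,
    "true": {
        "test": lambda ctx: ctx["hsf_sites"] > 2,
        "true": {"leaf": (1.5, 0.4)},
        "false": {"leaf": (0.2, 0.6)},
    },
    "false": {
        "test": lambda ctx: ctx["acluster"] == 4,
        "true": {"leaf": (-1.0, 0.5)},
        "false": {"leaf": (0.0, 1.0)},
    },
}

def level_params(ctx, node=TREE):
    """Walk the tree until a leaf is reached and return its parameters."""
    while "leaf" not in node:
        node = node["true"] if node["test"](ctx) else node["false"]
    return node["leaf"]

params = level_params({"gcluster": 0, "hsf_sites": 3, "acluster": 1})
other = level_params({"gcluster": 2, "hsf_sites": 0, "acluster": 4})
```

Note that HSF only matters on the branch where the gene cluster is 0; on the other branch it is never consulted, which is what makes the dependency context-specific.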
17Learning Process
[Figure: the PRM for compendium data (Gene: GCluster, HSF, GCN4, Lipid, Endoplasmatic; Array/Mutated Gene: GCluster and Lipid of mutated gene, ACluster; Expression Level) beside the Genes by Experiments matrix]
18Learning Process
[Figure: the PRM for compendium data (Gene: GCluster, HSF, GCN4, Lipid, Endoplasmatic; Array/Mutated Gene: GCluster and Lipid of mutated gene, ACluster; Expression Level) beside the Genes by Experiments matrix]
19Learning Process
[Figure: the PRM for compendium data beside the Genes by Experiments matrix; highlighted step: Gene Similarity]
20Learning Process
[Figure: the PRM for compendium data beside the Genes by Experiments matrix; highlighted step: Experiment Similarity]
21Learning Process
[Figure: the PRM for compendium data beside the Genes by Experiments matrix; highlighted step: Separability by TF]
22Learning Process
[Figure: the PRM for compendium data beside the Genes by Experiments matrix; highlighted steps: Attribute Dependencies, Induce Cluster Change]
23Learning Process
[Figure: the PRM for compendium data beside the Genes by Experiments matrix; highlighted step: Achieved Desired Clustering]
24Outline
- Motivation for richer models
- PRMs for gene expression
  - Modeling
  - Learning
  - Inference
- Results
  - Synthetic
  - Stress
  - Compendium
25Synthetic Data: Recovering Structure
- Synthetic data: 1000 genes, 90 arrays (12 types)
- Parents recovered
  - Simulated data: 84.5% +/- 2.5
  - Permuted data: 56% +/- 2.5
- Cluster recovery
  - Simulated data: PRMs 98.4% +/- 1.07; Naïve Bayes 90.8% +/- 0.42
  - Permuted data: PRMs 88.1% +/- 1.52; Naïve Bayes 76.7% +/- 1.42
26Stress Data
- 954 genes, 88 arrays (12 types)
- Structure learning
  - 15 significant TFs
  - 7 significant function categories
- Cluster coherence: average variance reduction 0.69 -> 0.61 in 3 iterations
- Allowing annotation changes: average variance reduction 0.69 -> 0.56 in 3 iterations
27Fragment of PRM for Yeast Stress Data (Gasch et al.)
[Figure: PRM fragment with classes Gene and Array and nodes GCluster, Carbon, AAM, Mig1, Condition, and Expression Level]
28Result: Context-Specific Groupings
- A grouping is a set of genes that behave the same within a certain context: a condition or a set of conditions
- Breakdown of genes into clusters is different in different contexts
Yeast Stress Data (Gasch et al.)
29Example Biological Result
- Discovered grouping of 17 genes
  - all induced in diauxic shift
  - all have ≥ 2 binding sites for Mig1 transcription factor
  - many not known to have been regulated by Mig1
- Context-sensitive groupings were key to identifying the cluster
30Compendium Data Results
- Predict the array cluster of a particular gene mutation before performing the experiment
- Can hope to do this because
  - array cluster depends on gene cluster
  - gene cluster predicted based on behavior in other arrays
[Figure: Correct predictions and Total predicted plotted against Prediction confidence (x-axis 0.2 to 1); y-axis Accuracy / Predicted (0 to 1); callout: 44 arrays predicted at 95% accuracy]
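A curve like this can be produced by thresholding predictions on their confidence and measuring accuracy among those that remain. The sketch below assumes each prediction is a (confidence, correct) pair; the function name `accuracy_at` and the toy values are illustrative only and are not the results reported on the slide.

```python
def accuracy_at(predictions, threshold):
    """Return (number predicted, accuracy) among predictions with confidence >= threshold."""
    kept = [correct for confidence, correct in predictions if confidence >= threshold]
    if not kept:
        return 0, None  # nothing is predicted at this confidence level
    return len(kept), sum(kept) / len(kept)

# Toy predictions (made up): (confidence in the predicted array cluster, was it correct?)
toy = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.30, False)]
n_high, acc_high = accuracy_at(toy, 0.9)
n_all, acc_all = accuracy_at(toy, 0.2)
```

Raising the threshold trades coverage for accuracy: fewer arrays receive a prediction, but those that do are more likely to be right.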
31Future Directions
- Handling time
- Handling sequence data (TFs)
- Incorporate structure information
- Discovering pathways