Title: Using Bayesian Networks to Analyze Expression Data
1Using Bayesian Networks to Analyze Expression Data
- Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe'er
- Presenter: Kyu-Baek Hwang
2Abstract
3Introduction
- A central goal of molecular biology is to understand the regulation of protein synthesis.
- DNA microarray experiments can measure thousands of gene expression levels simultaneously.
- An important challenge is to develop methodologies that are both statistically sound and computationally tractable.
- Approach taken here: Bayesian network learning.
4An Example of a Simple Bayesian Network Structure
- Gene B and Gene D are independent given Gene A.
- Observing Gene B induces a dependency between Gene A and Gene E.
- Gene A and Gene C are independent given Gene B.
5Representing Distributions with Bayesian Networks
- A Bayesian network is a representation of a joint probability distribution.
- A Bayesian network has two components:
- $G$: a directed acyclic graph structure
- $\Theta$: a set of parameters for the conditional distribution of each variable
- The joint probability distribution of $X_1, \ldots, X_n$ is represented by the Bayesian network as
  $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}^G(X_i))$
- where $\mathrm{Pa}^G(X_i)$ is the set of parents of $X_i$ in $G$.
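The factorization can be illustrated with a small sketch. The structure used here (A -> B <- E, A -> D, B -> C) is an assumption chosen to agree with the independence statements on the example slide, and every probability value is made up for illustration.

```python
from itertools import product

# Assumed example structure over binary variables: A -> B <- E, A -> D, B -> C.
parents = {"A": [], "E": [], "B": ["A", "E"], "C": ["B"], "D": ["A"]}

# cpt[X][parent values] = P(X = 1 | parents). All numbers are hypothetical.
cpt = {
    "A": {(): 0.3},
    "E": {(): 0.6},
    "B": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
    "C": {(0,): 0.2, (1,): 0.8},
    "D": {(0,): 0.4, (1,): 0.75},
}

def joint(assignment):
    """P(x_1, ..., x_n) as the product of P(x_i | Pa(x_i))."""
    p = 1.0
    for var, pa in parents.items():
        p_one = cpt[var][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p

names = list(parents)
# Summing the factorized joint over all 2^5 assignments should give 1.
total = sum(joint(dict(zip(names, vals)))
            for vals in product([0, 1], repeat=len(names)))
```

Because each factor is a proper conditional distribution, the product sums to one over all assignments without any extra normalization.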
6Equivalence Classes of Bayesian Networks
- More than one graph can imply exactly the same set of independencies.
- Theorem 2.1: Two DAGs are equivalent if and only if they have the same underlying undirected graph and the same v-structures.
- A PDAG (partially directed acyclic graph) uniquely represents an equivalence class of network structures.
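The equivalence criterion of Theorem 2.1 can be checked directly. The sketch below assumes each graph is a valid DAG given as a list of (parent, child) pairs; the function names are illustrative, and isolated nodes are ignored.

```python
from itertools import combinations

def skeleton(edges):
    """Underlying undirected graph: each edge as an unordered pair."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Triples (a, child, b) with a -> child <- b and a, b not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    found = set()
    for child, pa in parents.items():
        for a, b in combinations(sorted(pa), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def equivalent(edges1, edges2):
    """Same skeleton and same v-structures (inputs assumed to be DAGs)."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

chain = [("X", "Y"), ("Y", "Z")]      # X -> Y -> Z
reverse = [("Z", "Y"), ("Y", "X")]    # X <- Y <- Z
fork = [("Y", "X"), ("Y", "Z")]       # X <- Y -> Z
collider = [("X", "Y"), ("Z", "Y")]   # X -> Y <- Z
```

The chain, its reversal, and the fork share a skeleton and have no v-structures, so they fall in one equivalence class; the collider has the same skeleton but is not equivalent to them.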
7Learning Bayesian Networks
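- Given a training set $D = \{x^1, \ldots, x^N\}$ of independent instances of $X$, find a network $B = \langle G, \Theta \rangle$ that best matches $D$.
- The score function for a network is defined as
  $S(G : D) = \log P(G \mid D) = \log P(D \mid G) + \log P(G) + C$
- where $C$ is a constant independent of $G$, and
  $P(D \mid G) = \int P(D \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta$
- is the marginal likelihood, which averages the probability of the data over all possible parameter assignments to $G$.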
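For discrete variables with Dirichlet priors, the marginal likelihood has a closed form that decomposes over families (a variable and its parents). The sketch below is a minimal version under stated assumptions: complete data, values coded 0..r-1, a symmetric Dirichlet(alpha) prior per parent configuration, and a uniform structure prior so that log P(G) is constant. The function names and the `alpha` parameter are illustrative, not the specific prior used by the authors.

```python
from collections import Counter, defaultdict
from math import lgamma

def family_log_ml(data, child, parents, arity, alpha=1.0):
    """Log marginal likelihood of one variable given its parents,
    assuming a Dirichlet(alpha, ..., alpha) prior per parent configuration."""
    counts = defaultdict(Counter)
    for row in data:
        counts[tuple(row[p] for p in parents)][row[child]] += 1
    r = arity[child]
    total = 0.0
    for cfg_counts in counts.values():
        n = sum(cfg_counts.values())
        total += lgamma(r * alpha) - lgamma(r * alpha + n)
        for k in range(r):
            total += lgamma(alpha + cfg_counts[k]) - lgamma(alpha)
    return total

def log_score(data, graph, arity, alpha=1.0):
    """log P(D | G) as a sum of family terms; graph maps variable -> parent list."""
    return sum(family_log_ml(data, x, pa, arity, alpha)
               for x, pa in graph.items())

# Toy data in which Y always copies X.
data = [{"X": 0, "Y": 0}, {"X": 1, "Y": 1}] * 10
arity = {"X": 2, "Y": 2}
with_edge = {"X": [], "Y": ["X"]}
no_edge = {"X": [], "Y": []}
```

On this toy data the graph with the edge X -> Y scores higher than the empty graph, since averaging over parameters still favors the structure that explains the dependence.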
8Learning Causal Patterns
- Causal networks have a stricter interpretation of the meaning of edges: the parents of a variable are its immediate causes.
9Analyzing Expression Data
- Consider probability distributions over all possible states of the system in question.
- Describe the state of the system using random variables.
- These random variables include:
- Expression levels of individual genes,
- Experimental conditions,
- Temporal indicators, and
- Background variables.
10Representing Partial Models
- Analyze the set of plausible networks and attempt to characterize features that are common to most of these networks.
- Features:
- Markov relations: Is Y in the Markov blanket of X?
- Order relations: Is X an ancestor of Y in all the networks of a given equivalence class?
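Both feature types can be read off a graph. The sketch below works on a single DAG given as a map from each variable to its parents; it is a simplification, since an order relation as defined above has to hold in every network of the equivalence class, which would require working on the PDAG. The example structure and function names are assumptions for illustration.

```python
def markov_blanket(parents, x):
    """Parents, children, and other parents of the children of x."""
    children = [c for c, pa in parents.items() if x in pa]
    blanket = set(parents[x]) | set(children)
    for c in children:
        blanket |= set(parents[c])
    blanket.discard(x)
    return blanket

def is_ancestor(parents, x, y):
    """True if there is a directed path from x to y in this one DAG."""
    stack, seen = list(parents[y]), set()
    while stack:
        node = stack.pop()
        if node == x:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

# Assumed example structure: A -> B <- E, A -> D, B -> C.
dag = {"A": [], "E": [], "B": ["A", "E"], "C": ["B"], "D": ["A"]}
```

Here the Markov blanket of A contains its children B and D and also E, the other parent of B.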
11Estimating Statistical Confidence in Features
- To what extent does the data support a given feature?
- An effective and relatively simple approach for estimating confidence is the bootstrap method.
- For $i = 1, \ldots, m$:
- Re-sample with replacement $N$ instances from $D$. Denote by $D_i$ the resulting dataset.
- Apply the learning procedure on $D_i$ to induce a network structure $G_i$.
- For each feature $f$ of interest, calculate
  $\mathrm{conf}(f) = \frac{1}{m} \sum_{i=1}^{m} f(G_i)$
- where $f(G)$ is 1 if $f$ is a feature in $G$, and 0 otherwise.
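The bootstrap loop can be sketched as follows. The learner `toy_learn` is a hypothetical stand-in (a correlation threshold on two variables), not a Bayesian network learner; any procedure mapping a dataset to a structure could be passed in its place. The data and the default `m=200` are also made up for illustration.

```python
import random
from math import sqrt

def toy_learn(rows, threshold=0.5):
    """Stand-in learner (hypothetical): report the feature 'X-Y' when
    the absolute sample correlation of the two columns exceeds threshold."""
    xs, ys = zip(*rows)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return set()
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    return {"X-Y"} if abs(sxy / sqrt(sxx * syy)) > threshold else set()

def bootstrap_confidence(data, learn, features, m=200, seed=0):
    """conf(f) = (1/m) * sum_i f(G_i), with G_i learned from resample D_i."""
    rng = random.Random(seed)
    n = len(data)
    hits = {name: 0 for name in features}
    for _ in range(m):
        # D_i: N instances drawn from D with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        g_i = learn(resample)
        for name, f in features.items():
            hits[name] += f(g_i)
    return {name: h / m for name, h in hits.items()}

features = {"X-Y": lambda g: 1 if "X-Y" in g else 0}
linked = [(i, 2 * i + (i % 3)) for i in range(30)]   # strongly related columns
unrelated = [(i, i % 2) for i in range(30)]          # nearly uncorrelated columns
conf_linked = bootstrap_confidence(linked, toy_learn, features)
conf_unrelated = bootstrap_confidence(unrelated, toy_learn, features)
```

A feature that the learner recovers in nearly every resample gets confidence close to 1, while a feature that appears only by chance gets confidence close to 0.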
12Efficient Learning Algorithms
- Sparse Candidate algorithm
- Identify a relatively small number of candidate parents for each variable based on simple local statistics (such as correlation).
- Restrict the search space to candidate parents.
- Greedy search with restriction on the search space.
- Score for the set of candidate parents
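The candidate-selection step can be sketched with absolute Pearson correlation as the local statistic. This covers only the first step (choosing candidates), not the restricted greedy search, and the choice of statistic, the function names, and the toy expression values are assumptions for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 if either column is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)

def candidate_parents(columns, k):
    """For each variable, keep the k other variables with the largest |correlation|."""
    out = {}
    for x in columns:
        ranked = sorted(
            (y for y in columns if y != x),
            key=lambda y: abs(pearson(columns[x], columns[y])),
            reverse=True,
        )
        out[x] = ranked[:k]
    return out

# Hypothetical expression values for four genes over five experiments.
columns = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 11],
    "g3": [5, 1, 4, 2, 3],
    "g4": [1, 2, 3, 5, 4],
}
candidates = candidate_parents(columns, k=2)
```

A structure search would then only consider edges into each variable from its candidate list, which shrinks the search space considerably when k is small.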
13Local Probability Models
- Multinomial model
- Discretization of the expression levels
- Linear Gaussian model
- $P(X \mid u_1, u_2, \ldots, u_k) \sim N(a_0 + \sum_i a_i u_i, \sigma^2)$
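The two local models can be sketched in a few lines. The three-level discretization and its `threshold=0.5` default are assumptions for illustration, as are the function names; the linear Gaussian density follows the formula above.

```python
from math import log, pi

def discretize(log_ratio, threshold=0.5):
    """Map an expression log-ratio to -1 (under-expressed), 0 (baseline),
    or 1 (over-expressed). The threshold value is an assumption."""
    if log_ratio < -threshold:
        return -1
    if log_ratio > threshold:
        return 1
    return 0

def linear_gaussian_logpdf(x, u, a0, a, sigma2):
    """log N(x; a0 + sum_i a_i * u_i, sigma2) for parent values u."""
    mean = a0 + sum(ai * ui for ai, ui in zip(a, u))
    return -0.5 * (log(2 * pi * sigma2) + (x - mean) ** 2 / sigma2)

levels = [discretize(v) for v in (-1.2, 0.1, 0.7)]
density_at_mean = linear_gaussian_logpdf(3.0, [1.0, 2.0], 0.0, [1.0, 1.0], 1.0)
```

The multinomial model loses information through discretization but can capture non-linear dependencies, whereas the linear Gaussian model keeps the continuous values but only detects approximately linear relations.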
14Application to Cell Cycle Expression Patterns
- The data of Spellman et al.
- 76 gene expression measurements of the mRNA levels of 6177 S. cerevisiae ORFs.
- Six time series under different cell cycle synchronization methods.
- Each measurement was treated as an independent sample from a distribution.
- An additional root variable denoting the cell cycle phase.
15The Learned Bayesian Network Structure
16Robustness Analysis
- Create a random data set by randomly permuting
the order of the experiments independently for
each gene.
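The randomization can be sketched as an independent shuffle of each gene's column. The sketch assumes the data is a list of rows (experiments) with one entry per gene; the function name and toy matrix are illustrative.

```python
import random

def permute_per_gene(matrix, seed=0):
    """Shuffle the experiment order independently for each gene (column).
    Each gene keeps its own values, but dependencies between genes are broken."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*matrix)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]

# Hypothetical 4 experiments x 3 genes.
original = [[1, 10, 100], [2, 20, 200], [3, 30, 300], [4, 40, 400]]
randomized = permute_per_gene(original, seed=1)
```

Features learned with high confidence from the randomized data indicate how often such confidence arises by chance, which gives a baseline for the results on the real data.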
17Comparison of Multinomial Distribution and Linear Gaussian Distribution
18Biological Analysis of Order Relations
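- The dominance score of X is defined as
  $\sum_{Y,\, C_o(X,Y) > t} C_o(X,Y)^k$
- where $C_o(X,Y)$ is the confidence in the order relation "X is an ancestor of Y", $t$ is a threshold that discards low-confidence features, and $k$ is a constant that rewards high-confidence features.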
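The dominance score is a thresholded sum of powered confidences, as in this sketch. The confidence values and the defaults `k=2` and `t=0.5` are made up for illustration.

```python
def dominance_score(x, order_conf, k=2, t=0.5):
    """Sum of C_o(x, Y)^k over all Y whose order confidence exceeds t.
    order_conf maps (X, Y) pairs to the confidence that X is an ancestor of Y."""
    return sum(c ** k for (a, _), c in order_conf.items() if a == x and c > t)

# Hypothetical order-relation confidences.
order_conf = {
    ("A", "B"): 0.9,
    ("A", "C"): 0.4,   # below the threshold, so it is ignored
    ("A", "D"): 0.8,
    ("B", "A"): 0.95,
}
score_a = dominance_score("A", order_conf, k=2, t=0.5)
```

Genes with a high score precede many other genes with high confidence, which makes them candidates for further biological inspection.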
19Biological Analysis of Markov Relations
20Conditional Independence in the Network
21Discussion and Future Work
- A novel search algorithm
- An approach for estimating statistical confidence
- Discover causal relationships and interactions between genes
- Probabilistic semantics
- Future extensions
- Local probability models
- Estimating confidence levels
- Biological knowledge as prior
- Search heuristics
- Temporal models
- Discover hidden variables (e.g. protein
activation)