Title: RAGED: Robust Analysis of Gene Expression Data
1 September 22-25, 2002
Virtual Seminars on Genomics and Bioinformatics
www.ndsu.edu/virtual-genomics
2 Microarray Gene Expression Data Analysis and Visualization
3 Analysis of Microarray Gene Expression Data
- DNA microarrays are increasing the level of understanding of complex biological systems
- an exponential growth in the size and complexity of the available data
- the increase in data can mean less information reaching the user
- new challenges for analysis
4 Unsupervised Analysis: Clustering
- no existing knowledge about the functionality of genes is available prior to the analysis
- the most common form of microarray data interpretation
- simple
- graphical representations (dendrograms)
- fast execution
- divides the elements into distinct groups that are mutually exclusive and collectively exhaustive
5
- elements in the same cluster exhibit similar biological functions
- the choice of similarity measure is critical
- there are different types of clustering algorithms
6
- Partitional clustering makes implicit assumptions on the form and number of clusters.
- K-means (a minimal sketch follows)
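As a concrete illustration of partitional clustering, here is a minimal K-means sketch in Python; the random expression matrix, the choice of six clusters, and the log transform are illustrative assumptions, not part of the original slides.

```python
# Minimal K-means sketch (hypothetical expression matrix: rows = genes, columns = samples).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expression = rng.lognormal(mean=5, sigma=1, size=(500, 20))   # 500 genes x 20 samples
log_expr = np.log2(expression + 1)                            # stabilize the skewed scale

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(log_expr)
labels = kmeans.labels_                                       # mutually exclusive, exhaustive groups

for k in range(6):
    print(f"cluster {k}: {np.sum(labels == k)} genes")
```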
7 Supervised Analysis: Classification
- requires previously assembled training sets in which each element has a label
- classifies samples with unknown labels (functional category, in our case)
- can be of two forms
  - trained/eager
    - decision trees
    - neural nets
  - lazy/example-based
    - KNN (see the sketch after this list)
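A minimal sketch of the lazy/example-based route, using a k-nearest-neighbour classifier on hypothetical labelled samples; the data, the two classes, and k = 5 are assumptions for illustration.

```python
# Minimal k-nearest-neighbour sketch: labelled training profiles predict the
# class of new samples (hypothetical data; labels stand in for functional categories).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))            # 100 samples x 50 selected genes
y = rng.integers(0, 2, size=100)          # known labels for the training set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)   # lazy/example-based learner
print("held-out accuracy:", knn.score(X_test, y_test))
```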
8 Supervised vs. Unsupervised
- supervised analysis incorporates human direction (class labels)
- it tends to be more robust, especially with datasets having complex features (large, noisy, and incomplete)
9 Visualization
- achieved through GUIs
- provides different views of the results
- facilitates the biological interpretation of gene dynamics
- stimulates human visual pattern recognition (one of the best pattern recognizers!), saving processing time
- examples (one sketch follows)
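As one example of such a view, a clustered heatmap is a common way to let the eye pick out expression patterns quickly; the random matrix, the log transform, and the seaborn clustermap call below are illustrative choices, not tools named in the slides.

```python
# Minimal sketch: a clustered heatmap view of a hypothetical expression matrix.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
expression = rng.lognormal(5, 1, size=(60, 10))                    # 60 genes x 10 samples
sns.clustermap(np.log2(expression + 1), cmap="vlag", z_score=0)    # row-standardized, clustered view
plt.show()
```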
10 (No transcript)
11 Department of Mathematics and Statistics, University of Massachusetts, Amherst.
12 Decoding Gene Expression Control Using Generalized Gamma Networks
Paola Sebastiani, Department of Mathematics and Statistics, University of Massachusetts Amherst
Marco Ramoni, Children's Hospital Informatics Program, Harvard Medical School
13 Objective
- Objective: to develop models for decoding gene control interactions and elucidating gene functions. This is one of the goals of the modern approach to functional genomics.
- The modern approach is
  - genome-wide: the complete genome sequence of many organisms is available;
  - microarray-based: technology for measuring the expression levels of thousands of genes in parallel;
  - statistically needy: an incredible wealth of information that is not fully exploited or structured.
14 Central Dogma
- A gene is expressed when it makes proteins.
- Proteins are produced in two steps:
  - Transcription: the gene code is transcribed into mRNA (dual coding).
  - Translation: the mRNA is transformed, transported out of the nucleus, and translated into proteins.
- Expression level = mRNA abundance.
15 From Image to Data
[Figure: expression matrix, with genes as rows and samples as columns]
P. Sebastiani, E. Gussoni, I. S. Kohane and M. Ramoni (2003), Statistical Challenges in Functional Genomics (with discussion), Statistical Science, to appear.
16 Common Analysis
- Differential analysis: identify genes that have different levels of expression in two or more conditions.
- Classification rules: model the dependency between the differentially expressed genes and the classes to build predictive models.
- Cluster analysis: identify groups of genes that have the same expression profiles. Intuition: genes that belong to the same group may have similar functions.
17 Information and Knowledge
- Differential analysis, naïve predictive models, and clustering techniques increase information but not knowledge about the gene interaction mechanisms.
- Genes do not act independently; for example:
18 Example
- Prostate cancer data set, available from the website of the Cancer Genomics group at the Whitehead Institute.
- 50 samples from normal tissues and 52 samples from cancer tissues. The expression levels of 12,625 genes were measured with the Affymetrix U95A chip and MAS 5 software.
- For about 10 genes, the evidence of differential expression was very strong (posterior probability close to 1). The selection was done with BADGE (www.genomethods.org/badge).
19 Models
[Figure: networks learned from the normal specimens and from the tumor specimens]
20 Formalism
- Bayesian network.
- A directed acyclic graph in which nodes are random variables describing gene expression levels, and arcs define directed stochastic dependencies from parents to children.
- Advantage: it describes the dependency structure in probabilistic terms and simplifies it by using Markov properties, so a multivariate model can be studied by looking at its components.
[Figure: a two-node network in which Y1 (the parent) points to Y2 (the child)]
21 Big Advantage
- The DAG encodes local and global Markov properties that make learning and reasoning with the model very efficient.
- Local Markov property: Y ⊥ ND(Y) | pa(Y).
- Global Markov property: Y ⊥ V \ MB(Y) | MB(Y).
- Examples from the figure: Y3 ⊥ Y1 | Y2; Y5 ⊥ (Y1, Y2) | Y3, Y4; Y3 ⊥ Y1 | Y2, Y5, Y4.
- ND: the non-descendants of Y are all nodes that cannot be reached from Y along a directed path.
- MB: the Markov blanket is the set of parents, children, and parents of the children.
22 Factorization
Factorize the joint density as
p(y_1, ..., y_v) = ∏_i p(y_i | pa(y_i)),
where Y_i is a child variable with parents pa(Y_i).
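A toy sketch of what the factorization buys computationally: the joint log-density of a small chain is just the sum of the local family terms. The gamma conditionals, the inverse links, and all coefficients below are assumptions for illustration only.

```python
# Toy illustration of the factorization: the joint log-density of a three-node
# chain Y1 -> Y2 -> Y3 is the sum of the local terms log p(yi | pa(yi)).
from scipy.stats import gamma

def log_joint(y1, y2, y3, a=2.0):
    lp1 = gamma.logpdf(y1, a, scale=10.0 / a)                 # p(y1), mean 10
    mu2 = 1.0 / (0.01 + 0.5 / y1)                             # mean of Y2 given its parent Y1
    lp2 = gamma.logpdf(y2, a, scale=mu2 / a)                  # p(y2 | y1)
    mu3 = 1.0 / (0.02 + 0.3 / y2)                             # mean of Y3 given its parent Y2
    lp3 = gamma.logpdf(y3, a, scale=mu3 / a)                  # p(y3 | y2)
    return lp1 + lp2 + lp3                                    # factorized joint log-density

print(log_joint(12.0, 30.0, 25.0))
```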
23 Advantages
- Can exploit the factorization of the likelihood for
  - quantifying the network locally;
  - learning the network structure by using a sequence of local searches.
- Because an exhaustive search over all network structures is infeasible, many search algorithms have been developed to explore a subset of all possible networks.
- Exact solutions exist for discrete, Gaussian, and mixed networks if the parents are discrete.
24 Distributional Assumptions
- Microarrays produce data with skewed distributions.
- Log-normal: take the logarithm and the data are normal.
- Gamma: the data remain asymmetric (exponential-like).
25 Gamma Networks
- Gamma distributions are flexible enough to describe a variety of different shapes (a small sketch follows).
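A small sketch of that flexibility, plotting gamma densities with a common mean and increasing shape parameter; the mean of 10 and the particular shape values are arbitrary choices for illustration.

```python
# Sketch: the gamma family covers exponential-like, skewed, and nearly symmetric
# shapes as the shape parameter grows (mean held at 10 here for comparison).
import numpy as np
from scipy.stats import gamma
import matplotlib.pyplot as plt

x = np.linspace(0.01, 40, 400)
for shape in (1.0, 2.0, 5.0, 20.0):          # shape 1 = exponential; large shape -> near normal
    plt.plot(x, gamma.pdf(x, shape, scale=10.0 / shape), label=f"shape={shape}")
plt.legend()
plt.xlabel("expression level")
plt.ylabel("density")
plt.show()
```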
26 Generalized Gamma Networks
- Encode general non-linear dependencies.
[Figure: a child node Yi with parent nodes Yi2 and Yi3]
- Different link functions can be chosen (a sketch follows).
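A minimal sketch of one family in such a network: the child is gamma distributed and a link function ties its mean to a linear predictor of the parent values. The function name, the coefficients, and the parent values below are hypothetical.

```python
# One family of a generalized gamma network: Yi | pa(Yi) is gamma distributed,
# with its mean tied to a linear predictor of the parents through a link function.
import numpy as np

rng = np.random.default_rng(2)

def sample_child(parents, beta0, betas, shape=3.0, link="inverse"):
    """Draw Yi given its parents from a gamma with mean set by the chosen link."""
    eta = beta0 + np.dot(betas, parents)          # linear predictor in the parent values
    if link == "inverse":                         # reciprocal link: 1/mu = eta
        mu = 1.0 / eta
    elif link == "log":                           # log link: log(mu) = eta
        mu = np.exp(eta)
    else:                                         # identity link: mu = eta
        mu = eta
    return rng.gamma(shape, mu / shape)           # mean mu, shape alpha

yi2, yi3 = 120.0, 45.0                            # hypothetical parent expression values
print(sample_child(np.array([yi2, yi3]), beta0=0.01, betas=np.array([1e-4, 2e-4])))
```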
27 Learning Dependencies
- Given a link function and a specification of the linear predictor:
- Use MCMC to estimate the parameters.
- Compute the standard MLE of the parameters θi; these can be computed independently of αi.
- Given θi, compute the MLE of αi (an efficient procedure computes it using a Taylor expansion to order 5 about the moment estimator).
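A rough sketch of the two-step estimation, using a gamma GLM as a stand-in: the regression parameters (θ) are fitted first, then the shape (α) is recovered from a moment-style dispersion estimate. The order-5 Taylor refinement mentioned on the slide is not reproduced, and the simulated data are purely illustrative.

```python
# Two-step fit for one family: estimate theta via a gamma GLM, then alpha from
# the estimated dispersion (moment-style estimate only; no Taylor refinement).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
parent = rng.gamma(3.0, 40.0, size=200)                      # hypothetical parent gene
mu = 1.0 / (0.01 + 1.0 / parent)                             # inverse-link dependency
child = rng.gamma(4.0, mu / 4.0)                             # child gene, true shape alpha = 4

X = sm.add_constant(1.0 / parent)                            # predictor matching the inverse link
fit = sm.GLM(child, X, family=sm.families.Gamma()).fit()     # step 1: theta (default inverse link)
alpha_hat = 1.0 / fit.scale                                  # step 2: shape from the dispersion
print(fit.params, alpha_hat)
```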
28 Likelihood Function
The likelihood is a product of family contributions: for case k,
L_k(θ) = ∏_i p(y_ik | pa(y_i)_k, θ_i).
Multiply over k to get the full likelihood, L(θ) = ∏_k L_k(θ).
29 Learning the Structure
- Search the model space and assign a score to each network structure.
- Select the network with the optimal score.
- The score is the posterior probability, or the marginal likelihood when all networks are a priori equally likely.
- We use BIC/MDL, which factorizes according to the family structures and allows local learning (see the formula below).
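In its standard form, the BIC/MDL score decomposes into one term per family, which is what makes local learning possible; here d_i counts the free parameters of family i and n the number of cases.

```latex
% BIC/MDL score written as a sum of family contributions (standard form).
\mathrm{BIC}(M) \;=\; \sum_{i}
  \left[ \sum_{k=1}^{n} \log p\!\left(y_{ik} \mid \mathrm{pa}(y_i)_k, \hat{\theta}_i\right)
  \;-\; \frac{d_i}{2}\,\log n \right]
```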
30 Greedy Search
[Figure: a ladder of candidate models for each Y, from "Y alone" (1 model) up through one, two, three, and four parents chosen from the k possible parents]
- For each Y, take the set of its k possible parents.
- In each level, choose the dependency with the minimum MDL.
- If it is smaller than the smallest in the previous level, accept it and move up.
- Otherwise stop and accept the last smallest in the previous level.
- The procedure is implemented in Bayesware Discoverer (a sketch of this greedy selection follows).
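One plausible reading of the level-by-level procedure as code: greedily add the parent whose family has the smallest MDL and stop as soon as a level fails to improve. The `mdl_score` callable is a placeholder for the family score described above, not an implementation of it.

```python
# Greedy level-by-level parent selection for a single node Y.
def greedy_parents(y, candidates, data, mdl_score, max_parents=4):
    """Return the accepted parent set for node y (smaller mdl_score is better)."""
    best_parents, best_score = [], mdl_score(y, [], data)        # level "Y alone"
    while len(best_parents) < max_parents:
        level = [
            (mdl_score(y, best_parents + [p], data), p)
            for p in candidates if p not in best_parents
        ]
        if not level:
            break
        score, parent = min(level, key=lambda t: t[0])           # minimum MDL in this level
        if score < best_score:                                   # improved: accept and move up
            best_parents, best_score = best_parents + [parent], score
        else:                                                    # otherwise keep the previous level
            break
    return best_parents
```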
31 Models
[Figure: networks learned from the normal specimens and from the tumor specimens]
32 Validation of the Structure
- Use blanket residuals for validation.
- For each case in the data set, compute the expected value of each node given its Markov blanket and compare it with the observed value.
- Repeat for all cases in the data set (a sketch follows).
33 Structural Differences
[Figure: networks for the normal specimens and the tumor specimens]
The collagen gene 37605 is independent of all other nodes and becomes a child of the oncogene 914, which has transcription regulation functions.
34 Dependency Differences
Normal specimens: μ = 1/(0.01 + 1.8/y40282)
Tumor specimens: μ = 1/(0.02 + 2/y40282)
Gene 32598 has putative growth and transcription regulation functions.
35 Reasoning
[Figure: network learned from the 50 normal prostatectomy samples]
To fully decode gene interactions, propagate evidence in the network.
36 Evidence Propagation
- Well-known algorithms work for networks with all discrete, or all Gaussian, variables.
- Problems arise with all gamma-distributed variables: the joint and marginal distributions are not easy to compute.
- We can use stochastic algorithms for probabilistic reasoning.
- Several algorithms were developed in the 90s, focused on discrete networks (Cheng and Druzdzel, AIJ 2000).
- Gibbs sampling appears to be the most promising.
37 Gibbs Sampling
Assumptions: standard links (a sketch of a sampling-based propagation step follows).
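A sketch of sampling-based propagation for the setting described here. Because gamma full conditionals are not generally available in closed form, the sketch substitutes a Metropolis step for each unobserved node in turn (Metropolis-within-Gibbs) rather than exact Gibbs; `log_blanket_density` is a placeholder returning the sum of the node's own log-density and its children's log-densities given the current state.

```python
# Evidence propagation by sampling: observed nodes stay clamped, each hidden node
# is updated in turn from its Markov-blanket terms via a Metropolis step.
import numpy as np

def propagate(evidence, hidden_nodes, log_blanket_density, init, n_iter=5000, step=0.2):
    rng = np.random.default_rng(0)
    state = {**init, **evidence}                       # observed nodes stay clamped
    samples = {node: [] for node in hidden_nodes}
    for _ in range(n_iter):
        for node in hidden_nodes:
            current = state[node]
            proposal = current * np.exp(step * rng.normal())      # positive-valued proposal
            log_ratio = (log_blanket_density(node, proposal, state)
                         - log_blanket_density(node, current, state)
                         + np.log(proposal) - np.log(current))    # Jacobian of the log-scale walk
            if np.log(rng.uniform()) < log_ratio:
                state[node] = proposal
            samples[node].append(state[node])
    return samples                                     # posterior draws given the evidence
```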
38 Normal Specimens
Marginal densities when no evidence is propagated.
39 Normal Specimens
40 Tumor Specimens
Network learned from the prostate cancer data set (52 tumor samples). All inverse links. Compared to the other network, 40282_S_AT has lower expression, whereas the tumor-related genes have higher expression.
41 Normal Specimens
[Figure annotations:]
Mean = 16 (a marker of tumor differentiation)
Mean = 12 (growth and differentiation factors)
Observe 40282_S_AT = 300 (the average value in normal specimens). The gene is supposed to have a role in immune system biology.
42 Tumor Specimens
Evidence: 40282_S_AT = 70.
Look at how changes in 40282_S_AT determine changes of expression in the tumor markers.
43 Open Issues
- Decoding gene control: how to report the results?
- Mixing nodes with different distributions.
- How feasible is Gibbs sampling in large networks?
- Can we improve learning?
- In silico biology.
44 www.genomethods.org/caged
www.genomethods.org/badge
www.genomethods.org/best
www.bayesware.com