Raged: robust analysis of gene expression data - PowerPoint PPT Presentation

Provided by: paolaseb

Transcript and Presenter's Notes

1
September 22-25, 2002
Virtual Seminars on Genomics and Bioinformatics
www.ndsu.edu/virtual-genomics
2
Microarray Gene Expression Data Analysis and
Visualization
  • Imad Rahal

3
Analysis of Microarray Gene Expression Data
  • DNA microarrays are increasing the level of
    understanding of complex biological systems
  • an exponential growth in the size and complexity
    of the data available
  • data increase can mean less information to the
    user
  • new challenges for analysis

4
Unsupervised Analysis: Clustering
  • no existing knowledge about the functionality of
    genes is available prior to the analysis
  • most common form for interpreting microarray data
  • Simple
  • Graphical representations (dendrograms)
  • Execution speed
  • divides the elements into distinct groups that
    are mutually exclusive and collectively
    exhaustive

5
  • elements in the same cluster are expected to
    exhibit similar biological functions
  • the choice of similarity is very critical
  • different types of clustering algorithms

6
  • Partitional clustering makes implicit assumptions
    on the form and number of clusters.
  • K-means
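
A minimal K-means sketch, assuming each gene is a list of expression values and that squared Euclidean distance is the similarity measure. The function names, the fixed iteration count and the seed are illustrative choices, not part of the slides. It shows the implicit assumption mentioned above: the number of clusters k must be chosen in advance.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, n_iter=50, seed=0):
    """Partition profiles into k clusters; returns (labels, centers)."""
    rng = random.Random(seed)
    # Start from k profiles picked at random (an assumption of this sketch).
    centers = [list(p) for p in rng.sample(profiles, k)]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each profile goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in profiles]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy data: two well-separated groups of made-up profiles.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centers = kmeans(profiles, k=2)
```

Every profile receives exactly one label, so the resulting groups are mutually exclusive and collectively exhaustive.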

7
Supervised Analysis: Classification
  • requires previously assembled training sets with
    each element having a label
  • classifies samples with unknown labels
    (functional category in our case)
  • can be of two forms
  • trained/eager: decision trees, neural nets
  • lazy/example-based: KNN
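
A minimal sketch of the lazy/example-based form (KNN), assuming squared Euclidean distance and a majority vote among the k nearest labeled examples. The function names, training data and labels are made up for illustration.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(train, labels, query, k=3):
    """Label a query by majority vote among its k nearest training examples."""
    ranked = sorted(range(len(train)), key=lambda i: sq_dist(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: profiles with known functional categories.
train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = ["A", "A", "A", "B", "B", "B"]
predicted = knn_classify(train, labels, [4.8, 5.2], k=3)
```

No model is built ahead of time: all the work happens when a query arrives, which is what distinguishes the lazy form from the trained/eager one.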

8
Supervised vs. Unsupervised
  • Supervised analysis incorporates human direction
    (class labels)
  • It tends to be more robust, especially with
    datasets having complex features (large, noisy
    and incomplete)

9
Visualization
  • achieved through GUIs
  • provides different views of the results
  • facilitates the biological interpretation of gene
    dynamics
  • exploits human visual pattern recognition (one
    of the best!) to save processing time
  • examples

10
(No Transcript)
11
Department of Mathematics and Statistics,
University of Massachusetts, Amherst.
12
Decoding Gene Expression Control Using
Generalized Gamma Networks
Paola Sebastiani, Department of Mathematics and
Statistics, University of Massachusetts, Amherst
Marco Ramoni, Children's Hospital Informatics
Program, Harvard Medical School
13
Objective
  • Objective: to develop models for decoding gene
    control interactions, and elucidate gene
    functions. This is one of the goals of the modern
    approach to functional genomics.
  • Modern approach
  • Genome wide: complete sequence of the genome of
    many organisms.
  • Microarray based: technology for measurement of
    thousands of gene expression levels in parallel.
  • Statistically needy: incredible wealth of
    information that is not fully exploited or
    structured.

14
Central Dogma
  • A gene is expressed when it makes proteins.
  • Proteins are produced in two steps
  • Transcription: the gene code is transcribed into
    mRNA dual coding
  • Translation: the mRNA is transformed and
    transported out of the nucleus and translated
    into proteins.
  • Expression level = mRNA abundance

15
From Image to Data
[Figure: axes labeled "sample" and "genes"]
P. Sebastiani, E. Gussoni, I. S. Kohane and M.
Ramoni, (2003), Statistical Challenges in
Functional Genomics. (With discussion)
Statistical Science. To appear.
16
Common Analysis
  • Differential analysis: to identify genes that
    have different levels of expression in two or
    more conditions.
  • Classification rules: model the dependency
    between the genes that have differential
    expression and the classes, to build predictive
    models.
  • Cluster analysis: to identify groups of genes
    that have the same expression profiles.
    Intuition: genes that belong to the same group
    may have similar functions.

17
Information and Knowledge
  • Differential analysis, naïve predictive models
    and clustering techniques increase information
    but not knowledge about the gene interaction
    mechanisms.
  • Genes do not act independently, for example

18
Example
  • Prostate cancer data set, available from the
    website of the Cancer Genomics group at the
    Whitehead Institute.
  • 50 samples from normal tissues and 52 samples
    from cancer tissues. Expression levels of 12,625
    genes were measured with the Affymetrix U95a chip
    and MAS 5 software.
  • For about 10 genes, evidence of differential
    expression was found to be very strong (posterior
    probability close to 1). Selection done with
    BADGE (www.genomethods.org/badge).

19
Models
Normal specimens
Tumor specimens
20
Formalism
  • Bayesian network.
  • A directed acyclic graph in which nodes are
    random variables describing gene expression
    levels, and arcs define directed stochastic
    dependencies from parents to children.
  • Advantage: describes the dependency structure in
    probabilistic terms and simplifies it by using
    Markov properties. Look at a multivariate model
    by looking at its components.

[Figure: arc from Y1 (parent) to Y2 (child)]
21
Big Advantage
  • The DAG encodes local and global Markov
    properties that make learning and reasoning with
    the model very efficient.

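Local Markov property: $Y \perp ND(Y) \mid pa(Y)$.
Global Markov property: $Y \perp \mathcal{Y} \setminus MB(Y) \mid MB(Y)$.
Local examples: $Y_3 \perp Y_1 \mid Y_2$ and $Y_5 \perp (Y_1, Y_2) \mid Y_3, Y_4$.
Global example: $Y_3 \perp Y_1 \mid Y_2, Y_5, Y_4$.
ND: non-descendants of Y are all nodes that cannot
be reached from Y along a directed path.
MB: Markov blanket = parents + children + parents
of the children.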
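
A small sketch of how the Markov blanket and the non-descendants can be read off a DAG. The graph below (Y1 → Y2 → Y3 → Y5 ← Y4) is an assumption: it is one structure consistent with the independence statements on this slide, not necessarily the one drawn in the original figure. The DAG is stored as a dictionary mapping each node to its list of parents.

```python
def children(dag, node):
    """Nodes that list `node` among their parents."""
    return {c for c, ps in dag.items() if node in ps}

def markov_blanket(dag, node):
    """Parents, children, and the other parents of the children."""
    blanket = set(dag[node]) | children(dag, node)
    for child in children(dag, node):
        blanket |= set(dag[child])
    blanket.discard(node)
    return blanket

def descendants(dag, node):
    """All nodes reachable from `node` along a directed path."""
    found, stack = set(), [node]
    while stack:
        for child in children(dag, stack.pop()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def non_descendants(dag, node):
    """Every node other than `node` itself and its descendants."""
    return set(dag) - descendants(dag, node) - {node}

# Assumed example graph: node -> list of parents.
dag = {"Y1": [], "Y2": ["Y1"], "Y3": ["Y2"], "Y4": [], "Y5": ["Y3", "Y4"]}
mb_y3 = markov_blanket(dag, "Y3")   # {"Y2", "Y4", "Y5"}
nd_y3 = non_descendants(dag, "Y3")  # {"Y1", "Y2", "Y4"}
```

Note that this set-based definition of non-descendants includes the parents of the node, which the local Markov property conditions on.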
22
Factorization
Factorize the joint density as
$p(y_1, \ldots, y_v) = \prod_{i=1}^{v} p(y_i \mid pa(y_i))$,
where Yi is a child variable with parents pa(Yi).
23
Advantages
  • Can exploit the factorization of the likelihood
    for quantifying the network locally, and for
    learning the network structure by using a
    sequence of local searches.
  • Because the exhaustive search over all network
    structures is infeasible, many search algorithms
    have been developed to explore a subset of all
    possible networks.
  • Exact solutions for discrete, Gaussian and
    mixed networks if parents are discrete.

24
Distributional Assumptions
  • Microarrays produce data with skewed
    distributions.
  • Log-normal: take the logarithm, and the data are
    normal.
  • Gamma: the data remain asymmetrical
    (exponential).

25
Gamma Networks
  • Gamma distributions are flexible enough to
    describe a variety of different shapes.

26
Generalized Gamma Networks
  • Encode general nonlinear dependencies

Can choose different link functions
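
The slide's equations did not survive in this transcript. As a hedged sketch of the usual setup, assume each node follows a Gamma distribution with shape $\alpha_i$ and mean $\mu_i$, and that a link function $g$ ties the mean to a predictor $\eta_i$ built from the parent values. The symbols $\eta_i$ and $g$ and the three listed links are assumptions of this sketch.

```latex
Y_i \mid pa(y_i) \sim \mathrm{Gamma}(\alpha_i, \mu_i), \qquad
p(y_i \mid pa(y_i)) =
  \frac{\alpha_i^{\alpha_i}}{\Gamma(\alpha_i)\,\mu_i^{\alpha_i}}\,
  y_i^{\alpha_i - 1}\, e^{-\alpha_i y_i / \mu_i}, \qquad
g(\mu_i) = \eta_i

% Common link choices (assumed):
g(\mu) = \mu \ \text{(identity)}, \qquad
g(\mu) = 1/\mu \ \text{(inverse)}, \qquad
g(\mu) = \log \mu \ \text{(log)}
```

Under this parametrization the mean is $\mu_i$ and the variance is $\mu_i^2 / \alpha_i$, so the shape controls how skewed the node's distribution is.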
27
Learning Dependencies
  • Given a link function, and a specification of the
    linear predictor
  • Use MCMC to estimate the parameters.
  • Compute standard MLE of the parameters θi. These
    can be computed independently of αi.
  • Given θi, compute MLE of αi (efficient procedure
    to compute this using a Taylor expansion to order
    5 about the moment estimator).
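
A rough sketch of the two-stage idea for the simplest case, a single Gamma node with no parents. The mean is estimated by the sample mean, independently of the shape, and the shape is then found by solving the likelihood equation log(α) − ψ(α) = log(mean) − mean(log y). This sketch uses plain bisection with a numerical digamma, not the Taylor expansion mentioned on the slide, and all function names and the simulated data are illustrative.

```python
import math
import random

def digamma(x, h=1e-5):
    """Numerical derivative of log-gamma (central difference)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def moment_shape(ys):
    """Moment estimator of the shape: mean^2 / variance."""
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / (n - 1)
    return mean * mean / var

def mle_shape(ys, lo=1e-2, hi=1e3, tol=1e-8):
    """MLE of the shape by bisection; assumes the root lies in [lo, hi]."""
    n = len(ys)
    mean = sum(ys) / n
    s = math.log(mean) - sum(math.log(y) for y in ys) / n
    # log(a) - digamma(a) decreases in a, so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) - s > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Simulated data: shape 3, scale 2 (so the mean is 6).
rng = random.Random(1)
ys = [rng.gammavariate(3.0, 2.0) for _ in range(5000)]
mean_hat = sum(ys) / len(ys)      # estimate of the mean parameter
alpha_moment = moment_shape(ys)   # starting point used by the slide's method
alpha_mle = mle_shape(ys)
```

The moment estimator gives a cheap first guess, and the likelihood equation refines it.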

28
Likelihood Function
A product of family contributions
Multiply over k to get the full likelihood
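
The formula itself is missing from this transcript. A hedged reconstruction, assuming n cases indexed by k, v nodes indexed by i, and node parameters $(\alpha_i, \theta_i)$:

```latex
L(\theta, \alpha)
  = \prod_{k=1}^{n} \prod_{i=1}^{v}
    p\left(y_{ik} \mid pa(y_i)_k, \alpha_i, \theta_i\right)
```

Each inner factor is the contribution of one family (a node and its parents) for case k, which is why the parameters of different families can be estimated separately.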
29
Learning the Structure
  • Search the model space and assign a score to each
    network structure.
  • Select the network with optimal score.
  • The score is the posterior probability, or
    marginal likelihood when all networks are a
    priori equally likely.
  • We use BIC-MDL, which factorizes according to the
    family structures and allows local learning.
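
For reference, the standard form of a BIC/MDL score that factorizes over families, written as a quantity to minimize. Here $\hat{L}_i$ is the maximized likelihood of family i, $d_i$ its number of free parameters, and n the number of cases. This notation is an assumption, since the slide's own formula is not in the transcript.

```latex
\mathrm{MDL}
  = \sum_{i=1}^{v} \left[ -\log \hat{L}_i + \frac{d_i}{2} \log n \right]
```

Because the score is a sum of per-family terms, changing the parents of one node only changes that node's term.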

30
Greedy Search
For each Y, have the set of k possible parents.
  • In each level, choose the dependency with minimum
    MDL.
  • If smaller than the smallest in the previous
    level, accept and move up.
  • Otherwise stop and accept the last smallest in
    previous level.

[Figure: search levels, bottom to top: Y alone (1), One parent (k), Two parents, Three parents, Four parents]
Procedure implemented in Bayesware Discoverer
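
A minimal sketch of the level-by-level search described on this slide. It is not the Bayesware Discoverer implementation. It assumes that each level extends the best parent set of the previous level by one candidate, and that a user-supplied `score(child, parents)` returns an MDL-like value where lower is better. The toy score and gene names are made up.

```python
def greedy_parent_search(child, candidates, score, max_parents=4):
    """Add one parent per level while the best score keeps decreasing."""
    current = frozenset()
    best = score(child, current)          # level 0: Y alone
    remaining = set(candidates)
    while remaining and len(current) < max_parents:
        # Score every one-parent extension of the current set.
        level = [(score(child, current | {c}), c) for c in sorted(remaining)]
        level_score, choice = min(level)
        if level_score >= best:
            break                         # no improvement: keep previous level
        current = current | {choice}
        best = level_score
        remaining.remove(choice)
    return current, best

def toy_score(child, parents):
    """Hypothetical score: fit gain per parent minus a complexity penalty."""
    gain = {"g1": 5.0, "g2": 3.0, "g3": 0.2}
    return 10.0 - sum(gain[p] for p in parents) + 1.0 * len(parents)

parents, final_score = greedy_parent_search("y", ["g1", "g2", "g3"], toy_score)
```

With the toy score, g1 and g2 are accepted in turn, and the search stops when adding g3 would raise the score.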
31
Models
Normal specimens
Tumor specimens
32
Validation of the Structure
  • Use blanket residuals for validation.
  • For each case in the data set
  • Compute the expected value given its Markov
    blanket
  • Repeat for all cases in the dataset

33
Structural Differences
Normal specimens
Tumor specimens
37605 (collagen) is independent of all other
nodes in the normal network, and becomes a child
of 914 (an oncogene with transcription regulation
functions) in the tumor network.
34
Dependency Differences
Normal specimens
m1/(.011.8/y.40282)
Tumor specimens
m1/(0.022/y.40282)
32598 gene with putative growth and
transcription regulation functions
35
Reasoning
Networks learned from 50 normal prostatectomy
samples
To fully decode gene interactions, propagate
evidence in the network.
36
Evidence Propagation
  • Well known algorithms work for networks with all
    discrete, or all Gaussian variables.
  • Problems with all Gamma distributed variables
  • The joint/marginal distributions are not easy to
    compute.
  • We can use stochastic algorithms for
    probabilistic reasoning.
  • Several algorithms developed in the 90s. Focus on
    discrete networks. (Cheng and Druzdzel, AIJ
    2000).
  • Gibbs sampling appears to be the most promising.

37
Gibbs Sampling
Assumptions: standard links.
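
A toy sketch of Gibbs-style evidence propagation, assuming a hypothetical chain Y1 → Y2 → Y3 of Gamma nodes with inverse links of the form 1/μ = b0 + b1/parent. All parameter values are made up. Evidence is set on Y1. Y3 is a leaf, so it can be drawn exactly from its conditional. The full conditional of Y2 is not a standard distribution, so this sketch substitutes a Metropolis step inside the sweep (Metropolis-within-Gibbs), which is not necessarily what the authors used.

```python
import math
import random

def log_gamma_pdf(y, shape, mean):
    """Log density of a Gamma variable parametrized by shape and mean."""
    return (shape * math.log(shape / mean) - math.lgamma(shape)
            + (shape - 1.0) * math.log(y) - shape * y / mean)

def inverse_link_mean(b0, b1, parent):
    """Mean under the assumed inverse link: 1/mu = b0 + b1/parent."""
    return 1.0 / (b0 + b1 / parent)

def mh_update(y, log_target, step, rng):
    """One random-walk Metropolis step on log(y); keeps y positive."""
    proposal = y * math.exp(step * rng.gauss(0.0, 1.0))
    # The log(proposal) - log(y) term is the Jacobian of the log-scale walk.
    log_ratio = (log_target(proposal) + math.log(proposal)
                 - log_target(y) - math.log(y))
    if math.log(1.0 - rng.random()) < log_ratio:
        return proposal
    return y

def gibbs_chain(y1, n_sweeps=20000, burn_in=2000, seed=0):
    """Sample (Y2, Y3) given evidence Y1 = y1."""
    rng = random.Random(seed)
    a2, a3 = 4.0, 4.0                  # hypothetical shape parameters
    b2, b3 = (0.1, 0.8), (0.2, 0.6)    # hypothetical link coefficients
    mu2 = inverse_link_mean(b2[0], b2[1], y1)
    y2, y3 = mu2, 1.0
    draws = []
    for sweep in range(n_sweeps):
        # Y2 given its Markov blanket: parent Y1 and child Y3.
        def target2(v):
            mu3_v = inverse_link_mean(b3[0], b3[1], v)
            return log_gamma_pdf(v, a2, mu2) + log_gamma_pdf(y3, a3, mu3_v)
        y2 = mh_update(y2, target2, 0.5, rng)
        # Y3 given its Markov blanket (parent Y2 only): exact Gamma draw.
        mu3 = inverse_link_mean(b3[0], b3[1], y2)
        y3 = rng.gammavariate(a3, mu3 / a3)
        if sweep >= burn_in:
            draws.append((y2, y3))
    return draws

draws = gibbs_chain(y1=2.0)
mean_y2 = sum(d[0] for d in draws) / len(draws)
```

With Y1 fixed at 2.0 the assumed link gives a mean of 2.0 for Y2, so the sample mean of the Y2 draws should land near that value. Marginal densities can be estimated from the same draws.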
38
Normal Specimens
Marginal densities when no evidence is propagated.
39
Normal Specimens
40
Tumor Specimens
Network learned from the prostate cancer data set:
52 tumor samples. All inverse links. Compared to
the other network, 40282_S_AT has less expression,
whereas the tumor-related genes have higher
expression.
41
Normal Specimens
Mean = 16
a marker of tumor differentiation
Mean = 12
Growth and differentiation factors
Observe 40282_S_AT = 300 (average value in normal
specimens). Gene supposed to have a role in
immune system biology.
42
Tumor Specimens
Evidence = 70
Look at how changes in 40282_S_AT determine
changes of expression in tumor markers.
43
Open Issues
  • Decoding gene control: how to report the results?
  • Mixing nodes with different distributions.
  • How feasible is Gibbs sampling in large networks?
  • Can we improve learning?
  • In silico biology.

44
www.genomethods.org/caged
www.genomethods.org/badge
www.genomethods.org/best
www.bayesware.com