Title: Microarray data analysis
1Bioinformatics I the path of array data analysis
- Microarray data analysis
- Signal computation
- Data normalization
- Gene filtration
- Expression profile clustering
- Knowledge mining
3Bioinformatics I array data normalization
Analysis Software
- Microarray Analysis Suite 5.0 (MAS 5.0): Affymetrix
- dChip: Li and Wong
- Bioconductor (open source): RMA package from bioconductor.org
4Bioinformatics I -- Normalization
Analysis Software: Gene filtration
Step 1: Normalization of data from GeneChips
Normalization and scaling even out global signal differences
between arrays that may be due to biological (background,
sample treatment), biochemical/molecular biological (buffers,
enzymes, reaction conditions) or technological reasons (GeneChip
quality, hybridization conditions). This procedure enables you
to compare an experiment array to a control (baseline) array.
5Bioinformatics I -- Normalization
Analysis Software: Gene filtration in Microarray Analysis Suite 5.0 (MAS 5.0)
Step 1: Normalization of data from GeneChips
- Scaling adjusts both the experimental and the control array to a
  user-defined value.
- Normalization adjusts the average signal of the experiment to the
  average signal of the control (baseline) data set.
6Bioinformatics I -- Normalization
Analysis Software: Gene filtration
Step 1: Normalization of data from GeneChips
Scaling adjusts the average signal of the experiment and the average
signal of the control (baseline) data set to a user-specified value.
A scaling factor sf is calculated according to the chosen setting.
- All probe sets / Selected probe sets:
  $$ sf = \frac{Sc}{\mathrm{TrimMean}\left(2^{\mathrm{SignalLogValue}_i},\ 0.02,\ 0.98\right)} $$
  Sc: target signal (default 500)
  SignalLogValue_i: values of the probesets indicated in the settings
  TrimMean: average of the values after discarding the top and bottom 2%
- User defined: if the settings indicate user-specified normalization,
  then sf = user value.
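The scaling step can be sketched in a few lines of Python. This is a minimal illustration of the formula on this slide, not the vendor implementation: the helper names `trim_mean` and `scaling_factor` and the example values are assumptions, and the trimming simply rounds the number of discarded values down.

```python
def trim_mean(values, trim=0.02):
    """Mean after discarding the lowest and highest `trim` fraction of values.

    Simplification (assumption): the number of values cut at each end is
    rounded down; the original software may handle fractions differently.
    """
    ordered = sorted(values)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


def scaling_factor(signal_log_values, target=500.0, trim=0.02):
    """sf = Sc / TrimMean(2 ** SignalLogValue_i), with Sc the target signal."""
    signals = [2.0 ** v for v in signal_log_values]
    return target / trim_mean(signals, trim)


# Hypothetical log2 signals for 100 probesets: 96 typical values plus
# two very low and two very high outliers that the trimming discards.
log_signals = [8.0] * 96 + [0.0, 0.0, 20.0, 20.0]
sf = scaling_factor(log_signals)

# Multiplying every signal by sf moves the trimmed mean to the target.
scaled = [sf * 2.0 ** v for v in log_signals]
```

Because the trimmed mean ignores the extreme 2% at each end, a handful of saturated or empty probesets does not shift the scaling factor.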
7Bioinformatics I -- Normalization
Analysis Software: Gene filtration
Step 1: Normalization of data from GeneChips
Normalization adjusts the average signal of the experiment to the
average signal of the control (baseline) data set. A normalization
factor nf is calculated according to the chosen setting.
- All probe sets / Selected probe sets: nf is calculated from
  SPVb_i: Scaled Probe Value of the baseline signal for probeset i
  SPVe_i: Scaled Probe Value of the experiment signal for probeset i
- User defined: if the settings indicate user-specified normalization,
  then nf = user value.
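The formula for nf is not preserved in the slide text. The sketch below assumes that nf is the ratio of the trimmed mean of the baseline values to the trimmed mean of the experiment values, using the same 2% trimming as the scaling step. The function names and the data are hypothetical.

```python
def trim_mean(values, trim=0.02):
    """Mean after discarding the lowest and highest `trim` fraction of values
    (the number of discarded values is rounded down; a simplification)."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


def normalization_factor(spv_baseline, spv_experiment, trim=0.02):
    """Assumed form: nf = TrimMean(SPVb_i) / TrimMean(SPVe_i)."""
    return trim_mean(spv_baseline, trim) / trim_mean(spv_experiment, trim)


# Hypothetical scaled probe values for 60 probesets on each array.
baseline = [400.0, 500.0, 600.0] * 20     # trimmed mean 500
experiment = [200.0, 250.0, 300.0] * 20   # trimmed mean 250

nf = normalization_factor(baseline, experiment)

# Applying nf brings the experiment array onto the baseline level.
normalized = [nf * v for v in experiment]
```

In this toy case the experiment array is half as bright as the baseline overall, so every experiment value is doubled.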
8Bioinformatics I -- Filtration
Analysis Software: Gene filtration
Step 2: Remove genes that display background signals
Usually you filter out genes for which you have Absent calls, because
an Absent call indicates a problem: either the gene is expressed at
too low a level or the MM signal is higher than the PM signal.
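A filter on Absent calls might look like the following sketch. It assumes the detection calls are available as one letter per array ("P" for Present, "M" for Marginal, "A" for Absent). The probeset names, the function name and the `max_absent` threshold are hypothetical and should be adapted to the experiment.

```python
def filter_absent(detection_calls, max_absent=0):
    """Keep probesets with at most `max_absent` Absent ('A') calls.

    detection_calls maps a probeset name to its list of calls, one per array.
    """
    return {
        probeset: calls
        for probeset, calls in detection_calls.items()
        if calls.count("A") <= max_absent
    }


# Hypothetical detection calls for three probesets on three arrays.
calls = {
    "probeset_1": ["P", "P", "P"],
    "probeset_2": ["A", "P", "M"],
    "probeset_3": ["A", "A", "A"],
}

strict = filter_absent(calls)                  # no Absent call tolerated
lenient = filter_absent(calls, max_absent=1)   # one Absent call tolerated
```

A strict filter can discard genes that are switched off in one condition only, which are often the interesting ones, so the threshold is a design choice.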
9Bioinformatics I -- Filtration
Analysis Software: Gene filtration
Step 3: Filter genes that display a signal change during Comparison Analysis
MAS 5.0 uses the Change Algorithm to calculate a Change p-value and an
associated Change call. Each probeset on the experiment array is
compared to its counterpart on the baseline (control) array, and a
Change p-value is calculated that indicates an Increase (I), Marginal
Increase (MI), Decrease (D), Marginal Decrease (MD) or No Change (NC)
in gene expression.
10Bioinformatics I -- Filtration
Analysis Software: Gene filtration
Step 3: Filter genes that display a signal change during Comparison Analysis
MAS 5.0 employs the Change Algorithm to calculate a Change p-value and
an associated Change call. Wilcoxon's signed rank test uses the
differences between PM and MM as well as between PM and background
signals to compute one final p-value that ranges from 0.0 to 1.0. This
value tells you how likely it is that the signal changes and in which
direction (increase, decrease):
- Values near 0.0: Increase
- Values near 0.5: No Change
- Values near 1.0: Decrease
11Bioinformatics I -- Filtration
Analysis Software: Gene filtration
The Change p-value scale is divided by the cutoff values γ1 and γ2,
which provide the boundaries for the Change calls.
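How two cutoffs turn a Change p-value into a call can be sketched as follows. The symmetric rule (Increase below the first cutoff, Decrease above one minus the first cutoff) is an assumption consistent with the p-value scale described on the previous slide. The default cutoff values are placeholders for illustration, not documented software defaults.

```python
def change_call(p_value, gamma1=0.0025, gamma2=0.003):
    """Map a Change p-value to I, MI, NC, MD or D.

    gamma1 < gamma2 are the cutoffs; the defaults are illustrative only.
    """
    if p_value < gamma1:
        return "I"    # Increase
    if p_value < gamma2:
        return "MI"   # Marginal Increase
    if p_value > 1.0 - gamma1:
        return "D"    # Decrease
    if p_value > 1.0 - gamma2:
        return "MD"   # Marginal Decrease
    return "NC"       # No Change


# Hypothetical Change p-values for five probesets.
p_values = [0.0001, 0.0027, 0.5, 0.9972, 0.9999]
calls = [change_call(p) for p in p_values]
```

Widening the gap between the two cutoffs enlarges the marginal zone, while lowering both makes the Increase and Decrease calls more conservative.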
12Bioinformatics I -- Filtration
Analysis Software: Gene filtration
Major disadvantage of MAS 5.0: you can only compare two samples at a
time! This gets out of hand quickly if you perform complex
experiments. The analysis is based upon PM-MM values!
13Bioinformatics I -- Filtration
The model behind MAS 5.0:
$$ \log(PM_{ij} - CT_{ij}) = \log(\theta_i) + \varepsilon_{ij}, \quad j = 1, \dots, J $$
- CT replaces MM when MM > PM (this avoids the log of negative numbers)
- $\theta_i$: expression quantity on array i
- $\varepsilon_{ij}$: error
- i: arrays, j: probe pairs
Problem: the error does not have equal variance for j = 1, ..., J
>> larger mean intensities have larger variances!
14Bioinformatics I -- Filtration
Observations: Probe intensities can vary more within a probeset than
identical probes do between two GeneChips
>> there is a strong probe affinity effect
Li and Wong propose:
$$ PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \quad i = 1, \dots, I, \quad j = 1, \dots, J $$
- $\phi_j$: probe affinity effect (estimated from an empirical analysis
  of a sufficient number of arrays); outlier probes are removed.
dChip implements this model and yields the dChip expression measure.
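A multiplicative model of this kind can be fitted by alternating least squares, as in the sketch below. This is an illustration under stated assumptions, not the dChip implementation: outlier removal is left out, the probe effects are constrained so that their squares sum to J, and the function name and data are hypothetical.

```python
def fit_multiplicative_model(y, n_iter=50):
    """Fit y[i][j] ~ theta[i] * phi[j] by alternating least squares.

    y[i][j] is the PM - MM difference for array i and probe pair j.
    Assumed constraint: sum of phi[j] ** 2 equals the number of probes J.
    """
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # Least-squares update of the expression values, probe effects fixed.
        denom = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / denom
                 for i in range(n_arrays)]
        # Least-squares update of the probe effects, expression values fixed.
        denom = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / denom
               for j in range(n_probes)]
        # Rescale so that the sum of squared probe effects equals J.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
    # Final expression values consistent with the rescaled probe effects.
    denom = sum(p * p for p in phi)
    theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / denom
             for i in range(n_arrays)]
    return theta, phi


# Hypothetical noise-free data: 3 arrays, 5 probe pairs.
true_theta = [100.0, 200.0, 400.0]
true_phi = [0.5, 0.5, 0.5, 0.5, 2.0]   # squares sum to 5 = J
y = [[t * p for p in true_phi] for t in true_theta]

theta, phi = fit_multiplicative_model(y)
```

On noise-free data the fit recovers the expression values and probe effects. With real data, the per-probe residuals are what would flag outlier probes.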
15Bioinformatics I raw data computation and
normalization
16Bioinformatics I -- Filtration
The Robust Multi-array Average is a summary
measure of background-adjusted, normalized and
log-transformed PM values. The normalization
procedure is based upon feature/probe level
intensities
17Bioinformatics I -- Filtration
RMA is based on the model:
$$ T(PM_{ij}) = e_i + a_j + \varepsilon_{ij}, \quad i = 1, \dots, I, \quad j = 1, \dots, J $$
- T: transformation that corrects for background, normalizes and logs
  the PM intensities
- $e_i$: log2 expression value found on arrays i = 1, ..., I
- $a_j$: log-scale affinity effects for probes j = 1, ..., J
- $\varepsilon_{ij}$: error
MM does contain some true signal, but mathematical subtraction
does not reflect biological subtraction!
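The slide does not say how the additive model is fitted. Median polish is the robust procedure usually associated with RMA, and the sketch below applies it to values that are assumed to be already background-corrected, normalized and log2-transformed. The function name and the data are hypothetical.

```python
from statistics import median


def median_polish(x, n_iter=10):
    """Fit x[i][j] ~ overall + row_eff[i] + col_eff[j] by median polish.

    x[i][j] is the transformed value T(PM_ij) for array i and probe j.
    Returns the per-array expression estimates (overall + array effect),
    the probe effects and the residuals.
    """
    resid = [row[:] for row in x]
    n_rows, n_cols = len(x), len(x[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the array (row) medians out of the residuals.
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # Sweep the probe (column) medians out of the residuals.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
    expression = [overall + r for r in row_eff]
    return expression, col_eff, resid


# Hypothetical noise-free log2 data: 3 arrays, 3 probes.
true_e = [8.0, 10.0, 9.0]
true_a = [-1.0, 0.0, 2.0]   # probe effects with median 0
x = [[e + a for a in true_a] for e in true_e]

expression, probe_effects, residuals = median_polish(x)
```

Because medians are used instead of means, a single aberrant probe on one array mostly ends up in the residuals rather than in the expression estimate.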
18Bioinformatics I -- Filtration
Log2 fold change estimates of differential
expression between liver and CNS samples were
computed for arrays hybridized to 1.25 µg of cRNA
and plotted against the same estimates from
arrays hybridized to 20 µg of cRNA
>>> RMA produces only very few discrepancies
compared to MAS and dChip
19Bioinformatics I -- Clustering
Analysis Software
The purpose of clustering is to group together genes that display
similar expression patterns.
- Most algorithms are hierarchical and unsupervised.
- Supervised clustering uses biological information to bias
  the clustering algorithm.
20Bioinformatics I -- Clustering
- Analysis Software
- Hierarchical clustering uses an agglomerative
approach to form a single hierarchical tree. It
follows a five-step procedure:
  1. The pair-wise distance matrix is calculated for all patterns/genes.
  2. The matrix is searched for the two most similar patterns/genes.
  3. The two selected clusters, each containing one pattern/gene, are merged.
  4. The distances are calculated between this new cluster (which now
     contains two patterns/genes) and all other clusters.
  5. Steps 2-4 are iterated until all patterns/genes are in one cluster.
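The agglomerative procedure can be sketched in plain Python. The slide does not specify a distance measure or a linkage rule, so Euclidean distance and average linkage are assumptions here, and distances are recomputed in every round instead of updating a stored matrix. The function names and the profiles are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_cluster(profiles):
    """Agglomerative clustering with average linkage.

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are indices into `profiles`.
    """
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        # Average of all pairwise distances between the two clusters.
        total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Search for the two most similar clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Merge them and repeat until a single cluster remains.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical expression profiles: two pairs of similar genes.
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
merges = hierarchical_cluster(profiles)
```

The merge history is the tree: reading it from first to last merge gives the dendrogram from the leaves to the root.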
21Bioinformatics I -- Clustering
- Analysis Software
- K-means clustering bins patterns/genes into a
user-specified number of clusters. It follows a
five-step procedure:
  1. All patterns/genes are assigned at random to a number k of clusters.
  2. An average expression vector is computed for each cluster and used
     to calculate the distances between the clusters.
  3. Patterns/genes are iteratively moved between clusters, and the
     intra- and inter-cluster distances are computed after each round
     of shuffling. A pattern stays in its new cluster only when it is
     closer to it than to the previous one.
  4. After each round the expression vectors for each cluster are
     recomputed.
  5. The process stops when moving patterns/genes would decrease
     cluster similarity.
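A common way to implement this idea is the standard k-means iteration sketched below, which reassigns every pattern to its nearest cluster average and stops when no assignment changes. It is a simplified stand-in for the procedure on the slide. Euclidean distance, the balanced random start, the seed and the function names are assumptions, and the profiles are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(profiles, k, seed=0, max_iter=100):
    """Cluster profiles into k groups; returns (labels, centroids)."""
    if k > len(profiles):
        raise ValueError("k must not exceed the number of profiles")
    rng = random.Random(seed)
    # Random start: shuffle, then deal the profiles out to the k clusters
    # so that no cluster starts empty.
    order = list(range(len(profiles)))
    rng.shuffle(order)
    labels = [0] * len(profiles)
    for position, index in enumerate(order):
        labels[index] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Average expression vector of each cluster; a cluster that has
        # become empty keeps its previous centroid.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Move every profile to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break   # no profile moved: the clustering is stable
        labels = new_labels
    return labels, centroids


# Hypothetical expression profiles: two well-separated groups of genes.
profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(profiles, k=2)
```

The result depends on the random start when the groups are less clearly separated, so it is worth running the algorithm with several seeds and several values of k.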
22Bioinformatics I -- Clustering
23Bioinformatics I -- Interpretation
Knowledge extraction/interpretation
This is the process of extracting biological information from
an expression profiling experiment (how many and which genes
respond to a stimulus, cell cycle progression or developmental
processes). Many analysis programs now have integrated features
that facilitate the identification of pathways in a genomics dataset.
- Bayesian Networks (Friedman et al.): This approach attempts to
  estimate correlations and interdependencies between genes based
  upon yeast expression data.
  http://www.cs.huji.ac.il/labs/compbio/expression/tour
- Modelling the yeast cell cycle (John Tyson et al.): Employs
  differential equations to simulate the cell cycle and to predict
  the effect of mutations.
  http://mpf.biol.vt.edu/
24Bioinformatics I -- Software
Publication
- Data repository: You must now upload your raw data to ArrayExpress
  or the Gene Expression Omnibus (GEO) and provide an accession number
  upon submission of your manuscript.
- Database: You can (but do not have to) provide your data in a
  web-accessible database that may include a graphical display of
  the data.
- Web portal: Genomics experiments usually require a web portal that
  provides background information, images, downloadable files and
  web links.
25Bioinformatics I -- Summary
Summary
- Hypothesis: Design your experiment such that you test a hypothesis,
  and try to create conditions where you can correlate expression
  patterns with a developmental stage, a pathology or a specific
  experimental condition (treated versus untreated, heat shock,
  wild type versus mutant).
- Data analysis I, raw data computation: Bear in mind the pros and
  cons of the algorithm that produces your signals (MAS 5.0, dChip,
  RMA) and remember that you can twist and tweak the system.
- Data analysis II, gene filtration: Use appropriate normalization or
  scaling procedures and try several different algorithms that yield
  differentially expressed genes.
- Data analysis III, clustering: There is no perfect clustering
  algorithm that works for every experiment. Try a combination of
  hierarchical and k-means algorithms (using different correlation or
  similarity measures for each of them) or Self-Organizing Maps (SOMs)
  to get groups of genes that display similar expression patterns.
26Bioinformatics I -- eXam
February 10, 2004 at Hoersaal II between 9:00 am
and 10:00 am Regular exam...