Title: Analysis of Microarray Data Using EXPANDER and SHARP
Slide 1: Analysis of Microarray Data Using EXPANDER and SHARP
Slide 2: EXPANDER work flow
- Input data
- Normalization / Filtering
- Links to public annotation DBs (Hs, Mm, Rn, Dm, S.cer)
- Visualization utilities
- Clustering (CLICK, SOM, K-means, Hierarchical)
- Biclustering (SAMBA)
- Functional enrichment (TANGO)
- Promoter signals (PRIMA)
Slide 4: EXPANDER Input Data
- Input data
- Expression matrix (probes in rows, conditions in columns)
- One-channel data (e.g., Affymetrix)
- Dual-channel data (cDNA microarrays; data are (log) ratios between the Red and Green channels)
- ID conversion file: maps probe IDs to gene IDs
Slide 5: 1. Normalization
Slide 6: Outline
- What is normalization
- Why is normalization needed
- Three quantitative methods for normalization
- Software tools
Slide 7: Hybridization of the same sample to 2 chips/channels
- Ideally, the scatter plot coincides with the x = y diagonal
- Due to random errors, we expect to see a cloud around the x = y diagonal
[Figure: scatter plot; axes: Probe intensity - 1 and Probe intensity - 2]
Slide 8: Hybridization of the same sample to 2 chips/channels
- In practice: both random and systematic measurement errors (bias)
- Due to biases, scatter plots are not centered around the x = y diagonal
Slide 9: Hybridization of the same sample to 2 chips/channels
Slide 10: Normalization: the process of removing systematic errors (biases) from the data
Slide 11: Sources of Systematic Errors
- Different incorporation efficiency of dyes
- Different amounts of mRNA
- Experimenter/protocol issues (comparing chips processed by different labs)
- Different scanning parameters
- Batch bias
Slide 12: Normalization - two problems
- How to detect biases? Which genes to use for estimating biases among chips/channels?
- How to remove the biases?
Slide 13: Which Genes to use for bias detection?
- All genes on the chip
- Assumption: most of the genes are equally expressed in the compared samples; the proportion of differential genes is low (< 20%)
- Limits
- Not appropriate when comparing highly heterogeneous samples (different tissues)
- Not appropriate for analysis of dedicated chips (apoptosis chips, inflammation chips, etc.)
Slide 14: Which Genes to use for bias detection?
- Housekeeping genes
- Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples
- Affy novel chips: normalization set of 100 genes
- NHGRI's cDNA microarrays: set of 70 "housekeeping" genes
- Limits
- The validity of the assumption is questionable
- Housekeeping genes are usually expressed at high levels, so they are not informative for the low-intensity range
Slide 15: Which Genes to use for bias detection?
- Spiked-in controls from another organism, over a range of concentrations
- Limits
- Low number of controls: less robust
- Can't detect biases due to differences in RNA extraction protocols
- Invariant set
- Trying to identify genes that are expressed at similar levels in the compared samples without relying on any prior knowledge
- Rank the genes in each chip according to their expression level
- Find genes with a small change in ranks
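The rank-based idea behind the invariant set can be sketched as follows. This is a minimal illustration only, not the procedure of any particular tool; the function names and the `max_rank_shift` threshold are hypothetical, and ties are broken by position for simplicity.

```python
def ranks(values):
    """Return the rank (0 = lowest) of each value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r


def invariant_set(chip_a, chip_b, max_rank_shift):
    """Indices of genes whose rank differs by at most max_rank_shift between two chips."""
    ra, rb = ranks(chip_a), ranks(chip_b)
    return [i for i in range(len(chip_a)) if abs(ra[i] - rb[i]) <= max_rank_shift]


# Toy intensities for five genes on two chips (made-up numbers).
chip_a = [120.0, 80.0, 300.0, 50.0, 1000.0]
chip_b = [150.0, 400.0, 390.0, 60.0, 1300.0]

strict = invariant_set(chip_a, chip_b, max_rank_shift=0)   # genes with identical rank
relaxed = invariant_set(chip_a, chip_b, max_rank_shift=1)  # allow a shift of one rank
print(strict, relaxed)
```

The selected genes would then serve as the reference set for estimating the bias between the two chips.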
Slide 16: Normalization Methods
Slide 17: 1. Global normalization (Scaling)
- A single normalization factor (k) is computed for balancing chips/channels
- X_i^norm = k · X_i
- Multiplying intensities by this factor equalizes the mean (median) intensity among the compared chips
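A minimal sketch of scaling, assuming that k is chosen as the ratio between the median intensity of a reference chip and the median intensity of the chip being normalized. The choice of a reference chip and the function names are assumptions for illustration.

```python
from statistics import median


def scaling_factor(chip, reference, center=median):
    """k such that center(k * chip) equals center(reference)."""
    return center(reference) / center(chip)


def scale_chip(chip, reference, center=median):
    """Apply X_i^norm = k * X_i to every intensity on the chip."""
    k = scaling_factor(chip, reference, center)
    return [k * x for x in chip]


# Toy data: the second chip is uniformly half as bright as the reference.
reference = [100.0, 200.0, 400.0, 800.0]
chip = [50.0, 100.0, 200.0, 400.0]

k = scaling_factor(chip, reference)
scaled = scale_chip(chip, reference)
print(k, scaled)
```

Passing `center=statistics.mean` instead equalizes the mean rather than the median.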
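Slide 18: Global Normalization
[Figure: panels "Before" and "After"]
Slide 19: Boxplots
[Figure: boxplot of log(intensity), marking the upper quartile, median intensity, and lower quartile]
Slide 20: [Figure: panels "Before Normalization" and "After Scaling"]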
Slide 21: 2. Intensity-dependent normalization (Yang, Speed)
- Lowess (local linear fit)
- Compensates for intensity-dependent biases
Slide 22: Detect Intensity-dependent Biases: M vs A plots
- X axis: A, the average intensity
- A = 0.5 · log(Cy3 · Cy5)
- Y axis: M, the log ratio
- M = log(Cy3/Cy5)
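The two quantities plotted in an M vs A plot can be computed as in the sketch below. The slide does not state the base of the logarithm; base 2 is assumed here, and the function name is hypothetical.

```python
import math


def ma_values(cy3, cy5):
    """Return (M, A) per probe: M = log2(Cy3/Cy5), A = 0.5 * log2(Cy3 * Cy5)."""
    m = [math.log2(g / r) for g, r in zip(cy3, cy5)]
    a = [0.5 * math.log2(g * r) for g, r in zip(cy3, cy5)]
    return m, a


# Toy intensities for three probes (made-up numbers, powers of two for readability).
cy3 = [1024.0, 256.0, 64.0]
cy5 = [256.0, 256.0, 256.0]

m, a = ma_values(cy3, cy5)
print(m)  # log ratios
print(a)  # average log intensities
```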
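Slide 23: We expect the M vs A plot to look like
[Figure: M vs A plot; y axis: M = log(Cy3/Cy5); x axis: A]
Slide 24: Intensity-dependent bias
- Global normalization cannot remove intensity-dependent biases
[Figure: M vs A plot; y axis: M = log(Cy3/Cy5); x axis: A]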
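Intensity-dependent normalization fits M as a smooth function of A and subtracts the fitted curve. The sketch below is a simplified stand-in for lowess: a tricube-weighted local linear fit without the robustness iterations of the full procedure. The function name and the `frac` parameter (fraction of points used in each local fit) are assumptions for illustration.

```python
def local_linear_fit(a, m, frac=0.5):
    """Fit M as a smooth function of A with tricube-weighted local straight lines."""
    n = len(a)
    k = max(2, int(round(frac * n)))  # number of neighbours used in each local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]  # tricube weights
        sw = sum(w)
        xbar = sum(wi * x for wi, x in zip(w, a)) / sw
        ybar = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ybar + slope * (x0 - xbar))
    return fitted


# Toy data: M drifts linearly with A, i.e. a purely intensity-dependent bias.
a = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m = [0.1 * x - 0.5 for x in a]

fitted = local_linear_fit(a, m)
normalized = [mi - fi for mi, fi in zip(m, fitted)]  # M after subtracting the fitted bias
print(normalized)  # values close to zero
```

A single scaling factor would shift all M values by the same constant and leave this drift in place; subtracting the fitted curve removes it.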
Slide 26: 3. Quantile Normalization
[Figure: panels "Before Normalization" and "After Scaling"]
Slide 27: Quantile normalization: equalizing the entire distribution
Slide 28: Quantile Normalization
- Sort intensities in each chip
- Compute the mean intensity at each rank across the chips
- Replace each intensity by the mean intensity at its rank
[Figure: Chip 1, Chip 2, Chip 3 and the resulting average chip]
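The three steps of quantile normalization can be sketched as follows. This is a minimal illustration with a hypothetical function name; ties within a chip are broken by position, which is a simplification.

```python
def quantile_normalize(chips):
    """chips: list of equal-length intensity lists, one list per chip."""
    n = len(chips[0])
    # Step 1: sort the intensities in each chip.
    sorted_chips = [sorted(chip) for chip in chips]
    # Step 2: mean intensity at each rank across the chips (the "average chip").
    rank_means = [sum(sc[r] for sc in sorted_chips) / len(chips) for r in range(n)]
    # Step 3: replace each intensity by the mean intensity at its rank.
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy data: three chips, four probes each (made-up numbers).
chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 6.0, 2.0],
    [3.0, 4.0, 8.0, 5.0],
]
normalized = quantile_normalize(chips)
for chip in normalized:
    print(chip)
```

After normalization every chip contains exactly the same set of values, so the entire intensity distribution is equalized, not only its mean or median.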
Slide 29: Normalization - tools
- Bioconductor (both AFFY and cDNA)
- Packages in R language
- dChip (Affymetrix)
- Quantile, Invariant set
- Expander (Affy)
- Lowess
- Quantile
Slide 30: Acknowledgements
- Figures in this presentation were taken in part from presentations by:
- Henrik Bengtsson, Terry Speed
- Yee Yang, Terry Speed
- Guilherme J. M. Rosa
- Laurent Gautier, Rafael Irizarry, Leslie Cope, and Ben Bolstad
Slide 31: 2. Identification of Differential Genes
Slide 32: Identification of differential genes
- The most basic experimental design: comparison between 2 conditions, treatment vs control
- The goal: to identify genes that are differentially expressed in the examined conditions
- Number of replicates is usually low (n = 2-4)
Slide 33: 1. Fold Change
- Consider genes whose mean expression level changed by at least 1.75-2 fold as differential genes
- Limits
- Usually no estimate of the false positive rate is provided
- Biased toward genes with low expression levels
- Ignores the variability of gene levels over replicates
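A minimal sketch of fold-change selection, assuming intensities on the raw (non-log) scale and a symmetric cut-off that catches both up- and down-regulation. The gene names, data, and function names are made up for illustration.

```python
from statistics import mean


def fold_change(control, treatment):
    """Ratio of mean treatment level to mean control level."""
    return mean(treatment) / mean(control)


def is_differential(control, treatment, threshold=2.0):
    """True if the mean level changed at least `threshold`-fold in either direction."""
    fc = fold_change(control, treatment)
    return fc >= threshold or fc <= 1.0 / threshold


# Replicate intensities per gene: (control, treatment).
genes = {
    "gene_a": ([100.0, 110.0, 90.0], [210.0, 190.0, 230.0]),  # consistent increase
    "gene_b": ([100.0, 105.0, 95.0], [120.0, 110.0, 100.0]),  # small change
    "gene_c": ([10.0, 40.0, 10.0], [5.0, 80.0, 50.0]),        # low level, noisy replicates
}

called = [name for name, (c, t) in genes.items() if is_differential(c, t)]
print(called)
```

In this toy data `gene_c` passes the cut-off even though its replicates disagree strongly, which illustrates the last limit listed above.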
Slide 34: Fold Change limit: ignores variability over replicates
- Seek a score that penalizes genes with high variability over replicates
Slide 35: 2. T-test
- Compute a t-score for each gene
- m_c, m_t: mean levels in Control and Treatment
- S_c^2, S_t^2: variance estimates in Control and Treatment
- n_c, n_t: number of replicates in Control and Treatment
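Assuming the t-score takes the usual unequal-variance (Welch) form built from the quantities defined above, t = (m_t - m_c) / sqrt(S_c^2/n_c + S_t^2/n_t), it can be computed as in the sketch below. The exact form and the sign convention (treatment minus control) are assumptions, and the data are made up.

```python
import math
from statistics import mean, variance


def t_score(control, treatment):
    """Welch-style t-score: difference of means over its estimated standard error."""
    mc, mt = mean(control), mean(treatment)
    sc2, st2 = variance(control), variance(treatment)  # sample variances
    nc, nt = len(control), len(treatment)
    return (mt - mc) / math.sqrt(sc2 / nc + st2 / nt)


# Two genes with a similar fold change but very different replicate variability.
consistent = t_score([100.0, 110.0, 90.0], [210.0, 190.0, 230.0])
noisy = t_score([10.0, 40.0, 10.0], [5.0, 80.0, 50.0])
print(round(consistent, 2), round(noisy, 2))
```

Both toy genes change roughly two-fold, but the gene with noisy replicates receives a much smaller t-score, which is the penalty for variability that fold change lacks.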
Slide 36: T-test
- t-scores can be associated with a p-value (under the assumption that expression levels follow a normal distribution)
- Log-transformation
- Set a cut-off for the p-value (α = 0.01)
- Consider all genes with p-value < α as differential genes
Slide 37: Multiple Testing
- P-val_g, associated with the t-score T_g, is the probability of obtaining by chance a t-score that is at least as extreme as T_g
- Multiplicity problem: thousands of genes are tested simultaneously
- E.g., suppose:
- 10,000 genes on a chip
- Not a single one is differentially expressed
- α = 0.01
- 10,000 × 0.01 = 100 genes are expected to have a p-value < 0.01 just by chance
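The arithmetic of this example can be checked with a small simulation. It assumes that, when no gene is differentially expressed, p-values are uniformly distributed between 0 and 1; the fixed seed is only there to make the run repeatable.

```python
import random

n_genes = 10_000
alpha = 0.01

# Expected number of genes passing the cut-off when none is truly differential.
expected_false_positives = n_genes * alpha

# Simulate one such chip: every p-value is drawn uniformly from [0, 1).
rng = random.Random(0)
p_values = [rng.random() for _ in range(n_genes)]
observed_false_positives = sum(p < alpha for p in p_values)

print(expected_false_positives, observed_false_positives)
```

The observed count fluctuates around the expected value from run to run, but every gene it contains is a false positive.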
Slide 38: Multiple testing
- Need to adjust for multiple testing when assessing the statistical significance of findings
- Corrections
- Bonferroni (e.g., α = 0.01, N = 10,000: cut-off = 0.000001)
- False Discovery Rate (FDR)
- In high-throughput studies, a certain proportion of false positives is tolerable
- Control the expected proportion of false positives among the genes identified as differential (q = 10%)
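Both corrections can be sketched in a few lines. The slide names FDR without specifying a procedure; the Benjamini-Hochberg step-up procedure is assumed here as one common way to control it. Function names and p-values are made up for illustration.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cut-off that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q):
    """Indices of tests declared significant at FDR level q (step-up procedure)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0  # largest 1-based rank k whose p-value satisfies p_(k) <= (k / n) * q
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * q:
            k_max = rank
    return sorted(order[:k_max])


print(bonferroni_cutoff(0.01, 10_000))  # the cut-off from the Bonferroni example

# Ten toy p-values, compared at the same nominal level of 10%.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
cutoff = bonferroni_cutoff(0.10, len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= cutoff]
fdr_hits = benjamini_hochberg(p_values, q=0.10)
print(bonferroni_hits, fdr_hits)
```

On this toy input the FDR procedure accepts more tests than Bonferroni, reflecting the trade-off of tolerating a controlled proportion of false positives.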
Slide 39: Differential Genes - Tools
- Cyber-T
- SAM (Significance Analysis of Microarrays)