Title: The second-simplest cDNA microarray data analysis problem

1
The second-simplest cDNA microarray data
analysis problem

Terry Speed, UC Berkeley
Bioinformatic Strategies For Application of
Genomic Tools to Environmental Health Research,
March 5, 2001
NIEHS National Center for Toxicogenomics NCSU
Bioinformatics Research Center

2
Biological question: differentially expressed
genes, sample class prediction, etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
3
Some motherhood statements

Important aspects of a statistical analysis
include:
Tentatively separating systematic from random
sources of variation
Removing the former and quantifying the latter,
when the system is in control
Identifying and dealing with the most relevant
source of variation in subsequent analyses
Only if this is done can we hope to make more or
less valid probability statements

4
The simplest cDNA microarray data analysis
problem is identifying differentially expressed
genes using one slide

This is a common enough hope
Efforts are frequently successful
It is not hard to do by eye
The problem is probably beyond formal statistical
inference (valid p-values, etc.)
for the foreseeable future, and here's why

5
An M vs. A plot
$M = \log_2(R / G)$, $A = \log_2(RG) / 2$
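A minimal sketch of how M and A could be computed for each spot, assuming background-corrected, strictly positive red and green intensities. The function name `ma_values` and the sample intensities are hypothetical and are not taken from the talk.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take logs")
    m = math.log2(red / green)        # log-ratio: M = log2(R / G)
    a = 0.5 * math.log2(red * green)  # mean log-intensity: A = log2(RG) / 2
    return m, a

# Hypothetical (R, G) intensities for three spots; not data from the talk.
spots = [(1024.0, 256.0), (500.0, 500.0), (128.0, 512.0)]
ma = [ma_values(r, g) for r, g in spots]
for (r, g), (m, a) in zip(spots, ma):
    print(f"R={r:7.1f} G={g:7.1f} M={m:+.2f} A={a:.2f}")
```

Plotting M against A for all spots gives the M vs. A plot named on this slide.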
6
Background matters
From Spot
From GenePix
7
No background correction
With background correction
From the NCI60 data set (Stanford web site)
8
An experiment having within-slide replicates
9
Background makes a difference
Background method    Segmentation method   Exp1   Exp2
No background        S.nbg                    6      6
                     Gp.nbg                   7      6
                     SA.nbg                   6      6
                     QA.fix.nbg               7      6
                     QA.hist.nbg              7      6
                     QA.adp.nbg              14     14
Local surrounding    S.valley                17     21
                     GP                      11     11
                     SA                      12     14
                     QA.fix                  18     23
                     QA.hist                  9      8
                     QA.adp                  27     26
Others               S.morph                  9      9
                     S.const                 14     14
Medians of the SD of log2(R/G) for 8 replicated
spots multiplied by 100 and rounded to the
nearest integer.
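One reading of the caption, sketched under assumptions: for each gene take the SD of log2(R/G) over its replicated spots, take the median of those SDs over genes, multiply by 100 and round. The function name `replicate_sd_score` and the data are hypothetical; the slide uses 8 replicated spots per gene, while the toy data below use fewer for brevity.

```python
import statistics

def replicate_sd_score(log_ratios_by_gene):
    """Median over genes of the SD of replicate log2(R/G), times 100, rounded."""
    sds = [statistics.stdev(reps) for reps in log_ratios_by_gene]
    return round(100 * statistics.median(sds))

# Hypothetical replicate log-ratios for three genes (not data from the talk).
log_ratios = [
    [0.0, 0.2],
    [1.0, 1.0, 1.0],
    [0.5, 0.7, 0.9],
]
score = replicate_sd_score(log_ratios)
print(score)
```

A smaller score means the replicated spots agree more closely under that image analysis method.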
10
Normalisation - lowess

Global lowess (Matt Callow's data, LBNL)
Assumption: changes roughly symmetric at all
intensities.
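A rough sketch of intensity-dependent normalisation: fit a smooth curve to M as a function of A and subtract it. The smoother below is a simplified stand-in for lowess (local linear fit with tricube weights, no robustness iterations), written only for illustration; the name `local_linear_fit`, the `frac` default and the data are assumptions, not the software used in the talk.

```python
import math

def local_linear_fit(x, y, frac=0.5):
    """Tricube-weighted local linear fit evaluated at every x (simplified lowess)."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # number of points in each window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12            # window half-width
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical spots whose M values carry only an intensity-dependent bias.
A = [6.0 + i for i in range(10)]
M = [0.5 + 0.1 * a for a in A]
curve = local_linear_fit(A, M)
M_norm = [m - c for m, c in zip(M, curve)]  # normalised log-ratios
```

Because the toy bias is exactly linear in A, the normalised values come out at essentially zero; real data would leave the differential expression signal around the fitted curve.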

11
From the NCI60 data set (Stanford web site)
12
Ngai lab, UCB
13
Tiago's data from the Goodman lab, UCB
14
From the Ernest Gallo Clinic Research Center
15
From Peter MacCallum Cancer Research Institute,
Australia
16
Normalisation - print tip
Assumption: for every print group, changes
roughly symmetric at all intensities.
17
M vs A after print-tip normalisation
18
Normalization (ctd): another data set
[Plot: log-ratios by print-tip group]

After within-slide global lowess normalization.
Likely to be a spatial effect.

19
Taking scale into account

Assumption:
All print-tip groups have the same spread in M.
True log ratio is $\mu_{ij}$, where $i$ indexes
print-tip groups and $j$ indexes spots.
Observed is $M_{ij}$, where
$M_{ij} = a_i \mu_{ij}$
Robust estimate of $a_i$ is
$\mathrm{MAD}_i = \mathrm{median}_j \{ |M_{ij} - \mathrm{median}_j(M_{ij})| \}$
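A sketch of scale normalisation based on the MAD, under one loudly stated assumption: each group's scale factor is taken as its MAD divided by the geometric mean of all the MADs, so that the groups end up with a common spread. The slide only gives the MAD itself; the choice of the geometric mean, the function names and the data are illustrative.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def scale_normalise(groups):
    """Divide each group's log-ratios by its estimated scale factor a_i."""
    mads = [mad(g) for g in groups]  # all MADs must be > 0
    # Assumption: a_i = MAD_i / geometric mean of the MADs.
    geo_mean = math.exp(sum(math.log(m) for m in mads) / len(mads))
    factors = [m / geo_mean for m in mads]
    scaled = [[v / a for v in g] for g, a in zip(groups, factors)]
    return scaled, factors

# Hypothetical log-ratios for two groups (print-tip groups or slides).
groups = [
    [-2.0, -1.0, 0.0, 1.0, 2.0],
    [-8.0, -4.0, 0.0, 4.0, 8.0],
]
scaled, factors = scale_normalise(groups)
```

After rescaling, both toy groups have the same MAD, which is the property the assumption on this slide asks for.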

20
Normalization (ctd): that same data set
[Plot: log-ratios by print-tip group]

After print-tip location and scale normalization.
Incorporate quality measures.

21
Matt Callow's Srb1 dataset (5). Newton's and
Chen's single-slide methods
22
Matt Callow's Srb1
dataset (8). Newton's, Sapir and Churchill's, and
Chen's single-slide methods
23
The approach of Roberts et al (Rosetta)
Genomic DNA vs. Genomic DNA
Data from Bing Ren
24
The second simplest cDNA microarray data analysis
problem is identifying differentially expressed
genes using replicated slides

There are a number of different aspects:
First, between-slide normalization; then
What should we look at: averages, SDs,
t-statistics, other summaries?
How should we look at them?
Can we make valid probability statements?
A report on work in progress

25
Normalization (ctd): yet another data set

Between slides this time (10 here)
Only small differences in spread apparent
We often see much greater differences

[Plot: log-ratios by slide]
26
The NCI 60 experiments (no bg)
27
Taking scale into account

Assumption: all slides have the same spread
in M.
True log ratio is $\mu_{ij}$, where $i$ indexes
slides and $j$ indexes spots.
Observed is $M_{ij}$, where
$M_{ij} = a_i \mu_{ij}$
Robust estimate of $a_i$ is
$\mathrm{MAD}_i = \mathrm{median}_j \{ |M_{ij} - \mathrm{median}_j(M_{ij})| \}$

28
Which genes are (relatively) up/down regulated?

Two samples.
e.g. KO vs. WT or mutant vs. WT

[Design diagram: T and C, n replicates]
For each gene form the t statistic:
$t = \dfrac{\text{average of } n \text{ trt } M\text{s}}{\sqrt{\tfrac{1}{n}\,(\text{SD of } n \text{ trt } M\text{s})^2}}$
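A small sketch of this per-gene t statistic, assuming the n values are the M = log2(R/G) measurements for one gene. The function name `one_sample_t` and the numbers are hypothetical.

```python
import math
import statistics

def one_sample_t(ms):
    """Average of the Ms divided by sqrt(SD^2 / n)."""
    n = len(ms)
    return statistics.mean(ms) / math.sqrt(statistics.variance(ms) / n)

# Hypothetical log-ratios for one gene on three replicate slides.
ms = [1.0, 2.0, 3.0]
t = one_sample_t(ms)
print(round(t, 3))
```

Note that a gene whose replicates happen to agree very closely gets a very large t even if its average M is small.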
29
Which genes are (relatively) up/down regulated?

Two samples with a reference (e.g. pooled control)

[Design diagram: T and C, each hybridised against the reference, n replicates each]

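For each gene form the t statistic:
$t = \dfrac{\text{average of } n \text{ trt } M\text{s} - \text{average of } n \text{ ctl } M\text{s}}{\sqrt{\tfrac{1}{n}\left[(\text{SD of } n \text{ trt } M\text{s})^2 + (\text{SD of } n \text{ ctl } M\text{s})^2\right]}}$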
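A sketch of the two-sample version for one gene. It is written with separate group sizes, which reduces to the form on this slide when both groups have n slides. The function name `two_sample_t` and the values are hypothetical.

```python
import math
import statistics

def two_sample_t(trt, ctl):
    """Difference of group averages over sqrt(SD_trt^2 / n_trt + SD_ctl^2 / n_ctl)."""
    se = math.sqrt(statistics.variance(trt) / len(trt)
                   + statistics.variance(ctl) / len(ctl))
    return (statistics.mean(trt) - statistics.mean(ctl)) / se

# Hypothetical log-ratios against the reference for one gene.
trt = [1.0, 2.0, 3.0]
ctl = [0.0, 0.5, 1.0]
t2 = two_sample_t(trt, ctl)
print(round(t2, 3))
```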

30
One factor: more than 2 samples
[Design diagram: T1, T2, T3, T4, each against C, x 2 each]

Samples: liver tissue from mice treated with
cholesterol-modifying drugs.
Question 1: find genes that respond differently
between the treatment and the control.
Question 2: find genes that respond similarly
across two or more treatments relative to control.

31
One factor: more than 2 samples
[Design diagram: T1, T2, T3, T4, T5, T6]

Samples: tissues from different regions of the
mouse olfactory bulb.
Question 1: differences between different
regions.
Question 2: identify genes with pre-specified
patterns across regions.

32
Two or more factors

6 different experiments at each time point.
Dyeswaps.
4 time points (30 minutes, 1 hour, 4 hours, 24
hours)
2 x 2 x 4 factorial experiment.

[Design diagram: ctl, OSM, EGF, OSM + EGF; x 4 times]
33
Which genes have changed? When permutation
testing is possible

1. For each gene and each hybridisation (8 ko + 8
ctl), use $M = \log_2(R/G)$.
2. For each gene form the t statistic:
$t = \dfrac{\text{average of 8 ko } M\text{s} - \text{average of 8 ctl } M\text{s}}{\sqrt{\tfrac{1}{8}\left[(\text{SD of 8 ko } M\text{s})^2 + (\text{SD of 8 ctl } M\text{s})^2\right]}}$
3. Form a histogram of 6,000 t values.
4. Do a normal Q-Q plot; look for values off the
line.
5. Permutation testing.
6. Adjust for multiple testing.
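Steps 2, 5 and 6 can be sketched as follows, under stated assumptions. The permutation p-value enumerates every way of relabelling the slides, and the multiplicity adjustment shown is plain Bonferroni, used here only as a simple stand-in because the slide does not name the adjustment method. All names and data are hypothetical, and 3 vs. 3 slides replace the 8 vs. 8 of the experiment to keep the enumeration tiny.

```python
import itertools
import math
import statistics

def t_stat(trt, ctl):
    se = math.sqrt(statistics.variance(trt) / len(trt)
                   + statistics.variance(ctl) / len(ctl))
    return (statistics.mean(trt) - statistics.mean(ctl)) / se

def permutation_p(values, n_trt):
    """Exact two-sided permutation p-value; the first n_trt values are 'ko'."""
    observed = abs(t_stat(values[:n_trt], values[n_trt:]))
    idx = range(len(values))
    count = total = 0
    for chosen in itertools.combinations(idx, n_trt):
        chosen_set = set(chosen)
        trt = [values[i] for i in chosen]
        ctl = [values[i] for i in idx if i not in chosen_set]
        total += 1
        if abs(t_stat(trt, ctl)) >= observed - 1e-12:
            count += 1
    return count / total

# Hypothetical M values: 3 ko slides followed by 3 ctl slides, two genes.
genes = [
    [2.1, 2.4, 1.9, 0.1, -0.2, 0.3],   # clearly shifted gene
    [0.3, -0.1, 0.2, 0.1, -0.2, 0.4],  # gene with no obvious change
]
p_values = [permutation_p(g, 3) for g in genes]
# Assumption: Bonferroni adjustment (multiply by the number of genes, cap at 1).
adjusted = [min(1.0, p * len(p_values)) for p in p_values]
```

With so few slides the smallest attainable p-value is 2/20, which illustrates why permutation testing needs a reasonable number of replicates before any gene can survive adjustment over thousands of genes.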

34
Histogram and Q-Q plot
ApoA1
35
Apo A1: adjusted and unadjusted p-values for the
50 genes with the largest absolute t-statistics.
36
Which genes have changed? Permutation testing not
possible

Our current approach is to use averages, SDs,
t-statistics and a new statistic we call B,
inspired by empirical Bayes.
We hope in due course to calibrate B and use that
as our main tool.
We begin with the motivation, using data from a
study in which each slide was replicated four
times.

37
Results from 4 replicates
38
B = LOR compared
39

[Plot legend: M, t, t ∩ M]

Results from the Apo AI ko experiment
40

[Plot legend: M, t, t ∩ M]

Results from the Apo AI ko experiment
41
Empirical Bayes log posterior odds ratio
42

[Plot legend: M, B, t, M ∩ B, t ∩ B, t ∩ M ∩ B]

Results from SR-BI transgenic experiment
43

[Plot legend: M, B, t, M ∩ B, t ∩ B, t ∩ M ∩ B]

Results from SR-BI transgenic experiment
44
Extensions include dealing with:

Replicates within and between slides
Several effects: use a linear model
ANOVA: are the effects equal?
Time series: selecting genes for trends

45
Rosetta once more: in vivo binding sites of
Gal4p in galactose
P < 0.001
Un-enriched DNA (Cy3)
antibody-enriched DNA (Cy5)
46
Summary (for the second simplest problem)

Microarray experiments typically have thousands
of genes, but only a few (1-10) replicates for each
gene.
Averages can be driven by outliers.
Ts can be driven by tiny variances.
B = LOR will, we hope:
use information from all the genes
combine the best of M. and T
avoid the problems of M. and T

47
Acknowledgments

UCB/WEHI
Yee Hwa Yang
Sandrine Dudoit
Ingrid Lönnstedt
Natalie Thorne
David Freedman
CSIRO Image Analysis Group
Michael Buckley
Ryan Lagerstrom

Ngai lab, UCB
Goodman lab, UCB
Peter Mac CI, Melb.
Ernest Gallo CRC
Brown-Botstein lab
Matt Callow (LBNL)
Bing Ren (WI)

Some web sites
Technical reports, talks, software etc.
http://www.stat.berkeley.edu/users/terry/zarray/Html/
Statistical software R ("GNU's S")
http://lib.stat.cmu.edu/R/CRAN/
Packages within R environment
-- Spot http://www.cmis.csiro.au/iap/spot.htm
-- SMA (statistics for microarray analysis)
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html

49
Factorial Design
[Design diagram: labels A1, A4, P01, P04; numbers 1 to 5; age effect and zone effect marked]
50
Factorial design
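[Design diagram: labels A1, A4, P01, P04; numbers 1 to 5; means $\mu$, $\mu + a$, $\mu + z$, $\mu + z + a + za$]
Different ways of estimating parameters, e.g. the Z effect:
$1 = (\mu + z) - (\mu) = z$
$2 - 5 = ((\mu + a) - (\mu)) - ((\mu + a) - (\mu + z)) = (a) - (a - z) = z$
4 3 - 5 z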
How do we combine the information?
51
Regression analysis
Define a matrix $X$ so that $E(M) = X\beta$. Use the least
squares estimate for z, a, za.
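A sketch of the least squares step for one gene, solving the normal equations $(X^T X)\beta = X^T M$ by Gaussian elimination. The design matrix below is hypothetical: its rows are illustrative contrasts in (z, a, za) for five hybridisations and are not claimed to be the design on the slides. The function name `least_squares` and the numbers are also assumptions.

```python
def least_squares(X, y):
    """Solve (X'X) b = X'y by Gauss-Jordan elimination with partial pivoting."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Hypothetical design: columns are the coefficients of (z, a, za)
# in the expected log-ratio of each of five hybridisations.
X = [
    [1, 0, 0],    # z
    [0, 1, 0],    # a
    [0, 1, 1],    # a + za
    [1, 0, 1],    # z + za
    [-1, 1, 0],   # a - z
]
beta_true = [1.0, -0.5, 0.25]  # made-up effects for one gene
M = [sum(x * b for x, b in zip(row, beta_true)) for row in X]
beta_hat = least_squares(X, M)
```

With noise-free toy data the estimates reproduce the made-up effects; with real log-ratios the same calculation pools all hybridisations that carry information about each effect, which is one answer to the question of how to combine the information.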
52
Looking at the effect of Z: log(zone 4 / zone 1)
gene A
gene B
53
Estimate
Z effect