Title: Canadian Bioinformatics Workshops
3Lecture 8: Microarrays II - Data Analysis
MBP1010 Dr. Paul C. Boutros Winter 2014
Aegeus, King of Athens, consulting the Delphic
Oracle. High Classical (430 BCE)
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally
developed by Drs. Raphael Gottardo, Sohrab
Shah, Boris Steipe and others
4Course Overview
- Lecture 1: What is Statistics? Introduction to R
- Lecture 2: Univariate Analyses I (continuous)
- Lecture 3: Univariate Analyses II (discrete)
- Lecture 4: Multivariate Analyses I (specialized models)
- Lecture 5: Multivariate Analyses II (general models)
- Lecture 6: Sequence Analysis
- Lecture 7: Microarray Analysis I (Pre-Processing)
- Lecture 8: Microarray Analysis II (Multiple-Testing)
- Lecture 9: Machine-Learning
- Final Exam (written)
5House Rules
- Cell phones to silent
- No side conversations
- Hands up for questions
6Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
7Example 1
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of 20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: tumours in mice lacking TS will be smaller than those in mice with amplification of OG, as assessed by post-mortem volume measurements of the primary tumour. Your data:
OG (cm3) 5.2 1.9 5.0 6.1 4.5 4.8
TS (cm3) 3.9 7.1 3.1 4.4 5.0
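As a worked sketch, this is how the comparison could be run in R once you have decided which test is appropriate (deciding that is the point of the example; the one-sided alternative here follows the stated hypothesis):

```r
# Tumour volumes from the slide, in cm^3
og <- c(5.2, 1.9, 5.0, 6.1, 4.5, 4.8)
ts <- c(3.9, 7.1, 3.1, 4.4, 5.0)

# Welch two-sample t-test; one-sided, per the hypothesis that
# TS tumours are smaller than OG tumours
res <- t.test(ts, og, alternative = "less")
res$p.value
```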
8Example 2
You are conducting a study of osteosarcomas using mouse models. You are studying transgenic animals with deletion of a tumour suppressor (TS), or with amplification of an oncogene (OG). You consider the penetrance of tumours in a set of 8 different mouse strains. Your hypothesis: some mouse strains lead to bigger tumours than others when OG is amplified, considering only animals in which tumours form. You measure tumour volume in mm3 using calipers.
Strain 1 (mm3) 91 69 83
Strain 2 (mm3) 201 70 71
Strain 3 (mm3) 15 36 20
Strain 4 (mm3) 52 52 53
Strain 5 (weeks) 11 538 59
Strain 6 (mm3) 6 60 63
Strain 7 (mm3) 85 79 70
Strain 8 (mm3) 100 105 121
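With more than two groups, a one-way comparison is needed; here is a non-parametric sketch in R (note that Strain 5 is labelled in weeks on the slide, which is exactly the kind of unit inconsistency to resolve before any analysis):

```r
# Tumour volumes from the slide (Strain 5's unit label needs checking first)
vol <- c(91, 69, 83,   201, 70, 71,   15, 36, 20,   52, 52, 53,
         11, 538, 59,  6, 60, 63,     85, 79, 70,   100, 105, 121)
strain <- factor(rep(1:8, each = 3))

# Kruskal-Wallis rank-sum test: do the strain distributions differ?
kw <- kruskal.test(vol ~ strain)
kw
```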
9Example 3
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of 20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS are less likely to respond to a novel targeted therapeutic (DX) than wildtype animals, as assessed by molecular imaging. Your data:
TS (imaging response) Yes No Yes Yes No
WT (imaging response) Yes Yes Yes Yes No Yes
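For binary response counts like these, a sketch using Fisher's exact test in R (the 2x2 table is tallied from the Yes/No calls on the slide):

```r
# TS: 3 Yes, 2 No; WT: 5 Yes, 1 No (tallied from the slide)
tab <- matrix(c(3, 2,
                5, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(line = c("TS", "WT"),
                              response = c("Yes", "No")))

# One-sided Fisher's exact test: are TS mice less likely to respond?
ft <- fisher.test(tab, alternative = "less")
ft
```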
10Example 4
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of 20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Based on your previous data, you now hypothesize that mice lacking TS will show a similar molecular response to DX as those with amplification of OG. You use microarrays to study 20,000 genes in each line, and identify the following genes as changed between drug-treated and vehicle-treated:
OG (DX-responsive genes): MYC KRAS CD53 CDH1 MUC1 MARCH1 PTEN IDH3 ESR2 RHEB CTCF STK11 MLL3 KEAP1 NFE2L2 ARID1A
TS (DX-responsive genes): MYC KRAS CD53 CDH1 FBW1 SEPT7 MUC1 MUC3 MUC9 RNF3
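Overlap between two gene lists can be assessed with a hypergeometric test; a sketch in R using the 20,000-gene background from the slide:

```r
og <- c("MYC", "KRAS", "CD53", "CDH1", "MUC1", "MARCH1", "PTEN", "IDH3",
        "ESR2", "RHEB", "CTCF", "STK11", "MLL3", "KEAP1", "NFE2L2", "ARID1A")
ts <- c("MYC", "KRAS", "CD53", "CDH1", "FBW1", "SEPT7", "MUC1", "MUC3",
        "MUC9", "RNF3")

k <- length(intersect(og, ts))   # observed overlap: 5 genes
# P(overlap >= k) if the TS list were drawn at random from 20,000 genes
p <- phyper(k - 1, length(og), 20000 - length(og), length(ts),
            lower.tail = FALSE)
p
```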
11Example 5
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice naturally susceptible to these tumours at 20% penetrance. You are studying two transgenic lines, one with deletion of a tumour suppressor (TS), the other with amplification of an oncogene (OG). Tumour penetrance in these is 100%. Your hypothesis: tumour size differs between lines, but you now wonder whether tumour size also varies with the age of the animal, so the comparison may be confounded by age differences. Your data:
OG (cm3) 5.2 (17 weeks) 1.9 (9 weeks) 5.0 (15
weeks) 6.1 (15 weeks) 4.5 (21 weeks) 4.8 (20
weeks)
Wildtype (cm3) 1.1 (9 weeks) 1.5 (10 weeks) 2.1
(15 weeks) 2.5 (15 weeks) 0.3 (17 weeks) 2.2 (21
weeks)
TS (cm3) 3.9 (17 weeks) 7.1 (15 weeks) 3.1 (15
weeks) 4.4 (22 weeks) 5.0 (22 weeks)
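When a covariate like age may confound the comparison, it can be added to a linear model; a sketch in R with the slide's data:

```r
volume <- c(5.2, 1.9, 5.0, 6.1, 4.5, 4.8,    # OG
            1.1, 1.5, 2.1, 2.5, 0.3, 2.2,    # Wildtype
            3.9, 7.1, 3.1, 4.4, 5.0)         # TS
age    <- c(17, 9, 15, 15, 21, 20,
            9, 10, 15, 15, 17, 21,
            17, 15, 15, 22, 22)
line   <- factor(rep(c("OG", "WT", "TS"), times = c(6, 6, 5)))

# Does line still matter after adjusting for age?
fit <- lm(volume ~ age + line)
anova(fit)
```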
12Example 6
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of 20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS will acquire tumours sooner than wildtype mice. You test the mice weekly using ultrasound imaging. Your data:
TS (week of tumour) 4 7 7 6 5
OG (week of tumour) 3 9 3 2 4 3
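Week-of-onset is time-to-event data, so survival methods (e.g. a log-rank test) would be the natural fit; as a simpler sketch with the two rows given on the slide, a rank-based comparison in R:

```r
ts <- c(4, 7, 7, 6, 5)      # week of tumour detection, TS line
og <- c(3, 9, 3, 2, 4, 3)   # week of tumour detection, OG line

# Rank-based two-sample comparison (ties present, so no exact p-value)
wt <- wilcox.test(ts, og, exact = FALSE)
wt$p.value
```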
13Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
14Summary Point 1: Microarray data is analyzed with a pipeline of sequential algorithms. This pipeline defines the standard workflow for microarray experiments.
15Quantitation
16Summary Point 2: This is an active research area.
17Summary Point 3: These basic steps hold true for all microarray platforms and types.
18What Is BioConductor?
"Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data." - BioConductor website
The vast majority of our analyses will use BioConductor code, but there are clearly non-BioConductor approaches.
19I've outlined the general workflow. Each technology and application has its own unique characteristics to consider.
20Let's Define an Affymetrix-Specific Workflow
21Quantitation is done according to Affymetrix
defaults with minimal user intervention.
Quantitation
One-Channel array
Single-Channel array, so one simultaneous
normalization procedure
Typically ignored
22Let's Collapse This a Bit And Re-Phrase Things
23.CEL Files
24First, let's go back to Pre-Processing
What exactly is pre-processing (aka
normalization)?
Why do we do it?
25Sources of Technical Noise
- Where does technical noise come from?
26More Sources of Technical Noise
27Any step in the experimental pipeline can
introduce artifactual noise
- Array design
- Array manufacturing
- Sample quality
- Sample identity (sequence effects?)
- Sample processing
- Hybridization conditions (ozone?)
- Scanner settings
Pre-Processing tries to remove these systematic
effects
28Important Note
Pre-processing is never a substitute for good
experimental design. This is not a course on
statistical design, but a few basic principles
should be mentioned.
Biological replicates are preferable to technical
replicates.
Always try to balance experimental groups.
If processing samples identically is not possible, include controls for processing effects.
29Pre-Processing
What exactly is pre-processing (aka
normalization)?
Why do we do it?
30Sources of Technical Noise
- Where does technical noise come from?
31More Sources of Technical Noise
32Any step in the experimental pipeline can
introduce artifactual noise
- Array design
- Array manufacturing
- Sample quality
- Sample identity (sequence effects?)
- Sample processing
- Hybridization conditions (ozone?)
- Scanner settings
Pre-Processing tries to remove these systematic
effects
33Affymetrix Pre-Processing Steps
- Background Correction
- Normalization
- Probe-Specific Adjustment
- Summarizing multiple Probes into a single ProbeSet
Let's look at two common approaches:
34Introducing Two Major Affymetrix Pre-Processing Methods
- The two most commonly used methods are:
- RMA: Robust Multi-array Average
- MAS5: Microarray Analysis Suite version 5
- MAS5 has strengths & weaknesses:
- Sacrifices precision for accuracy
- Can easily be used in clinical settings
- RMA has strengths & weaknesses:
- Sacrifices accuracy for precision
- Challenging to integrate multiple studies
- Reduces variance (critical for small-n studies)
- Both are well accepted by journals and reviewers, perhaps RMA a bit more so. We'll talk about some of the mathematics later on in this course.
35Approach 1: MAS5
- Affymetrix put significant effort into developing good data pre-processing approaches
- MAS5 was an attempt to develop a standard technique for 3' expression arrays
- The flaws of MAS5 led to an influx of research in this area
- The algorithm is best described in an Affymetrix white-paper, and is actually quite challenging to reproduce exactly in R
36MAS5 Model
Observations = True Signal + Random Noise + Probe Effects
Assumptions?
37(No Transcript)
38(No Transcript)
39(No Transcript)
40(No Transcript)
41What is RMA?
RMA = Robust Multi-array Average
Why do we use a "robust" method? Robust summaries really improve over the standard ones by down-weighting outliers and leaving their effects visible in residuals.
Why do we use "multi-array"? To put each chip's values in the context of a set of similar values.
42What is RMA?
It is a log scale linear additive model
Assumes all the chips have the same background
distribution
Does not use the mismatch probe (MM) data from
the microarray experiments
Why?
43What is RMA?
Mismatch probes (MM) definitely have information - about both signal and noise - but using it without adding more noise is a challenge.
We should be able to improve the background correction using MM without having the noise level blow up; this is a topic of current research (GCRMA).
Ignoring MM decreases accuracy but increases precision.
44Methodology
Quantile Normalization: the goal of this method is to make the distribution of probe intensities for each array in a set of arrays the same. This method is motivated by the idea that a Q-Q plot shows that the distribution of two data vectors is the same if the plot is a straight diagonal line, and not the same if it is anything else.
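The idea can be sketched in a few lines of base R: sort each array, average across arrays at each rank, then hand every array those common values back in its own original rank order (a sketch of the principle, not the production implementation in the affy/preprocessCore packages):

```r
quantile_normalize <- function(m) {
  # Rank each array's values (ties broken arbitrarily for simplicity)
  ranks <- apply(m, 2, rank, ties.method = "first")
  # Mean of the k-th smallest value across arrays = the reference distribution
  ref <- rowMeans(apply(m, 2, sort))
  # Give every array the reference values, in its own original rank order
  apply(ranks, 2, function(r) ref[r])
}

set.seed(1)
# Toy data: 3 arrays with very different intensity scales
x <- cbind(rexp(100, 1), rexp(100, 0.5), rexp(100, 0.25))
qn <- quantile_normalize(x)

# After normalization, all arrays share exactly the same set of values
identical(sort(qn[, 1]), sort(qn[, 2]))
```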
45Methodology
46Methodology
Summarization: combining the multiple probe intensities of each probeset to produce expression values. An additive linear model is fit to the normalized data to obtain an expression measure for each probeset on each GeneChip:
Yij = αj + βi + εij
47Methodology
Yij = αj + βi + εij
Yij denotes the background-corrected, normalized probe value corresponding to the ith GeneChip and the jth probe within the probeset:
Yij = log2(PM - BG)ij
αj is the probe affinity for the jth probe
βi is the chip effect for the ith GeneChip (log-scale expression level)
εij is the random error term
48Methodology
Yij = αj + βi + εij
- Estimate αj (probe affinity) and βi (chip effect) using a robust method
- Tukey's median polish (quick): fits iteratively, successively removing row and column medians, and accumulating the terms, until the process stabilizes. The residuals are what is left at the end.
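Base R ships Tukey's median polish as stats::medpolish(), so the summarization step can be sketched directly (toy numbers; real RMA works on background-corrected, quantile-normalized log2 PM values):

```r
set.seed(1)
# Toy probeset: 3 chips (rows) x 4 probes (columns), log2 scale
chip_effect <- c(8, 9, 10)          # true expression per chip
probe_affin <- c(-1, 0, 0.5, 1)     # true probe affinities
y <- outer(chip_effect, probe_affin, "+") + rnorm(12, sd = 0.1)

fit <- medpolish(y, trace.iter = FALSE)
# RMA-style expression measure for each chip: overall + chip (row) effect
expr <- fit$overall + fit$row
round(expr, 2)
```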
49RMA vs. MAS5
- RMA sacrifices accuracy for precision
- RMA is generally not appropriate for clinical settings
- RMA provides higher sensitivity/specificity in some tests
- RMA reduces variance (critical for small-n studies)
- RMA is better accepted by journals and reviewers
50Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
51One key detail has been omitted so far
How do we know if our pre-processing actually
worked?
52Can we determine how well our pre-processing worked? Or if our data looks good?
53Let's See Some Bad Data
54(No Transcript)
55(No Transcript)
56(No Transcript)
57Those Three Were From A Spike-In Experiment Done
by Affymetrix
58(No Transcript)
59(No Transcript)
60(No Transcript)
61Those Last Three Were From An Experiment We Did
On Rat Liver Samples
62Were Those Bad Samples?
- Lots of evident spatial artifacts
- But in practice all samples were carried forward into analysis
- And validation (RT-PCR) confirmed the overall study results for many genes
63Eye-ball Assessments Are Hard
- A couple of useful tricks:
- Look at the distributions
- Did quantile normalization work (for RMA)?
- Look at the inter-sample correlations
- Is one sample a strong outlier?
- Look at the 3' to 5' trend across a ProbeSet
I know of no accepted, systematic QA/QC methods
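The inter-sample correlation trick can be sketched in R: a degraded array still correlates with the others (genes keep their relative levels) but noticeably less well (toy data; real QC would use the normalized expression matrix):

```r
set.seed(1)
# Toy log2 expression: 1,000 genes with a shared gene-level signal
gene <- rnorm(1000, mean = 8, sd = 2)
good <- gene + matrix(rnorm(1000 * 4, sd = 0.5), ncol = 4)  # 4 clean arrays
bad  <- gene + rnorm(1000, sd = 3)                          # 1 noisy array
x <- cbind(good, bad)

cors <- cor(x)
# Mean correlation of each array with the other four
avg_cor <- (colSums(cors) - 1) / (ncol(x) - 1)
round(avg_cor, 2)   # the fifth array stands out as an outlier
```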
64Distributions (Raw)
65Distributions (normalized)
66Inter-Sample Correlations
673' to 5' Signal Trend
68What Do You Do If You Find a Bad Array?
- Repeat it?
- Drop the sample?
- Include it but account for the noise in another
way?
69In This Case
- We excluded a series of outlier samples
- We believed these samples had been badly degraded because they were derived from FFPE blocks
70Final Distribution
71Final Heatmap
72Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
73T-tests
- What are the assumptions of the t-test?
- When would you feel comfortable using a t-test?
74T-Test Alternative: Wilcoxon Rank-Sum
- Also called:
- U-test
- Mann-Whitney (U) test
- Some argue that for continuous microarray data there is rarely a good reason to use this test:
- Low n: tests of normality are not very powerful
- High n: the central limit theorem provides support
- If the sample is normal, asymptotic efficiency is 0.95
75T-Test Alternative: Moderated Statistics
- A series of highly complex methods based on Bayesian statistical methodologies
- Gordon Smyth's limma R package is by far the most widely used implementation of this technique
The gene-wise variance term is shrunk by borrowing power across all genes. This increases effective power.
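The core shrinkage step can be illustrated in a few lines (a simplified sketch of the idea, not limma's full machinery; limma also estimates the prior values d0 and s0^2 from the data rather than assuming them):

```r
set.seed(1)
# Per-gene sample variances for 1,000 genes, each with d = 4 df
d  <- 4
s2 <- rchisq(1000, df = d) / d

# Shrink each variance toward a common prior value s0^2
s0_sq <- mean(s2)   # crude stand-in for limma's estimated prior
d0    <- 4          # assumed prior degrees of freedom
s2_mod <- (d0 * s0_sq + d * s2) / (d0 + d)

# Moderated variances are pulled toward the prior: far less spread out
c(raw = var(s2), moderated = var(s2_mod))
```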
76T-Test Alternative: Permutation Tests
- SAM is the classic method
- Most people suggest not using SAM today
- Empirically estimate the null distribution:
Start with many samples, randomly sample, iterate
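The idea (build the null distribution by shuffling the group labels many times) can be sketched in base R with toy two-group data; SAM itself adds refinements on top of this:

```r
set.seed(1)
a <- c(5.2, 1.9, 5.0, 6.1, 4.5, 4.8)   # group 1
b <- c(3.9, 7.1, 3.1, 4.4, 5.0)        # group 2
obs <- mean(a) - mean(b)               # observed statistic

pooled <- c(a, b)
# Re-assign group labels at random many times to estimate the null
null_stats <- replicate(10000, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided permutation p-value
p_perm <- mean(abs(null_stats) >= abs(obs))
p_perm
```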
77Problems with Significance Testing
- What happens if there are NO changes?
- Imagine:
- You analyzed 1,000 clinical samples
- 20,000 genes in the genome
- p < 0.05
- What if somebody comes and randomizes all your data?
78You had a lot of Data
20,000 genes/array x 1,000 patients = 20,000,000 data points
All randomized: genes are mixed up together, patients are mixed together
What happens if you analyze this data?
There should be NO real hits anymore!
79What will you actually find?
Array: 20,000 genes
Threshold: p < 0.05
20,000 x 0.05 = 1,000 False Positives
This is called the multiple-testing problem.
There is a solution.
80A false-discovery rate adjustment (FDR) for multiple testing considers all 20,000 p-values simultaneously.
In this experiment there are lots of low p-values, so we can use this to adjust the p-values and find the true hits.
(Figure: p-value histogram; the low-p-value bins far exceed the expected uniform value)
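Both situations can be reproduced in base R with p.adjust() (Benjamini-Hochberg): under a pure null, the raw 0.05 threshold flags roughly 1,000 genes, while the FDR-adjusted values flag roughly none:

```r
set.seed(1)
p_null <- runif(20000)        # 20,000 genes, no real signal anywhere

n_raw <- sum(p_null < 0.05)   # raw threshold: ~1,000 false positives
q     <- p.adjust(p_null, method = "BH")
n_fdr <- sum(q < 0.05)        # after FDR adjustment: few, if any

c(raw = n_raw, fdr = n_fdr)
```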
81This is what you get from randomized data
In this experiment there is NO enrichment for low p-values, so no more hits than expected randomly.
82Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
83The Mask Production Makes Affymetrix Designs
Expensive To Change
Photolithographic mask
84But there are multiple probes per gene
85We Can Change Those Mappings!
Hybridized Chip
86CDF File
- Chip Definition File
- This file maps Probes (positions) into ProbeSets
- We can update those mappings:
- Ignore deprecated or cross-hybridizing probes
- Merge multiple probes that recognize the same gene
- Account for entirely new genes that were not known at the time of array-design
87Sequence Mappings Are Slow
- Requires aligning millions of 25 bp probes against the transcriptome and identifying the best match for each
- Fortunately, other groups have done this for us, and regularly update their mappings
88Many Probes Are Lost
89But There Is Also A Major Benefit
Increased validation rates using RT-PCR (10%)
Sandberg et al., BMC Bioinformatics, 2007
90Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
91What Are The Outputs of A Microarray Study?
These files can be 10s of GB for a typical Affy study
- Primary Data
- Raw image (.DAT file)
- Quantitation (.CEL file)
- Secondary Data
- Normalized data (usually an ASCII text file)
- QA/QC plots
- Tertiary Data
- Statistical analyses
- Global visualization (e.g. heatmaps)
- Downstream analyses (e.g. pathway,
dataset-integration)
92How Do You Organize These Data?
I recommend you put things on a fast, backed-up network drive:
/data/
Organize data by project:
/data/Project
Create separate directories for each analysis:
/data/Project/raw
/data/Project/QAQC
/data/Project/pre-processing
/data/Project/statistical
/data/Project/pathway
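The layout can be created programmatically so every project starts identically (a sketch using a temporary directory; "Project" is a placeholder name):

```r
# Standard per-project analysis directories (tempdir() stands in for /data)
root <- file.path(tempdir(), "data", "Project")
subdirs <- c("raw", "QAQC", "pre-processing", "statistical", "pathway")

for (d in subdirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
basename(list.dirs(root, recursive = FALSE))
```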
93How Do You Organize The Scripts?
I recommend you write a separate script for each analysis, and put those in a standardized (backed-up!) location, mirroring the directory structure and naming of your dataset directories. Some sub-structure here is often useful:
/scripts/Project/pre-processing.R
/scripts/Project/statistical-univariate.R
/scripts/Project/statistical-multivariate.R
/scripts/Project/pathway/GOMiner.R
/scripts/Project/pathway/Reactome.R
/scripts/Project/integration/mRNACNV.R
/scripts/Project/integration/public-data.R
94Why Many Small Scripts?
- Monolithic scripts are hard to maintain:
- Easier to make errors
- Accidentally re-using the same variable name
- Harder to debug
- Harder for somebody else to learn
- Small scripts are more flexible:
- Quicker to modify/re-run a small part of your analysis
- Easier to re-use the same code on another dataset
- This is akin to the unix mindset of systems design
95What To Save?
- Everything!!
- All QA/QC plots (common reviewer request)
- All pre-processed data (needed for GEO uploads)
- Gene-wise statistical analyses
- Not just the statistically-significant genes
- Collapse all analyses into one file, though
- All plots/etc
- Using clear filenames is critical
- Disk-space is not usually a critical concern here
- Your raw data will be much larger than your
output!
96Most Important Points
- Do not delete things
- Keep all old versions of your scripts by including the date in the filename (or using source-control)
- Version output files by date
- I have needed to go back to analyses done 7 years prior!
- Make regular (weekly) backups
- Try to pass this work off to professional sysadmins
- External hard-drives/USBs are okay if you cannot get access to network drives, but try to automate
97Course Overview
- Lecture 1: What is Statistics? Introduction to R
- Lecture 2: Univariate Analyses I (continuous)
- Lecture 3: Univariate Analyses II (discrete)
- Lecture 4: Multivariate Analyses I (specialized models)
- Lecture 5: Multivariate Analyses II (general models)
- Lecture 6: Sequence Analysis
- Lecture 7: Microarray Analysis I (Pre-Processing)
- Lecture 8: Microarray Analysis II (Multiple-Testing)
- Lecture 9: Machine-Learning
- Final Exam (written)