Title: Canadian Bioinformatics Workshops
3Lecture 8: Microarrays II - Data Analysis
MBP1010 Dr. Paul C. Boutros Winter 2014
Aegeus, King of Athens, consulting the Delphic
Oracle. High Classical (430 BCE)
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally
developed by Drs. Raphael Gottardo, Sohrab
Shah, Boris Steipe and others
4Course Overview
- Lecture 1: What is Statistics? Introduction to R
- Lecture 2: Univariate Analyses I (continuous)
- Lecture 3: Univariate Analyses II (discrete)
- Lecture 4: Multivariate Analyses I (specialized models)
- Lecture 5: Multivariate Analyses II (general models)
- Lecture 6: Sequence Analysis
- Lecture 7: Microarray Analysis I (Pre-Processing)
- Lecture 8: Microarray Analysis II (Multiple-Testing)
- Lecture 9: Machine-Learning
- Final Exam (written)
5House Rules
- Cell phones to silent
- No side conversations
- Hands up for questions
6Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
7Example 1
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of 20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: tumours in mice lacking TS will be smaller than those in mice with amplification of OG, as assessed by post-mortem volume measurements of the primary tumour. Your data:
OG (cm3) 5.2 1.9 5.0 6.1 4.5 4.8
TS (cm3) 3.9 7.1 3.1 4.4 5.0
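As a worked sketch, this is how the comparison could be run in R once you have decided which test is appropriate (deciding that is the point of the example; the one-sided alternative here follows the stated hypothesis):

```r
# Tumour volumes from the slide, in cm^3
og <- c(5.2, 1.9, 5.0, 6.1, 4.5, 4.8)
ts <- c(3.9, 7.1, 3.1, 4.4, 5.0)

# Welch two-sample t-test; one-sided, per the hypothesis that
# TS tumours are smaller than OG tumours
res <- t.test(ts, og, alternative = "less")
res$p.value
```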
8Example 2
You are conducting a study of osteosarcomas using mouse models. You are studying transgenic animals with deletion of a tumour suppressor (TS), or with amplification of an oncogene (OG). You consider the penetrance of tumours in a set of 8 different mouse strains. Your hypothesis: some mouse strains lead to bigger tumours than others when OG is amplified, considering only animals in which tumours form. You measure tumour volume in mm3 using calipers.
Strain 1 (mm3) 91 69 83
Strain 2 (mm3) 201 70 71
Strain 3 (mm3) 15 36 20
Strain 4 (mm3) 52 52 53
Strain 5 (weeks) 11 538 59
Strain 6 (mm3) 6 60 63
Strain 7 (mm3) 85 79 70
Strain 8 (mm3) 100 105 121
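With more than two groups, a one-way comparison is needed; here is a non-parametric sketch in R (note that Strain 5 is labelled in weeks on the slide, which is exactly the kind of unit inconsistency to resolve before any analysis):

```r
# Tumour volumes from the slide (Strain 5's unit label needs checking first)
vol <- c(91, 69, 83,   201, 70, 71,   15, 36, 20,   52, 52, 53,
         11, 538, 59,  6, 60, 63,     85, 79, 70,   100, 105, 121)
strain <- factor(rep(1:8, each = 3))

# Kruskal-Wallis rank-sum test: do the strain distributions differ?
kw <- kruskal.test(vol ~ strain)
kw
```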
9Example 3
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of 20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS are less likely to respond to a novel targeted therapeutic (DX) than wildtype animals, as assessed by molecular imaging. Your data:
TS (imaging response) Yes No Yes Yes No
WT (imaging response) Yes Yes Yes Yes No Yes
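For binary response counts like these, a sketch using Fisher's exact test in R (the 2x2 table is tallied from the Yes/No calls on the slide):

```r
# TS: 3 Yes, 2 No; WT: 5 Yes, 1 No (tallied from the slide)
tab <- matrix(c(3, 2,
                5, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(line = c("TS", "WT"),
                              response = c("Yes", "No")))

# One-sided Fisher's exact test: are TS mice less likely to respond?
ft <- fisher.test(tab, alternative = "less")
ft
```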
10Example 4
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of 20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Based on your previous data, you now hypothesize that mice lacking TS will show a similar molecular response to DX as those with amplification of OG. You use microarrays to study 20,000 genes in each line, and identify the following genes as changed between drug-treated and vehicle-treated:
OG (DX-responsive genes): MYC KRAS CD53 CDH1 MUC1 MARCH1 PTEN IDH3 ESR2 RHEB CTCF STK11 MLL3 KEAP1 NFE2L2 ARID1A
TS (DX-responsive genes): MYC KRAS CD53 CDH1 FBW1 SEPT7 MUC1 MUC3 MUC9 RNF3
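Overlap between two gene lists can be assessed with a hypergeometric test; a sketch in R using the 20,000-gene background from the slide:

```r
og <- c("MYC", "KRAS", "CD53", "CDH1", "MUC1", "MARCH1", "PTEN", "IDH3",
        "ESR2", "RHEB", "CTCF", "STK11", "MLL3", "KEAP1", "NFE2L2", "ARID1A")
ts <- c("MYC", "KRAS", "CD53", "CDH1", "FBW1", "SEPT7", "MUC1", "MUC3",
        "MUC9", "RNF3")

k <- length(intersect(og, ts))   # observed overlap: 5 genes
# P(overlap >= k) if the TS list were drawn at random from 20,000 genes
p <- phyper(k - 1, length(og), 20000 - length(og), length(ts),
            lower.tail = FALSE)
p
```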
11Example 5
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice naturally susceptible to these tumours at 20% penetrance. You are studying two transgenic lines, one with deletion of a tumour suppressor (TS), the other with amplification of an oncogene (OG). Tumour penetrance in these is 100%. Your hypothesis: tumour size differs between lines, but you now wonder whether tumour size also varies with the age of the animal, so the comparison may be confounded by age differences. Your data:
OG (cm3) 5.2 (17 weeks) 1.9 (9 weeks) 5.0 (15
weeks) 6.1 (15 weeks) 4.5 (21 weeks) 4.8 (20
weeks)
Wildtype (cm3) 1.1 (9 weeks) 1.5 (10 weeks) 2.1
(15 weeks) 2.5 (15 weeks) 0.3 (17 weeks) 2.2 (21
weeks)
TS (cm3) 3.9 (17 weeks) 7.1 (15 weeks) 3.1 (15
weeks) 4.4 (22 weeks) 5.0 (22 weeks)
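When a covariate like age may confound the comparison, it can be added to a linear model; a sketch in R with the slide's data:

```r
volume <- c(5.2, 1.9, 5.0, 6.1, 4.5, 4.8,    # OG
            1.1, 1.5, 2.1, 2.5, 0.3, 2.2,    # Wildtype
            3.9, 7.1, 3.1, 4.4, 5.0)         # TS
age    <- c(17, 9, 15, 15, 21, 20,
            9, 10, 15, 15, 17, 21,
            17, 15, 15, 22, 22)
line   <- factor(rep(c("OG", "WT", "TS"), times = c(6, 6, 5)))

# Does line still matter after adjusting for age?
fit <- lm(volume ~ age + line)
anova(fit)
```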
12Example 6
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of 20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS will acquire tumours sooner than wildtype mice. You test the mice weekly using ultrasound imaging. Your data:
TS (week of tumour) 4 7 7 6 5
OG (week of tumour) 3 9 3 2 4 3
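Week-of-onset is time-to-event data, so survival methods (e.g. a log-rank test) would be the natural fit; as a simpler sketch with the two rows given on the slide, a rank-based comparison in R:

```r
ts <- c(4, 7, 7, 6, 5)      # week of tumour detection, TS line
og <- c(3, 9, 3, 2, 4, 3)   # week of tumour detection, OG line

# Rank-based two-sample comparison (ties present, so no exact p-value)
wt <- wilcox.test(ts, og, exact = FALSE)
wt$p.value
```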
13Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
14Summary Point 1: Microarray data is analyzed with a pipeline of sequential algorithms. This pipeline defines the standard workflow for microarray experiments.
15Quantitation
16Summary Point 2: This is an active research area.
17Summary Point 3: These basic steps hold true for all microarray platforms and types.
18What Is BioConductor?
"Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data." - BioConductor website
The vast majority of our analyses will use BioConductor code, but there are clearly non-BioConductor approaches.
19I've outlined the general workflow. Each technology and application has its own unique characteristics to consider.
20Let's Define an Affymetrix-Specific Workflow
21Quantitation is done according to Affymetrix
defaults with minimal user intervention.
Quantitation
One-Channel array
Single-Channel array, so one simultaneous
normalization procedure
Typically ignored
22Let's Collapse This a Bit And Re-Phrase Things
23.CEL Files
24First, let's go back to Pre-Processing
What exactly is pre-processing (aka
normalization)?
Why do we do it?
25Sources of Technical Noise
- Where does technical noise come from?
26More Sources of Technical Noise
27Any step in the experimental pipeline can
introduce artifactual noise
- Array design
- Array manufacturing
- Sample quality
- Sample identity (sequence effects?)
- Sample processing
- Hybridization conditions (ozone?)
- Scanner settings
Pre-Processing tries to remove these systematic
effects
28Important Note
Pre-processing is never a substitute for good
experimental design. This is not a course on
statistical design, but a few basic principles
should be mentioned.
Biological replicates are preferable to technical
replicates.
Always try to balance experimental groups.
If processing samples identically is not possible, include controls for processing effects.
29Pre-Processing
What exactly is pre-processing (aka
normalization)?
Why do we do it?
30Sources of Technical Noise
- Where does technical noise come from?
31More Sources of Technical Noise
32Any step in the experimental pipeline can
introduce artifactual noise
- Array design
- Array manufacturing
- Sample quality
- Sample identity (sequence effects?)
- Sample processing
- Hybridization conditions (ozone?)
- Scanner settings
Pre-Processing tries to remove these systematic
effects
33Affymetrix Pre-Processing Steps
- Background Correction
- Normalization
- Probe-Specific Adjustment
- Summarizing multiple Probes into a single ProbeSet
Let's look at two common approaches:
34Introducing Two Major Affymetrix Pre-Processing Methods
- The two most commonly used methods are:
- RMA: Robust Multi-array Average
- MAS5: Microarray Analysis Suite version 5
- MAS5 has strengths & weaknesses:
- Sacrifices precision for accuracy
- Can easily be used in clinical settings
- RMA has strengths & weaknesses:
- Sacrifices accuracy for precision
- Challenging to integrate multiple studies
- Reduces variance (critical for small-n studies)
- Both are well accepted by journals and reviewers, perhaps RMA a bit more so. We'll talk about some of the mathematics later on in this course.
35Approach 1: MAS5
- Affymetrix put significant effort into developing good data pre-processing approaches
- MAS5 was an attempt to develop a standard technique for 3' expression arrays
- The flaws of MAS5 led to an influx of research in this area
- The algorithm is best described in an Affymetrix white-paper, and is actually quite challenging to reproduce exactly in R
36MAS5 Model
Observations = True Signal + Random Noise + Probe Effects
Assumptions?
37(No Transcript)
38(No Transcript)
39(No Transcript)
40(No Transcript)
41What is RMA?
RMA = Robust Multi-array Average
Why do we use a "robust" method? Robust summaries really improve over the standard ones by down-weighting outliers and leaving their effects visible in residuals.
Why do we use "multi-array"? To put each chip's values in the context of a set of similar values.
42What is RMA?
It is a log scale linear additive model
Assumes all the chips have the same background
distribution
Does not use the mismatch probe (MM) data from
the microarray experiments
Why?
43What is RMA?
Mismatch probes (MM) definitely have information - about both signal and noise - but using it without adding more noise is a challenge.
We should be able to improve the background correction using MM without having the noise level blow up; this is a topic of current research (GCRMA).
Ignoring MM decreases accuracy but increases precision.
44Methodology
Quantile Normalization: the goal of this method is to make the distribution of probe intensities for each array in a set of arrays the same. This method is motivated by the idea that a Q-Q plot shows that the distribution of two data vectors is the same if the plot is a straight diagonal line, and not the same if it is anything else.
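The idea can be sketched in a few lines of base R: sort each array, average across arrays at each rank, then hand every array those common values back in its own original rank order (a sketch of the principle, not the production implementation in the affy/preprocessCore packages):

```r
quantile_normalize <- function(m) {
  # Rank each array's values (ties broken arbitrarily for simplicity)
  ranks <- apply(m, 2, rank, ties.method = "first")
  # Mean of the k-th smallest value across arrays = the reference distribution
  ref <- rowMeans(apply(m, 2, sort))
  # Give every array the reference values, in its own original rank order
  apply(ranks, 2, function(r) ref[r])
}

set.seed(1)
# Toy data: 3 arrays with very different intensity scales
x <- cbind(rexp(100, 1), rexp(100, 0.5), rexp(100, 0.25))
qn <- quantile_normalize(x)

# After normalization, all arrays share exactly the same set of values
identical(sort(qn[, 1]), sort(qn[, 2]))
```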
45Methodology
46Methodology
Summarization: combining the multiple probe intensities of each probeset to produce expression values. An additive linear model is fit to the normalized data to obtain an expression measure for each probeset on each GeneChip:
Yij = αj + βi + εij
47Methodology
Yij = αj + βi + εij
Yij denotes the background-corrected, normalized probe value corresponding to the ith GeneChip and the jth probe within the probeset:
Yij = log2(PM - BG)ij
αj is the probe affinity for the jth probe
βi is the chip effect for the ith GeneChip (log-scale expression level)
εij is the random error term
48Methodology
Yij = αj + βi + εij
- Estimate αj (probe affinity) and βi (chip effect) using a robust method
- Tukey's median polish (quick): fits iteratively, successively removing row and column medians, and accumulating the terms, until the process stabilizes. The residuals are what is left at the end.
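Base R ships Tukey's median polish as stats::medpolish(), so the summarization step can be sketched directly (toy numbers; real RMA works on background-corrected, quantile-normalized log2 PM values):

```r
set.seed(1)
# Toy probeset: 3 chips (rows) x 4 probes (columns), log2 scale
chip_effect <- c(8, 9, 10)          # true expression per chip
probe_affin <- c(-1, 0, 0.5, 1)     # true probe affinities
y <- outer(chip_effect, probe_affin, "+") + rnorm(12, sd = 0.1)

fit <- medpolish(y, trace.iter = FALSE)
# RMA-style expression measure for each chip: overall + chip (row) effect
expr <- fit$overall + fit$row
round(expr, 2)
```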
49RMA vs. MAS5
- RMA sacrifices accuracy for precision
- RMA is generally not appropriate for clinical settings
- RMA provides higher sensitivity/specificity in some tests
- RMA reduces variance (critical for small-n studies)
- RMA is better accepted by journals and reviewers
50Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
51One key detail has been omitted so far
How do we know if our pre-processing actually
worked?
52Can we determine how well our pre-processing worked? Or if our data looks good?
53Let's See Some Bad Data
54(No Transcript)
55(No Transcript)
56(No Transcript)
57Those Three Were From A Spike-In Experiment Done
by Affymetrix
58(No Transcript)
59(No Transcript)
60(No Transcript)
61Those Last Three Were From An Experiment We Did
On Rat Liver Samples
62Were Those Bad Samples?
- Lots of evident spatial artifacts
- But in practice all samples were carried forward into analysis
- And validation (RT-PCR) confirmed the overall study results for many genes
63Eye-ball Assessments Are Hard
- A couple of useful tricks:
- Look at the distributions
- Did quantile normalization work (for RMA)?
- Look at the inter-sample correlations
- Is one sample a strong outlier?
- Look at the 3' to 5' trend across a ProbeSet
I know of no accepted, systematic QA/QC methods
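The inter-sample correlation trick can be sketched in R: a degraded array still correlates with the others (genes keep their relative levels) but noticeably less well (toy data; real QC would use the normalized expression matrix):

```r
set.seed(1)
# Toy log2 expression: 1,000 genes with a shared gene-level signal
gene <- rnorm(1000, mean = 8, sd = 2)
good <- gene + matrix(rnorm(1000 * 4, sd = 0.5), ncol = 4)  # 4 clean arrays
bad  <- gene + rnorm(1000, sd = 3)                          # 1 noisy array
x <- cbind(good, bad)

cors <- cor(x)
# Mean correlation of each array with the other four
avg_cor <- (colSums(cors) - 1) / (ncol(x) - 1)
round(avg_cor, 2)   # the fifth array stands out as an outlier
```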
64Distributions (Raw)
65Distributions (normalized)
66Inter-Sample Correlations
673' to 5' Signal Trend
68What Do You Do If You Find a Bad Array?
- Repeat it?
- Drop the sample?
- Include it but account for the noise in another
way?
69In This Case
- We excluded a series of outlier samples
- We believed these samples had been badly degraded because they were derived from FFPE blocks
70Final Distribution
71Final Heatmap
72Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
73T-tests
- What are the assumptions of the t-test?
- When would you feel comfortable using a t-test?
74T-Test Alternative: Wilcoxon Rank-Sum
- Also called:
- U-test
- Mann-Whitney (U) test
- Some argue that for continuous microarray data there is rarely a good reason to use this test:
- Low n: tests of normality are not very powerful
- High n: the central limit theorem provides support
- If the sample is normal, asymptotic efficiency is 0.95
75T-Test Alternative: Moderated Statistics
- A series of highly complex methods based on Bayesian statistical methodologies
- Gordon Smyth's limma R package is by far the most widely used implementation of this technique
The gene-wise variance term is shrunk by borrowing power across all genes. This increases effective power.
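The core shrinkage step can be illustrated in a few lines (a simplified sketch of the idea, not limma's full machinery; limma also estimates the prior values d0 and s0^2 from the data rather than assuming them):

```r
set.seed(1)
# Per-gene sample variances for 1,000 genes, each with d = 4 df
d  <- 4
s2 <- rchisq(1000, df = d) / d

# Shrink each variance toward a common prior value s0^2
s0_sq <- mean(s2)   # crude stand-in for limma's estimated prior
d0    <- 4          # assumed prior degrees of freedom
s2_mod <- (d0 * s0_sq + d * s2) / (d0 + d)

# Moderated variances are pulled toward the prior: far less spread out
c(raw = var(s2), moderated = var(s2_mod))
```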
76T-Test Alternative: Permutation Tests
- SAM is the classic method
- Most people suggest not using SAM today
- Empirically estimate the null distribution:
Start with many samples, randomly sample, iterate
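The idea (build the null distribution by shuffling the group labels many times) can be sketched in base R with toy two-group data; SAM itself adds refinements on top of this:

```r
set.seed(1)
a <- c(5.2, 1.9, 5.0, 6.1, 4.5, 4.8)   # group 1
b <- c(3.9, 7.1, 3.1, 4.4, 5.0)        # group 2
obs <- mean(a) - mean(b)               # observed statistic

pooled <- c(a, b)
# Re-assign group labels at random many times to estimate the null
null_stats <- replicate(10000, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided permutation p-value
p_perm <- mean(abs(null_stats) >= abs(obs))
p_perm
```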
77Problems with Significance Testing
- What happens if there are NO changes?
- Imagine:
- You analyzed 1,000 clinical samples
- 20,000 genes in the genome
- p < 0.05
- What if somebody comes and randomizes all your data?
78You had a lot of Data
20,000 genes/array x 1,000 patients = 20,000,000 data points
All randomized: genes are mixed up together, patients are mixed together
What happens if you analyze this data?
There should be NO real hits anymore!
79What will you actually find?
Array: 20,000 genes
Threshold: p < 0.05
20,000 x 0.05 = 1,000 False Positives
This is called the multiple-testing problem.
There is a solution.
80A false-discovery rate adjustment (FDR) for multiple testing considers all 20,000 p-values simultaneously.
In this experiment there are lots of low p-values, so we can use this to adjust the p-values and find the true hits.
(Figure: p-value histogram; the low-p-value bins far exceed the expected uniform value)
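Both situations can be reproduced in base R with p.adjust() (Benjamini-Hochberg): under a pure null, the raw 0.05 threshold flags roughly 1,000 genes, while the FDR-adjusted values flag roughly none:

```r
set.seed(1)
p_null <- runif(20000)        # 20,000 genes, no real signal anywhere

n_raw <- sum(p_null < 0.05)   # raw threshold: ~1,000 false positives
q     <- p.adjust(p_null, method = "BH")
n_fdr <- sum(q < 0.05)        # after FDR adjustment: few, if any

c(raw = n_raw, fdr = n_fdr)
```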
81This is what you get from randomized data
In this experiment there is NO enrichment for low p-values, so no more hits than expected randomly.
82Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
83The Mask Production Makes Affymetrix Designs
Expensive To Change
Photolithographic mask
84But there are multiple probes per gene
85We Can Change Those Mappings!
Hybridized Chip
86CDF File
- Chip Definition File
- This file maps Probes (positions) into ProbeSets
- We can update those mappings:
- Ignore deprecated or cross-hybridizing probes
- Merge multiple probes that recognize the same gene
- Account for entirely new genes that were not known at the time of array-design
87Sequence Mappings Are Slow
- Requires aligning millions of 25 bp probes against the transcriptome and identifying the best match for each
- Fortunately, other groups have done this for us, and regularly update their mappings
88Many Probes Are Lost
89But There Is Also A Major Benefit
Increased validation rates using RT-PCR (10%)
Sandberg et al., BMC Bioinformatics, 2007
90Topics For This Week
- Examples
- Attendance
- Pre-Processing
- QA/QC
- Microarray-Specific Statistics
- ProbeSet remapping
- Organizing omics studies
91What Are The Outputs of A Microarray Study?
These files can be 10s of GB for a typical Affy study
- Primary Data
- Raw image (.DAT file)
- Quantitation (.CEL file)
- Secondary Data
- Normalized data (usually an ASCII text file)
- QA/QC plots
- Tertiary Data
- Statistical analyses
- Global visualization (e.g. heatmaps)
- Downstream analyses (e.g. pathway,
dataset-integration)
92How Do You Organize These Data?
I recommend you put things on a fast, backed-up network drive:
/data/
Organize data by project:
/data/Project
Create separate directories for each analysis:
/data/Project/raw
/data/Project/QAQC
/data/Project/pre-processing
/data/Project/statistical
/data/Project/pathway
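The layout can be created programmatically so every project starts identically (a sketch using a temporary directory; "Project" is a placeholder name):

```r
# Standard per-project analysis directories (tempdir() stands in for /data)
root <- file.path(tempdir(), "data", "Project")
subdirs <- c("raw", "QAQC", "pre-processing", "statistical", "pathway")

for (d in subdirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
basename(list.dirs(root, recursive = FALSE))
```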
93How Do You Organize The Scripts?
I recommend you write a separate script for each analysis, and put those in a standardized (backed-up!) location, mirroring the directory structure and naming of your dataset directories. Some sub-structure here is often useful:
/scripts/Project/pre-processing.R
/scripts/Project/statistical-univariate.R
/scripts/Project/statistical-multivariate.R
/scripts/Project/pathway/GOMiner.R
/scripts/Project/pathway/Reactome.R
/scripts/Project/integration/mRNACNV.R
/scripts/Project/integration/public-data.R
94Why Many Small Scripts?
- Monolithic scripts are hard to maintain:
- Easier to make errors
- Accidentally re-using the same variable name
- Harder to debug
- Harder for somebody else to learn
- Small scripts are more flexible:
- Quicker to modify/re-run a small part of your analysis
- Easier to re-use the same code on another dataset
- This is akin to the unix mindset of systems design
95What To Save?
- Everything!!
- All QA/QC plots (common reviewer request)
- All pre-processed data (needed for GEO uploads)
- Gene-wise statistical analyses
- Not just the statistically-significant genes
- Collapse all analyses into one file, though
- All plots/etc
- Using clear filenames is critical
- Disk-space is not usually a critical concern here
- Your raw data will be much larger than your
output!
96Most Important Points
- Do not delete things
- Keep all old versions of your scripts by including the date in the filename (or using source-control)
- Version output files by date
- I have needed to go back to analyses done 7 years prior!
- Make regular (weekly) backups
- Try to pass this work off to professional sysadmins
- External hard-drives/USBs are okay if you cannot get access to network drives, but try to automate
97Course Overview
- Lecture 1: What is Statistics? Introduction to R
- Lecture 2: Univariate Analyses I (continuous)
- Lecture 3: Univariate Analyses II (discrete)
- Lecture 4: Multivariate Analyses I (specialized models)
- Lecture 5: Multivariate Analyses II (general models)
- Lecture 6: Sequence Analysis
- Lecture 7: Microarray Analysis I (Pre-Processing)
- Lecture 8: Microarray Analysis II (Multiple-Testing)
- Lecture 9: Machine-Learning
- Final Exam (written)