1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

2
Module Title of Module
3
Lecture 8: Microarrays II - Data Analysis
MBP1010 Dr. Paul C. Boutros Winter 2014

Aegeus, King of Athens, consulting the Delphic
Oracle. High Classical (430 BCE)
DEPARTMENT OF MEDICAL BIOPHYSICS
This workshop includes material originally
developed by Drs. Raphael Gottardo, Sohrab
Shah, Boris Steipe and others

4
Course Overview
  • Lecture 1: What is Statistics? Introduction to R
  • Lecture 2: Univariate Analyses I (continuous)
  • Lecture 3: Univariate Analyses II (discrete)
  • Lecture 4: Multivariate Analyses I (specialized
    models)
  • Lecture 5: Multivariate Analyses II (general
    models)
  • Lecture 6: Sequence Analysis
  • Lecture 7: Microarray Analysis I (Pre-Processing)
  • Lecture 8: Microarray Analysis II
    (Multiple-Testing)
  • Lecture 9: Machine-Learning
  • Final Exam (written)

5
House Rules
  • Cell phones to silent
  • No side conversations
  • Hands up for questions

6
Topics For This Week
  • Examples
  • Attendance
  • Pre-Processing
  • QA/QC
  • Microarray-Specific Statistics
  • ProbeSet remapping
  • Organizing omics studies

7
Example 1
You are conducting a study of osteosarcomas using
mouse models. You are using a strain of mice that
is naturally susceptible to these tumours at a
frequency of 20%. You are studying two transgenic
lines, one of which has a deletion of a putative
tumour suppressor (TS), the other of which has an
amplification of a putative oncogene (OG). Tumour
penetrance in these two lines is 100%. Your
hypothesis: tumours in mice lacking TS will be
smaller than those in mice with amplification of
OG, as assessed by post-mortem volume measurements
of the primary tumour. Your data:
OG (cm3) 5.2 1.9 5.0 6.1 4.5 4.8
TS (cm3) 3.9 7.1 3.1 4.4 5.0
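As a worked sketch (not part of the original slides; the course itself uses R, but the idea is the same in any language), the comparison above could be run as a Welch two-sample t-test on the volumes:

```python
from scipy import stats

# Tumour volumes (cm^3) transcribed from the slide
og = [5.2, 1.9, 5.0, 6.1, 4.5, 4.8]
ts = [3.9, 7.1, 3.1, 4.4, 5.0]

# Welch's t-test: two-sample comparison without assuming equal variances
t_stat, p_value = stats.ttest_ind(og, ts, equal_var=False)
print(t_stat, p_value)
```

With group means this close (about 4.6 vs 4.7 cm3) the test is unlikely to reject; whether a t-test is even appropriate at n = 5-6 per group is exactly the question the example raises.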
8
Example 2
You are conducting a study of osteosarcomas using
mouse models. You are studying transgenic animals
with deletion of a tumour suppressor (TS), or
with amplification of an oncogene (OG). You
consider the penetrance of tumours in a set of 8
different mouse strains. Your hypothesis: some
mouse strains lead to bigger tumours than others
when OG is amplified, considering only animals in
which tumours form. You measure tumour volume in
mm3 using calipers.
Strain 1 (mm3) 91 69 83
Strain 2 (mm3) 201 70 71
Strain 3 (mm3) 15 36 20
Strain 4 (mm3) 52 52 53
Strain 5 (weeks) 11 538 59
Strain 6 (mm3) 6 60 63
Strain 7 (mm3) 85 79 70
Strain 8 (mm3) 100 105 121
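One hedged way to sketch this multi-group comparison (illustrative only; note that the Strain 5 row, with its apparent unit mix-up and the 538 outlier, is part of the lesson): a rank-based Kruskal-Wallis test is less sensitive to that outlier than a one-way ANOVA would be.

```python
from scipy import stats

# Tumour volumes (mm^3) per strain, as listed on the slide
strains = [
    [91, 69, 83],
    [201, 70, 71],
    [15, 36, 20],
    [52, 52, 53],
    [11, 538, 59],   # note the apparent unit/outlier problem in this row
    [6, 60, 63],
    [85, 79, 70],
    [100, 105, 121],
]

# Kruskal-Wallis: rank-based, so less sensitive to the outlier above
h_stat, p_value = stats.kruskal(*strains)
print(h_stat, p_value)
```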
9
Example 3
You are conducting a study of osteosarcomas using
mouse models. You are using a strain of mice that
is naturally susceptible to these tumours at a
frequency of 20%. You are studying two transgenic
lines, one of which has a deletion of a putative
tumour suppressor (TS), the other of which has an
amplification of a putative oncogene (OG). Tumour
penetrance in these two lines is 100%. Your
hypothesis: mice lacking TS are less likely to
respond to a novel targeted therapeutic (DX) than
wildtype animals, as assessed by molecular
imaging.
TS (imaging response) Yes No Yes Yes No
WT (imaging response) Yes Yes Yes Yes No Yes
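Counting the responses gives a 2x2 table (TS: 3 Yes / 2 No; WT: 5 Yes / 1 No). With counts this small, Fisher's exact test is the natural sketch (illustrative, not from the slides):

```python
from scipy import stats

# Imaging responses from the slide: TS = 3 Yes / 2 No, WT = 5 Yes / 1 No
table = [[3, 2],   # TS: responders, non-responders
         [5, 1]]   # WT: responders, non-responders

# Fisher's exact test suits these small counts better than chi-squared
odds_ratio, p_value = stats.fisher_exact(table)
print(odds_ratio, p_value)
```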
10
Example 4
You are conducting a study of osteosarcomas using
mouse models. You are using a strain of mice that
is naturally susceptible to these tumours at a
frequency of 20%. You are studying two transgenic
lines, one of which has a deletion of a putative
tumour suppressor (TS), the other of which has an
amplification of a putative oncogene (OG). Based
on your previous data, you now hypothesize that
mice lacking TS will show a similar molecular
response to DX as those with amplification of OG.
You use microarrays to study 20,000 genes in each
line, and identify the following genes as changed
between drug-treated and vehicle-treated:
OG (DX-responsive genes): MYC KRAS CD53 CDH1 MUC1
MARCH1 PTEN IDH3 ESR2 RHEB CTCF STK11 MLL3 KEAP1
NFE2L2 ARID1A
TS (DX-responsive genes): MYC KRAS CD53 CDH1 FBW1
SEPT7 MUC1 MUC3 MUC9 RNF3
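A sketch of how the overlap between the two gene lists could be assessed (not from the slides): treat the 20,000 measured genes as the universe and ask how surprising the shared genes are under a hypergeometric model.

```python
from scipy import stats

og = {"MYC", "KRAS", "CD53", "CDH1", "MUC1", "MARCH1", "PTEN", "IDH3",
      "ESR2", "RHEB", "CTCF", "STK11", "MLL3", "KEAP1", "NFE2L2", "ARID1A"}
ts = {"MYC", "KRAS", "CD53", "CDH1", "FBW1", "SEPT7", "MUC1", "MUC3",
      "MUC9", "RNF3"}

n_universe = 20000          # genes measured on each array
overlap = len(og & ts)      # genes responsive in both lines

# Hypergeometric tail: chance of seeing >= this overlap by random draws
p_value = stats.hypergeom.sf(overlap - 1, n_universe, len(og), len(ts))
print(overlap, p_value)
```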
11
Example 5
You are conducting a study of osteosarcomas using
mouse models. You are using a strain of mice
naturally susceptible to these tumours at 20%
penetrance. You are studying two transgenic
lines, one with deletion of a tumour suppressor
(TS), the other with amplification of an oncogene
(OG). Tumour penetrance in these is 100%. You now
wonder whether tumour size varies with the age of
the animal, and suspect tumour size differs
between lines but is confounded by age
differences. Your data:
OG (cm3): 5.2 (17 weeks), 1.9 (9 weeks), 5.0 (15
weeks), 6.1 (15 weeks), 4.5 (21 weeks), 4.8 (20
weeks)
Wildtype (cm3): 1.1 (9 weeks), 1.5 (10 weeks),
2.1 (15 weeks), 2.5 (15 weeks), 0.3 (17 weeks),
2.2 (21 weeks)
TS (cm3): 3.9 (17 weeks), 7.1 (15 weeks), 3.1 (15
weeks), 4.4 (22 weeks), 5.0 (22 weeks)
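One hedged sketch of "line effect adjusted for age" (illustrative, not from the slides; shown for the OG-vs-TS contrast only): fit an additive linear model with age as a covariate, e.g. by ordinary least squares.

```python
import numpy as np

# Tumour volume (cm^3) and age (weeks) from the slide, OG and TS lines only
volume = np.array([5.2, 1.9, 5.0, 6.1, 4.5, 4.8,   # OG
                   3.9, 7.1, 3.1, 4.4, 5.0])       # TS
age = np.array([17, 9, 15, 15, 21, 20,
                17, 15, 15, 22, 22], dtype=float)
is_ts = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Additive model: volume ~ intercept + age + line, so the line effect
# is estimated while adjusting for the age confounder
X = np.column_stack([np.ones_like(age), age, is_ts])
coef, *_ = np.linalg.lstsq(X, volume, rcond=None)
intercept, age_effect, line_effect = coef
print(age_effect, line_effect)
```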
12
Example 6
You are conducting a study of osteosarcomas using
mouse models. You are using a strain of mice that
is naturally susceptible to these tumours at a
frequency of 20%. You are studying two transgenic
lines, one of which has a deletion of a putative
tumour suppressor (TS), the other of which has an
amplification of a putative oncogene (OG). Tumour
penetrance in these two lines is 100%. Your
hypothesis: mice lacking TS will acquire tumours
sooner than wildtype mice. You test the mice
weekly using ultrasound imaging. Your data:
TS (week of tumour) 4 7 7 6 5
OG (week of tumour) 3 9 3 2 4 3
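Week-of-onset data is really time-to-event data, so survival methods would be the rigorous choice; as a quick illustrative sketch (not from the slides), a rank-based comparison of the onset weeks:

```python
from scipy import stats

# Week of first tumour detection by ultrasound, from the slide
ts = [4, 7, 7, 6, 5]
og = [3, 9, 3, 2, 4, 3]

# Small n with ties: a rank-based test is a reasonable first look
u_stat, p_value = stats.mannwhitneyu(ts, og, alternative="two-sided")
print(u_stat, p_value)
```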
13
Topics For This Week
  • Examples
  • Attendance
  • Pre-Processing
  • QA/QC
  • Microarray-Specific Statistics
  • ProbeSet remapping
  • Organizing omics studies

14
Summary Point 1: Microarray data is analyzed
with a pipeline of sequential algorithms. This
pipeline defines the standard workflow for
microarray experiments.
15
Quantitation
?
16
Summary Point 2: This is an active research area.
17
Summary Point 3: These basic steps hold true
for all microarray platforms and types.
18
What Is BioConductor?
  • Bioconductor is an open source, open development
    software project to provide tools for the
    analysis and comprehension of high-throughput
    genomic data.
  • - BioConductor website

The vast majority of our analyses will use
BioConductor code, but there are clearly
non-BioConductor approaches.
19
I've outlined the general workflow. Each
technology and application has its own unique
characteristics to consider.
20
Let's Define an Affymetrix-Specific Workflow
21
Quantitation is done according to Affymetrix
defaults with minimal user intervention.
Quantitation
One-Channel array
Single-Channel array, so one simultaneous
normalization procedure
Typically ignored
?
22
Let's Collapse This a Bit and Re-Phrase Things
23
.CEL Files
?
24
First, let's go back to Pre-Processing
What exactly is pre-processing (aka
normalization)?
Why do we do it?
25
Sources of Technical Noise
  • Where does technical noise come from?

26
More Sources of Technical Noise
27
Any step in the experimental pipeline can
introduce artifactual noise
  • Array design
  • Array manufacturing
  • Sample quality
  • Sample identity → sequence effects?
  • Sample processing
  • Hybridization conditions → ozone?
  • Scanner settings

Pre-Processing tries to remove these systematic
effects
28
Important Note
Pre-processing is never a substitute for good
experimental design. This is not a course on
statistical design, but a few basic principles
should be mentioned.
Biological replicates are preferable to technical
replicates.
Always try to balance experimental groups.
If processing samples identically is not
possible, include controls for processing-effects.
29
Pre-Processing
What exactly is pre-processing (aka
normalization)?
Why do we do it?
30
Sources of Technical Noise
  • Where does technical noise come from?

31
More Sources of Technical Noise
32
Any step in the experimental pipeline can
introduce artifactual noise
  • Array design
  • Array manufacturing
  • Sample quality
  • Sample identity → sequence effects?
  • Sample processing
  • Hybridization conditions → ozone?
  • Scanner settings

Pre-Processing tries to remove these systematic
effects
33
Affymetrix Pre-Processing Steps
  1. Background Correction
  2. Normalization
  3. Probe-Specific Adjustment
  4. Summarizing multiple Probes into a single ProbeSet

Let's look at two common approaches
34
Introducing Two Major Affymetrix Pre-Processing
Methods
  • The two most commonly used methods are:
  • RMA: Robust Multi-array Average
  • MAS5: Microarray Analysis Suite version 5
  • MAS5 has strengths and weaknesses:
  • Sacrifices precision for accuracy
  • Can easily be used in clinical settings
  • RMA has strengths and weaknesses:
  • Sacrifices accuracy for precision
  • Challenging to integrate multiple studies
  • Reduces variance (critical for small-n studies)
  • Both are well accepted by journals and reviewers,
    perhaps RMA a bit more so. We'll talk about some
    of the mathematics later on in this course.

35
Approach 1 MAS5
  • Affymetrix put significant effort into developing
    good data pre-processing approaches
  • MAS5 was an attempt to develop a standard
    technique for 3′ expression arrays
  • The flaws of MAS5 led to an influx of research in
    this area.
  • The algorithm is best-described in an Affymetrix
    white-paper, and is actually quite challenging to
    reproduce exactly in R.

36
MAS5 Model
Observations = True Signal + Random Noise + Probe
Effects
Assumptions?
37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
What is RMA?
RMA = Robust Multi-Array Average
Why do we use a robust method? Robust
summaries really improve over the standard ones
by down-weighting outliers and leaving their
effects visible in residuals.
Why do we use multi-array? To put each chip's
values in the context of a set of similar values.
42
What is RMA?
It is a log-scale linear additive model.
Assumes all the chips have the same background
distribution.
Does not use the mismatch probe (MM) data from
the microarray experiments.
Why?
43
What is RMA?
Mismatch probes (MM) definitely have information
- about both signal and noise - but using it
without adding more noise is a challenge. We
should be able to improve the background
correction using MM without having the noise
level blow up; this is a topic of current
research (GCRMA). Ignoring MM decreases accuracy
but increases precision.
44
Methodology
Quantile normalization: the goal of this method
is to make the distribution of probe intensities
for each array in a set of arrays the same. The
method is motivated by the idea that a Q-Q plot
shows that the distributions of two data vectors
are the same if the plot is a straight diagonal
line, and not the same if it is anything else.
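The idea described above can be sketched in a few lines (illustrative Python, not from the slides; in practice you would use the implementation inside your pre-processing package, and note that this toy version breaks ties arbitrarily rather than averaging them): sort each array, average across arrays at each quantile, and map the averages back by rank.

```python
import numpy as np

def quantile_normalize(x):
    """Give every column (array) the same intensity distribution.

    Each value is replaced by the mean of the row-wise sorted values,
    assigned back in that column's original rank order.
    """
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)   # per-column ranks
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)    # reference distribution
    return mean_quantiles[ranks]

# Toy example: three "arrays" (columns) with different scales
arrays = np.array([[5.0, 4.0, 3.0],
                   [2.0, 1.0, 4.0],
                   [3.0, 4.0, 6.0],
                   [4.0, 2.0, 8.0]])
normalized = quantile_normalize(arrays)
print(normalized)
```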
45
Methodology
46
Methodology
Summarization: combining multiple probe
intensities of each probeset to produce
expression values. An additive linear model is
fit to the normalized data to obtain an
expression measure for each probe on the
GeneChip:
Yij = αj + βi + εij
47
Methodology
Yij = αj + βi + εij
Yij denotes the background-corrected, normalized
probe value corresponding to the ith GeneChip and
the jth probe within the probeset:
Yij = log2(PM - BG)ij
αj is the probe affinity of the jth probe
βi is the chip effect for the ith GeneChip
(log-scale expression level)
εij is the random error term
48
Methodology
Yij = αj + βi + εij
  • Estimate αj (probe affinity) and βi (chip
    effect) using a robust method:
  • Tukey's median polish (quick): fits iteratively,
    successively removing row and column medians and
    accumulating the terms, until the process
    stabilizes. The residuals are what is left at
    the end.
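A compact sketch of the median-polish fit described here (illustrative, not from the slides; real RMA implementations handle convergence checks and matrix orientation more carefully):

```python
import numpy as np

def median_polish(y, n_iter=10):
    """Fit overall + row + column effects by iteratively
    sweeping out row and column medians (Tukey)."""
    resid = y.astype(float).copy()
    row_eff = np.zeros(y.shape[0])
    col_eff = np.zeros(y.shape[1])
    for _ in range(n_iter):
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row_eff += row_med
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col_eff += col_med
    # move the median of the row effects into the overall term
    overall = np.median(row_eff)
    row_eff = row_eff - overall
    # Here rows = chips (chip effect beta_i), columns = probes (alpha_j)
    return overall, row_eff, col_eff, resid

# Toy probeset: 3 chips x 4 probes, log2 scale, perfectly additive
y = np.array([[8.0, 9.0, 7.5, 8.5],
              [9.0, 10.0, 8.5, 9.5],
              [7.0, 8.0, 6.5, 7.5]])
overall, chip, probe, resid = median_polish(y)
print(overall + chip)   # per-chip expression estimates
```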

49
RMA vs. MAS5
  • RMA sacrifices accuracy for precision
  • RMA is generally not appropriate for clinical
    settings
  • RMA provides higher sensitivity/specificity in
    some tests
  • RMA reduces variance (critical for small-n
    studies)
  • RMA is better accepted by journals and reviewers

50
Topics For This Week
  • Examples
  • Attendance
  • Pre-Processing
  • QA/QC
  • Microarray-Specific Statistics
  • ProbeSet remapping
  • Organizing omics studies

51
One key detail has been omitted so far
How do we know if our pre-processing actually
worked?
52
Can we determine how well our pre-processing
worked? Or whether our data look good?
53
Let's See Some Bad Data
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
Those Three Were From A Spike-In Experiment Done
by Affymetrix
58
(No Transcript)
59
(No Transcript)
60
(No Transcript)
61
Those Last Three Were From An Experiment We Did
On Rat Liver Samples
62
Were Those Bad Samples?
  • Lots of evident spatial artifacts
  • But in practice all samples were carried forward
    into analysis
  • And validation (RT-PCR) confirmed the overall
    study results for many genes

63
Eye-ball Assessments Are Hard
  • A couple of useful tricks:
  • Look at the distributions
  • Did quantile normalization work (for RMA)?
  • Look at the inter-sample correlations
  • Is one sample a strong outlier?
  • Look at the 3′ → 5′ trend across a ProbeSet

I know of no accepted, systematic QA/QC methods
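The inter-sample-correlation trick can be sketched as follows (illustrative, with simulated data standing in for a real expression matrix): compute the sample-sample correlation matrix and flag any array whose mean correlation to the rest is unusually low.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated expression matrix: 1000 genes x 6 samples; sample 5 is "degraded"
base = rng.normal(8, 1, size=1000)
data = np.column_stack([base + rng.normal(0, 0.2, 1000) for _ in range(6)])
data[:, 5] = rng.normal(8, 1, 1000)          # outlier: unrelated to the rest

corr = np.corrcoef(data, rowvar=False)       # 6 x 6 sample-sample correlations

# Flag samples whose mean correlation to the others is unusually low
mean_corr = (corr.sum(axis=0) - 1) / (corr.shape[0] - 1)
flagged = np.where(mean_corr < mean_corr.mean() - 2 * mean_corr.std())[0]
print(mean_corr.round(2), flagged)
```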
64
Distributions (Raw)
65
Distributions (normalized)
66
Inter-Sample Correlations
67
3′ → 5′ Signal Trend
68
What Do You Do If You Find a Bad Array?
  • Repeat it?
  • Drop the sample?
  • Include it but account for the noise in another
    way?

69
In This Case
  • We excluded a series of outlier samples
  • We believed these samples had been badly degraded
    because they were derived from FFPE blocks

70
Final Distribution
71
Final Heatmap
72
Topics For This Week
  • Examples
  • Attendance
  • Pre-Processing
  • QA/QC
  • Microarray-Specific Statistics
  • ProbeSet remapping
  • Organizing omics studies

73
T-tests
  • What are the assumptions of the t-test?
  • When would you feel comfortable using a t-test?

74
T-Test Alternative: Wilcoxon Rank-Sum
  • Also called:
  • U-test
  • Mann-Whitney (U) test
  • Some argue that for continuous microarray data
    there is rarely a good reason to use this test:
  • Low n: tests of normality are not very powerful
  • High n: the central limit theorem provides
    support
  • If the sample is normal, asymptotic efficiency is
    0.95

75
T-Test Alternative: Moderated Statistics
  • A series of highly complex methods based on
    Bayesian statistical methodologies
  • Gordon Smyth's limma R package is by far the most
    widely used implementation of this technique

The gene-wise variance term is shrunk by borrowing
power across all genes. This increases effective
power.
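The shrinkage idea can be sketched numerically (illustrative only: limma estimates the prior degrees of freedom d0 and prior variance s0² from all genes by empirical Bayes, whereas this sketch simply fixes them at hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-gene sample variances from a small experiment (d residual df per gene)
d = 4                                   # residual degrees of freedom
s2 = rng.chisquare(d, size=10000) / d   # observed gene-wise variances

# Moderated variance: shrink each gene's s2 toward a prior value s0^2.
# d0 and s0_sq are hypothetical fixed values chosen for illustration.
d0, s0_sq = 3.0, 1.0
s2_moderated = (d0 * s0_sq + d * s2) / (d0 + d)

# Shrinkage reduces the spread of the variance estimates
print(s2.std(), s2_moderated.std())
```

Stabilizing the variance estimates is what rescues the t-statistic when n is small: a gene with a tiny (by chance) sample variance no longer produces an absurdly large t.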
76
T-Test Alternative: Permutation Tests
  • SAM is the classic method
  • Most people suggest not using SAM today
  • Empirically estimate the null distribution:
    start with many samples, randomly resample or
    shuffle labels, and iterate
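The iterate/resample loop sketched on the slide amounts to this (illustrative Python, reusing the Example 1 volumes as stand-in data): shuffle the group labels many times to build an empirical null for the test statistic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Expression/measurement of one feature in two groups (Example 1 volumes)
group_a = np.array([5.2, 1.9, 5.0, 6.1, 4.5, 4.8])
group_b = np.array([3.9, 7.1, 3.1, 4.4, 5.0])
observed = group_a.mean() - group_b.mean()

# Build the null distribution by shuffling group labels
pooled = np.concatenate([group_a, group_b])
n_a, n_perm = len(group_a), 10000
null = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(pooled)
    null[i] = perm[:n_a].mean() - perm[n_a:].mean()

# Two-sided empirical p-value
p_value = (np.abs(null) >= abs(observed)).mean()
print(observed, p_value)
```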
77
Problems with Significance Testing
  • What happens if there are NO changes?
  • Imagine
  • You analyzed 1,000 clinical samples
  • 20,000 genes in the genome
  • P < 0.05
  • What if somebody comes and randomizes all your
    data?

78
You had a lot of Data
20,000 genes / array
All Randomized
1,000 patients
20,000,000 data points
Genes are mixed up together Patients are mixed
together
What happens if you analyze this data?
There should be NO real hits anymore!
79
What will you actually find?
Array: 20,000 genes
Threshold: p < 0.05
20,000 × 0.05 = 1,000 false positives
This is the multiple-testing problem.
There is a solution
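The solution previewed here is multiple-testing adjustment. As an illustrative sketch (not from the slides) of Benjamini-Hochberg FDR control, mirroring R's p.adjust(method = "BH"), applied to fully randomized data:

```python
import numpy as np

rng = np.random.default_rng(3)

# 20,000 p-values from a fully randomized experiment: uniform under the null
p_values = rng.uniform(size=20000)

naive_hits = (p_values < 0.05).sum()      # roughly 1,000 false positives

# Benjamini-Hochberg adjustment
order = np.argsort(p_values)
m = len(p_values)
ranked = p_values[order] * m / np.arange(1, m + 1)
adjusted = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
q_values = np.empty(m)
q_values[order] = np.clip(adjusted, 0, 1)

fdr_hits = (q_values < 0.05).sum()        # typically ~0 on random data
print(naive_hits, fdr_hits)
```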
80
A false-discovery rate adjustment (FDR) for
multiple testing considers all 20,000 p-values
simultaneously.
In this experiment there are lots of low
p-values, so we can use this to adjust the
p-values and find the true hits.
[Histogram of p-values, y-axis: count, with a
line marking the count expected under the null]
81
This is what you get from randomized data:
In this experiment there is NO enrichment for low
p-values, so no more hits than expected randomly.
82
Topics For This Week
  • Examples
  • Attendance
  • Pre-Processing
  • QA/QC
  • Microarray-Specific Statistics
  • ProbeSet remapping
  • Organizing omics studies

83
The Mask Production Makes Affymetrix Designs
Expensive To Change
Photolithographic mask
84
But there are multiple probes per gene
85
We Can Change Those Mappings!
Hybridized Chip
86
CDF File
  • Chip Definition File
  • This file maps Probes (positions) into ProbeSets
  • We can update those mappings
  • Ignore deprecated or cross-hybridizing probes
  • Merge multiple probes that recognize the same
    gene
  • Account for entirely new genes that were not
    known at the time of array-design

87
Sequence Mappings Are Slow
  • Requires aligning millions of 25 bp probes
    against the transcriptome and identifying the
    best match for each
  • Fortunately, other groups have done this for us,
    and regularly update their mappings

88
Many Probes Are Lost
89
But There Is Also A Major Benefit
Increased validation rates using RT-PCR (10%)
 – Sandberg et al., BMC Bioinformatics, 2007
90
Topics For This Week
  • Examples
  • Attendance
  • Pre-Processing
  • QA/QC
  • Microarray-Specific Statistics
  • ProbeSet remapping
  • Organizing omics studies

91
What Are The Outputs of A Microarray Study?
These files can be tens of GB for a typical
Affymetrix study
  • Primary Data
  • Raw image (.DAT file)
  • Quantitation (.CEL file)
  • Secondary Data
  • Normalized data (usually an ASCII text file)
  • QA/QC plots
  • Tertiary Data
  • Statistical analyses
  • Global visualization (e.g. heatmaps)
  • Downstream analyses (e.g. pathway,
    dataset-integration)

92
How Do You Organize These Data?
I recommend you put things on a fast, backed-up
network drive:
/data/
Organize data by project:
/data/Project
Create separate directories for each analysis:
/data/Project/raw
/data/Project/QAQC
/data/Project/pre-processing
/data/Project/statistical
/data/Project/pathway
93
How Do You Organize The Scripts?
I recommend you write a separate script for each
analysis, and put those in a standardized
(backed-up!) location, mirroring the directory
structure and naming of your dataset
directories. Some sub-structure here is often
useful:
/scripts/Project/pre-processing.R
/scripts/Project/statistical-univariate.R
/scripts/Project/statistical-multivariate.R
/scripts/Project/pathway/GOMiner.R
/scripts/Project/pathway/Reactome.R
/scripts/Project/integration/mRNACNV.R
/scripts/Project/integration/public-data.R
94
Why Many Small Scripts?
  • Monolithic scripts are hard to maintain
  • Easier to make errors
  • Accidentally re-using the same variable name
  • Harder to debug
  • Harder for somebody else to learn
  • Small scripts are more flexible
  • Quicker to modify/re-run a small part of your
    analysis
  • Easier to re-use the same code on another dataset
  • This is akin to the Unix mindset of systems
    design

95
What To Save?
  • Everything!!
  • All QA/QC plots (common reviewer request)
  • All pre-processed data (needed for GEO uploads)
  • Gene-wise statistical analyses
  • Not just the statistically-significant genes
  • Collapse all analyses into one file, though
  • All plots/etc
  • Using clear filenames is critical
  • Disk-space is not usually a critical concern here
  • Your raw data will be much larger than your
    output!

96
Most Important Points
  • Do not delete things
  • Keep all old versions of your scripts by
    including the date in the filename (or using
    source-control)
  • Version output files by date
  • I have needed to go back to analyses done 7 years
    prior!
  • Make regular (weekly) backups
  • Try to pass this work off to professional
    sysadmins
  • External hard-drives/USBs are okay if you cannot
    get access to network drives, but try to automate

97
Course Overview
  • Lecture 1: What is Statistics? Introduction to R
  • Lecture 2: Univariate Analyses I (continuous)
  • Lecture 3: Univariate Analyses II (discrete)
  • Lecture 4: Multivariate Analyses I (specialized
    models)
  • Lecture 5: Multivariate Analyses II (general
    models)
  • Lecture 6: Sequence Analysis
  • Lecture 7: Microarray Analysis I (Pre-Processing)
  • Lecture 8: Microarray Analysis II
    (Multiple-Testing)
  • Lecture 9: Machine-Learning
  • Final Exam (written)