Title: Analysis of Large Scale Gene Expression Data
1Analysis of Large Scale Gene Expression Data
- John Quackenbush
- September 2005
2Microarray Analysis
- General microarray overview
- How do platforms compare?
- Should I subtract background?
- TM4 Array Solutions
- Annotating arrays
- How complete is the genome?
- Tools for array analysis
- Experimental design
- Normalization
- What do we measure?
- Some concepts from statistics
- Distributions
- Randomization Tests
- Multiple Testing
- Finding significant genes
- Fold Change
- T-tests
- ANOVA
- SAM
- Hierarchical clustering
- Bootstrapping
- Jack-knifing
- Self-organizing trees
- K-means/k-medians Clustering
- Self-organizing maps
- Template matching
- Gene shaving
- Relevance networks
- Principal Components Analysis
- Support Vector Machines (SVM)
- k-Nearest Neighbors
- EASE Meta-Analysis
- Demo Data Set 1
3General Microarray Analysis
4Levels of Biological Information DNA mRNA Prote
ins Informational Pathways Informational
Networks Cells Organs Individuals Populations
Ecologies
Genomics Functional Genomics Proteomics Metabolo
mics Systems Biology Cellular Biology Medicine Med
icine Genetics Ecology
The Future!
5February 2001 Completion of the Draft Human
Genome
April 14, 2003 The Human Genome is completed
again!
October 2004 The Human Genome is now really
finished!
But what does finished mean???
6Where are the pressing questions?
- Can we find the genes and assign them functions?
- Can we predict protein structures and functions?
- Can we reconstruct metabolic, signaling, and
other pathways? - Can we reconstruct informational networks?
- Can we link genotype to phenotype?
- Can we use genotype/phenotype to predict relevant
outcome? - Can we use cross-species comparisons to learn
something?
7The Beast Microarray Robot from Intelligent
Automation lthttp//www.ias.comgt
8Microarray Overview I
9Microarray Overview II
10Microarray Overview II
11Affymetrix GeneChip Expression Analysis
Generate DNA Sequence
Design and synthesize chips
ACGTAGCTAGCTGATCGTAGCTAGCTAGCTAGCTGATC
ACGTAGCTAGCTGATCGTAGCTAGCTAGCTAGCTGATC
12Affymetrix GeneChip Expression Analysis
13Microarray Expression Analysis
Gene
1432,448 element mouse array
Thanks to M. Ko (NIA) and B. Soares (BMAP)
kidney vs. heart 15?g total RNA
Shuibang Wang, Yan Yu, Renee Gaspard
15Steps in the Process
- Select array elements and annotate them
- Build a database to manage stuff
- Print arrays and manage the lab
- Hybridize and analyze images manage data
- Analyze hybridization data and get results
16What can I do with arrays?
17Types of Experiments
- Class Comparison
- Can I find genes that distinguish between two
classes, such as tumor and normal? - Class Discovery
- Given what I think is a uniform group of samples,
can I find subsets that are biologically
meaningful? - Classification
- Given a set of samples in different classes, can
I assign a new, unknown sample to one of the
classes? - Mechanistic Studies
- Can I discover a causative mechanism associated
with the distinction between classes?
18How do platforms compare?
19Microarray Platform Comparisons
Platform comparisons may result in little to no
replication of gene expression levels. Findings
may replicate within a platform (top), but not
between platforms (bottom). Spotted long (80mer)
oligos versus Affymetrix 25mer chip.
Rogojina et al. 2003. Use of Affymetrix to
spotted oligonucleotide microarrays using two
retinal epithelial cell lines. Molecular Vision
9482-96
20Experimental Design
Gene expression in Angiotensin II-induced
hypertension acute versus chronic effects
Saline Ang II
- Two Factors
- Factor 1 Treatment saline or Angiotensin II
- Factor 2 Time
- acute (24 h) or chronic (14 day)
acute chronic
Initial RNA samples 1.5 ug total RNA Total RNA
was amplified (TIGR SOP M022) to generate
antisense cRNA cRNA yield 40 - 60 ug from
amplification
Jennie Larkin, Harry Gavras thanks to Affymetrix
21Experimental Design
Total RNA cRNA aminoallyl labelled cDNA 1.5
ug 50 ug
amplification
TIGR Microarray Protocol
Affymetrix GeneChip Protocol
Cy3/Cy5 dye coupling and hybridization Wash and
scan
Second cycle small sample protocol Biotinylated
cRNA Fragmentation Wash and stain Scan
DATA
Jennie Larkin, Harry Gavras thanks to Affymetrix
22Affymetrix TIGR shared data
- 11,714 TCs were present on both platforms
- 5854 had values in gt 80 of experiments
- 92 (N 5358) did not statistically differ
between platforms (2-factor ANOVA, a .05)
Jennie Larkin, Harry Gavras thanks to Affymetrix
23How do platforms compare?
Jennie Larkin, Harry Gavras thanks to Affymetrix
24Two platforms similar values
- Both OSF2 and procollagen type III alpha 1 were
up-regulated in chronic Ang II treatment. - Even the pattern of variability between
biological replicates was reproduced between
platforms.
Jennie Larkin, Harry Gavras thanks to Affymetrix
25Two platforms similar values
- Both ADAMTS1 and ADRP were up-regulated in acute
Ang II treatment. - Again, both the values and the pattern of
variability between biological replicates were
reproduced between platforms.
Jennie Larkin, Harry Gavras thanks to Affymetrix
26Not all is perfect
There were differences between the arrays, but
these were the exception rather than the
rule. Metallothionein-II was only up-regulated
in acute on the TIGR arrays, but was up-regulated
in both acute and chronic on the Affy chips.
Jennie Larkin, Harry Gavras thanks to Affymetrix
27When Platforms Agree
28When Platforms Disagree, They Disagree
29General Microarray Strategy
- Choose an experimentally interesting and
tractable model system - Design an experiment with comparisons between
related variants - Include sufficient biological replication to make
good estimates - Hybridize and collect data
- Normalize and filter
- Mine data for biological patterns of expression
- Integrate expression data with other ancillary
data such, including genotype, phenotype, the
genome, and its annotation
30Should I subtract background?
31Non-normalized, background subtracted data
100
32Annotating andComparing Arrays
33TIGR Gene Indices home page
www.tigr.org/tdb/tgi
83 species gt26,000,000 sequences
34The Gene Indices home page
biocomp.dfci.harvard.edu/tgi/tgipage.html
85 species gt28,000,000 sequences
35TGICL Tools are available with more coming
36Gene Index Assembly process
37A TC Example
38GO Terms and EC Numbers
Babak Parvizi
39Everything is mapped in the context of the
relevant genome
40The TIGR Gene Indices lthttp//www.tigr.org.tdb/tdb
/tgigt
Dan Lee, Ingeborg Holt
41Building TOGs Reflexive, Transitive Closure
Thanks to Woytek Makalowski and Mark Boguski
42TOGA An Sample Alignment bithoraxoid-like
protein
43(No Transcript)
44RESOURCERER
Jennifer Tsai
http//pga.tigr.org/tools.shtml
45RESOURCERER An Example
46RESOURCERER Using Genetic Markers
Now Available QTL-based searches, promoter
region extraction, GO-Slim analysis
47Just added
48The Features are the same
49How Complete isthe Human Genome?
50is easy!
Gene Finding in Humans
Razvan Sultana
51is easy?
Gene Finding in Humans
Razvan Sultana
52is difficult?
Gene Finding in Humans
Razvan Sultana
53is difficult?
Gene Finding in Humans
A genome and its annotation is only a hypothesis
that must be tested.
Razvan Sultana
54Select TIGRHuman TCsnot mappingto human
genome(about 9000)
Razvan Sultana Laura Beaver Bryan Frank
55Answer 61 Positives!
SequenceValidationUnderway Approximately50
appear good
Razvan Sultana Laura Beaver Bryan Frank
56Tools for Array Analysis
57TM4 Resources
TM4 Website
- Application Downloads
- Documentation and FAQs
- And Much More!
http//www.tm4.org
CD Contents
- All TM4 Applications
- User Manuals
- Supplementary Documentation
- Sample Data Sets
TM4 Reference
Saeed, A.I., Sharov, V., White, J., Li, J.,
Liang, W., Bhagabati, N., Braisted, J., et al.
2003. TM4 A Free, Open-Source System for
Microarray Data Management and Analysis.
BioTechniques 34 374-378
58Microarray Data Flow
Scanner
Printer
.tiff Image File
Image Analysis
Raw Gene Expression Data
Gene Annotation
Normalization / Filtering
Normalized Data with Gene Annotation
Expression Analysis
Interpretation of Analysis Results
59MAD Database Schema
60MADAM Microarray Data Manager
MAGE-ML exportnow functional
Joseph White Jerry Li Alexander Saeed Vasily
Sharov Syntek Inc.
? Available with OSI source and MySQL
61MAGE-ML Export with Madam
62MIDAS Data Analysis
Wei Liang
Variance Stabilization, Adding Error
Models, MAANOVA, Automated Reporting
? Available with source
63MeV Data Mining Tools
Alexander Saeed Alexander Sturn Nirmal
Bhagabati John Braisted Syntek Inc. Datanaut,
Inc.
Now with scriptingand the ability to save state
andrestart analysis
? Available with OSI source www.tm4.org
64Microarray Data Flow
Scheduler (Machine Scheduling)
SliTrack (Machine Control)
.tiff Image File
PCR Score
MABCOS (Barcode System)
Exp Designer
Spotfinder (Image Analysis)
MADAM (Data Manager)
Expression Data
Raw .tav File
Miner (.tav File Creator)
Raw .tav File
MIDAS (Normalization)
GenePix Converter
Normalized .tav File
Query Window
MeV (Data Analysis)
Interpretation
65Protocols and Methods
66Publications on Microarray Tools and Techniques
67Publications on Microarray Data Analysis and
Design
68Publications on Microarray Data Exchange Standards
MIAME Standards Nature family, Cell family, EMBO
reports, Bioinformatics,Genome Research, Genome
Biology, Science, The Lancet,Science, and
others.
69The Starting Point Designing the Experiment
70The Experimental Design
- The Experimental Design dictates a good deal of
what you can do with the data - Good normalization and processing reflects the
experimental design - The design also facilitates certain comparisons
between samples and provides the statistical
power you need for assigning confidence limits to
individual measurements - The design must reflect experimental reality
- The most straight-forward designs compare
expression in two classes of samples to look for
patterns that distinguish them.
71Sample Pairing for Co-Hybridization Experiments
Direct Comparison with Dye Swap
A1
B1
A2
B2
A3
B3
A4
B4
A1
B1
A2
A3
B2
B3
A4
B4
- RNA sample is not limiting (e.g. plenty of
sample) - Flip dyes account for any gene-dye effects
Balanced Block Design
A1
B1
A2
B2
A4
B4
A3
B3
- RNA sample is limiting
- Balanced blocking accounts for any gene-dye
effects
72Multiple Sample Pairings
Reference Design (Indirect Comparison)
- More than two samples are compared
- (e.g. tumor classification, time course)
- Flip dyes are not necessary but can be done to
increase precision - Ratio values are inferred (indirect)
- Suited for cluster analysis need common
reference
Loop Design
73Why perform flip-dye experiments?
74Loops and Reference Designs
23 Hybs
10 hybs
Standard flip-dye expt
S. Wang , K. Kerr, J. Quackenbush, G. Churchill
75Loops and Reference Designs
Both approaches can give equivalent results
S. Wang , K. Kerr, J. Quackenbush, G. Churchill
76Loop vs. Reference Designs
- Loop design
- Can provide direct measurements
- Give more data on each experimental sample
with the same number of hybs - Require more RNA per sample
- Can unwind with a bad sample or for a gene
with bad data - Reference design
- Easily extensible
- Simple interpretation of all results
- Requires less RNA per sample
- Less sensitive to bad RNA samples and bad
array elements
77Experimental Design
A1
A2
A3
A4
B1
B2
B3
B4
Keep it simple!
78One Possible Experimental Paradigm
Examining Genotype, Phenotype, and Environment
Parental - stressed
Derived - stressed
Parental - unstressed
Derived - unstressed
79Basic Design Principles
- Biological replicas are more informative than
correlated replicas (independent RNA, independent
slides) - More replicas are better higher statistical
power - For loops, hybridizations of individual samples
should be balanced (as many Cy3 as Cy5
labelings) - Self-self hybs add data on reproducibility and
can be used to produce error models - At a minimum, should use dye swap replicates to
compensate for any dye biases in labeling or
detection
80How Many Replicates?
(Simon et al., Genetic Epidemiology 23 21-36,
2002)
n 4(za/2 zb)2 / (d/1.4s)2
Where za/2 and zb are normal percentile values at
significance level a and false negative rate b
parameter d represents the minimum detectable
log2 ratio and s represents the SD of log ratio
values. For a 0.001 and b 0.05, then za/2
-3.29 and zb -1.65. Assume d 1.0 (2-fold
change) and s 0.25, Therefore n 12 samples
(6 query and 6 control).
81Normalization
82Why Normalize Data?
- Goal is to measure ratios of gene expression
levels (ratio)i Ri/Giwhere Ri/Gi are,
respectively , the measured intensities for the
ith spot. - In a self-self hybridization, we would expect all
ratios to be equal to one Ri/Gi 1 for all i.
But they may not be. - Why not?
- Unequal labeling efficiencies for Cy3/Cy5
- Noise in the system
- Differential expression
- Normalization brings (appropriate) ratios back to
one.
83The Starting Point The Ratio
84Log2(ratio) measures treat up- and down-regulated
genes equally
log2(1) 0 log2(2) 1 log2(1/2) -1
85Normalization Approaches A variety exist
- Total Intensity
- Linear Regression
- Ratio statistics described by Chen, Dougherty,
Bittner - J. Biomed. Optics (1997) 2(4) 364-374
- Iterative log(ratio) mean centering
- Lowess Correction
- And others
- Any of these using
- Entire Data Set
- User-defined Data Set/Controls
86Normalization Approaches
Using the Entire Data Set
- Probe Quantification less important
- No assumption on which genes constitute
- housekeeping set
- Uses all the data
- No independent confirmation
User-defined Data Set/Controls
- Requires definition of housekeeping set
- or good added controls
- Requires good RNA quantitation
- Ignores much data
87Normalization Approaches
The Solution(?)
- The best technique is experiment dependent
- A good approach is to use a combination of
techniques - All analysis methods depend on an
intelligent Experimental design
88Resource A. thaliana DNA Clones for Spiking
- chlorophyll a/b binding protein (Cab)
- RUBISCO activase (RCA)
- ribulose-1,5-bisphosphate carboxylase/oxygenase
(RbcL) - lipid transfer protein 4 (LTP4)
- lipid transfer protein 6 (LTP6)
- papain-type cysteine endopeptidase (XCP2)
- root cap 1 (RCP1)
- NAC1
- triosphosphate isomerase (TIM)
- ribulose-5-phosphate kinase (PRKase)
SP6 Transcription Start
5ATTTA GGTGA CACTA TAGAA TACAA GCTTG GGCTG
CAGGT CGACT CTAGA GGATC CCCGG GCGAG CTCCC
AAAAA AAAAA AAAAA AAAAA AAAAA AAAAA CCGAA TTC3
SP6 Promoter
HindIII
PstI
SatI AccI HincII
XbaI
EcoRI
SacI
AvaI SmaI
BamHi
Clone set available at lthttp//pga.tigr.orggt
89Resource B. subtillus DNA Clones for Spiking
- pGIBS-lys ATCC 87482
- pGIBS-phe ATCC 87483
- pGIBS-thr ATCC 87484
- pGIBS-trp ATCC 87485
- pGIBS-dap ATCC 87486
- Artificial polyA added to the 3end
Clone set available at lthttp//www.atcc.orggt
90Normalization Approaches Total Intensity
- Conceptually, this is the simplest approach
- Assumption Total RNA (mass) used is same for
- both samples.
- So, averaged across thousands of genes, total
- hybridization should be the same for both samples
91Before and After Normalization
92The Starting Point The R-I Plot
- Data exhibits an intensity-dependent structure
- Uncertainty in measurements is greater at lower
intensities - Uncertainty in ratio measurements generally
greater at lower intensities - Plot log2(R/G) vs. log2(RG)
- variation Terry Speeds M-A plot with
- (½ )log2(RG)
93Good Data
94Bad Data from Parts Unknown
Each pen group is colored differently
Gary Churchill
95Lowess Normalization
- Observations
- Intensity-dependent structure
- Data not mean centered at log2(ratio) 0
96LOWESS (Contd)
- Local linear regression model
- Tri-cube weight function
- Least Squares
97LOWESS Results
98Variance stabilization/regularization
- Measurements of expression vary between any two
assays - This can be affected by changes in the mean
expression level, but normalization can help
reduce those differences - However, the variance, or spread in the data, can
be quite different between replicates (or pen
groups) - Variance stabilization can rescale the data for
each experiment to make these more comparable
99A Box Plot can show the difference in
variancebetween replicates
100Standard Deviation Regularization
Let aij be the raw log ratio for the jth spot in
ith block (or slide)
aij be the scaled log ratio for the jth spot in
ith block (or slide)
where Nj denotes the number of genes ith block or
ith slide, M denotes the number of blocks or
slides, aij denotes the log ratio mean of ith
block (or ith slide)
101MIDAS Normalization Methods(Standard deviation
regularization)
Standard deviation regularization
Assumption log-ratio standard deviations within
each block or slide are the same.
Variance regularization can remove the bias
102There are Limits to what you can Measure
103The Limits of log-ratios The space we explore
104The Limits of log-ratios The space we explore
105The Limits of log-ratios The space we explore
106What do we measure?
107Dealing with Data
Before any pattern analysis can be done, one must
first normalize and filter the data.
Normalization facilitates comparisons between
datasets.
Filtering transformations can eliminate
questionable data and reduce complexity.
108Expression Elements
109Ratio vs. log-ratio
Ai Red intensity Bi Green intensity
Let
R
4
Gene1
3
log2(AB)
2
1
Gene2
0
AB
Advantages of log transformation Treat
up-regulated and down-regulated genes
symmetrically! Transfer multiplication
operations to addition operations! Because
110Expression Vectors
Gene Expression Vectors represent the expression
of a gene over a set of experimental conditions
or sample types.
111Multiple Samples?
- Goal is identify genes (or experiments) which
havesimilar patterns of expression - This is a problem in data mining
- Clustering Algorithms are most widely used
- Types
- Agglomerative Hierarchical
- Divisive k-means, SOMs
- Hybrid SOTA
- Nonclustering Principal Component Analysis
(PCA) - All depend on how one measures distance
112Expression Vectors
- Crucial concept for understanding clustering
- Each gene is represented by a vector where
coordinates are its values log(ratio) in each
experiment - x log(ratio)expt1
- y log(ratio)expt2
- z log(ratio)expt3
- etc.
113Expression Vectors
- Crucial concept for understanding clustering
- Each gene is represented by a vector where
coordinates are its values log(ratio) in each
experiment - x log(ratio)expt1
- y log(ratio)expt2
- z log(ratio)expt3
- etc.
- For example, if we do six experiments,
- Gene1 (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
- Gene2 (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
- Gene3 (1.2, 0.5, 0, -0.25, -0.75, -1.4)
- etc.
114Expt 1 Expt 2 Expt 3 Expt 4 Expt 5 Expt 6
Expression Matrix
- These gene expression vectors of log(ratio)
values can be used to construct an expression
matrix
- Gene1 -1.2 -0.5 0 0.25 0.75
1.4 - Gene2 0.2 -0.5 1.2 -0.25 -1.0
1.5 - Gene3 1.2 0.5 0 -0.25
-0.75 -1.4 - etc.
- This is often represented as a red/green colored
matrix
115Expression Matrix
The Expression Matrix is a representation of data
frommultiple microarray experiments.
Each element is a log ratio, usually log 2
(Cy5/Cy3)
Black indicates a log-ratio of zero, i.e., Cy5
and Cy3 are very close in value
Green indicates a negative log-ratio, i.e., Cy5 lt
Cy3
Gray indicates missing data
Red indicates a positive log ratio, i.e, Cy5 gt
Cy3
116Expression Vectors As Points inExpression Space
Similar Expression
Experiment 3
Experiment 2
Experiment 1
117Distance metrics
- Distances are measured between expression
vectors - Distance metrics define the way we measure
distances - Many different ways to measure distance
- Euclidean distance
- Pearson correlation coefficient(s)
- Manhattan distance
- Mutual information
- Kendalls Tau
- etc.
- Each has different properties and can reveal
different features of the data
118Distance and Similarity
The ability to calculate a distance (or
similarity - its inverse) between two expression
vectors is fundamental to clustering
algorithms Distance between vectors is the
basis upon which decisions are made when grouping
similar patterns of expression Selection of a
distance metric defines the concept of distance
for a particular experiment
119Distance a measure of similarity between genes
Some distances (MeV provides 11 metrics)
120Distance Is Defined by a Metric
121Distance is Defined by a Metric
1.4
-0.90
4.2
-1.00
122 Gene1 Gene2 Gene3 Gene4 Gene5 Gene6
Distance Matrix
- Once a distance metric has been selected, the
starting point for all clustering methods is a
distance matrix
- Gene1 0 1.5 1.2 0.25
0.75 1.4 - Gene2 1.5 0 1.3 0.55
2.0 1.5 - Gene3 1.2 1.3 0 1.3
0.75 0.3 - Gene4 0.25 0.55 1.3 0 0.25
0.4 - Gene5 0.75 2.0 0.75 0.25
0 1.2 - Gene6 1.4 1.5 0.3 0.4
1.2 0 - The elements of this matrix are the pair-wise
distances. Note that the matrix is symmetric
about the diagonal.
123MeV Data Mining Tools
Alexander Saeed Alexander Sturn Nirmal
Bhagabati John Braisted Syntek Inc. Datanaut,
Inc.
? Available with OSI source
124Some Concepts from Statistics
125Probability distributions
- The probability of an event is the likelihood
of its occurring. - It is sometimes computed as a relative frequency
(rf), where
The probability of an event can sometimes be
inferred from a theoretical probability
distribution, such as a normal distribution.
126Normal distribution
127- Less than a 5 chance that the sample with mean
s came from Population 1 - s is significantly different from Mean 1 at the
p lt 0.05 significance level. - But we cannot reject the hypothesis that the
sample came from Population 2
128Probability and Expression Data
- Many biological variables, such as height and
weight, can reasonably be assumed to approximate
the normal distribution. - But expression measurements? Probably not.
- Fortunately, many statistical tests are
considered to be fairly robust to violations of
the normality assumption, and other assumptions
used in these tests. - Randomization / resampling based tests can be
used to get around the violation of the normality
assumption. - Even when parametric statistical tests (the ones
that make use of normal and other distributions)
are valid, randomization tests are still useful.
129Outline of a randomization test - 1
1. Compute the value of interest (i.e., the
test-statistic s) from your data set.
2. Make fake data sets from your original
data, by taking a random sub-sample of the data,
or by re-arranging the data in a random fashion.
Re-compute s from the fake data set.
130Outline of a randomization test - 2
3. Repeat step 2 many times (often several
hundred to several thousand times) and record of
the fake s values from step 2 4. Draw
inferences about the significance of your
original s value by comparing it with the
distribution of the randomized (fake) s values
131Outline of a randomization test - 3
- Rationale
- Ideally, we want to know the behavior of the
larger population from which the sample is drawn,
in order to make statistical inferences. - Here, we dont know that the larger population
behaves like a normal distribution, or some
other idealized distribution. All we have to work
with are the data in hand. - Our fake data sets are our best guess about
this behavior (i.e., if we had been pulling data
at random from an infinitely large population, we
might expect to get a distribution similar to
what we get by pulling random sub-samples, or by
reshuffling the order of the data in our sample)
132The problem of multiple testing (adapted from
presentation by Anja von Heydebreck,
MaxPlanckInstitute for Molecular Genetics,
Dept. Computational Molecular Biology, Berlin,
Germany http//www.bioconductor.org/workshops/Heid
elberg02/mult.pdf)
- Lets imagine there are 10,000 genes on a chip,
and - none of them is differentially expressed.
- Suppose we use a statistical test for
differential expression, where we consider a gene
to be differentially expressed if it meets the
criterion at a p-value of p lt 0.05.
133The problem of multiple testing 2
- Lets say that applying this test to gene G1
yields a p-value of p 0.01 - Remember that a p-value of 0.01 means that there
is a 1 chance that the gene is not
differentially expressed, i.e., - Even though we conclude that the gene is
differentially expressed (because p lt 0.05),
there is a 1 chance that our conclusion is
wrong. - We might be willing to live with such a low
probability of being wrong - BUT .....
134The problem of multiple testing 3
- We are testing 10,000 genes, not just one!!!
- Even though none of the genes is differentially
expressed, about 5 of the genes (i.e., 500
genes) will be erroneously concluded to be
differentially expressed, because we have decided
to live with a p-value of 0.05 - If only one gene were being studied, a 5 margin
of error might not be a big deal, but 500 false
conclusions in one study? That doesnt sound too
good.
135The problem of multiple testing 4
- There are tricks we can use to reduce the
severity of this problem. - They all involve slashing the p-value for each
test (i.e., gene), so that while the critical
p-value for the entire data set might still equal
0.05, each gene will be evaluated at a lower
p-value. - Well go into some of these techniques later.
136The problem of multiple testing 5
- Dont get too hung up on p-values.
- Ultimately, what matters is biological
relevance. - P-values should help you evaluate the strength of
the evidence, rather than being used as an
absolute yardstick of significance. - Statistical significance is not necessarily the
same as biological significance.
137Finding Significant Genes
- Assume we will compare two conditions with
multiple replicates for each class - Our goal is to find genes that are significantly
different between these classes - These are the genes that we will use for later
data mining
138Finding Significant Genes
- Average Fold Change Difference for each gene
- suffers from being arbitrary and not taking into
account systematic variation in the data
139Finding Significant Genes
- t-test for each gene
- Tests whether the difference between the mean of
the query and reference groups are the same - Essentially measures signal-to-noise
- Calculate p-value (permutations or distributions)
- May suffer from intensity-dependent effects
140t-tests
141T-Tests (Between Subjects or unpaired) - 1
- Assign experiments to two groups, e.g., in the
expression matrix below, assign Experiments 1, 2
and 5 to group A, and experiments 3, 4 and 6 to
group B.
2. Question Is mean expression level of a gene
in group A significantly different from mean
expression level in group B?
142T-TEST - 2
3. Calculate t-statistic for each gene
4. Calculate probability value of the
t-statistic for each gene either from A.
Theoretical t-distribution OR B.
Permutation tests.
143T-TEST - 3
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
144T-TEST - 4
Permutation tests - continued
iii) Compute t-statistic for the randomized
gene iv) Repeat steps i-iii n times (where n is
specified by the user). v) Let x the number of
times the absolute value of the original
t-statistic exceeds the absolute values of the
randomized t-statistic over n randomizations. vi)
Then, the p-value associated with the gene 1
(x/n)
145T-TEST - 5
- 5. Determine whether a genes expression levels
are significantly different between the two
groups by one of three methods - Just alpha (a significance level) If the
calculated p-value for a gene is less than or
equal to the user-input a (critical p-value), the
gene is considered significant. - OR
- Use Bonferroni corrections to reduce the
probability of erroneously classifying
non-significant genes as significant. - B) Standard Bonferroni correction The user-input
alpha is divided by the total number of genes to
give a critical p-value that is used as above gt
pcritical a/N.
146T-TEST 6
5C) Adjusted Bonferroni i) The t-values for
all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the
critical p-value becomes (a/N), where N is the
total number of genes for the gene with the
second-highest t-value, the critical p-value
will be (a/N-1), and so on.
147TTEST 1-class (or One-sample t-test) - 1
- Used to test if the the mean expression of a gene
over all experiments is different from a
hypothesized mean.
2. Question Is the mean of the values of a given
gene vector significantly different from a
hypothesized mean?
148TTEST- 1 Class - 2
3. Often, the hypothesized mean in gene
expression studies is zero, meaning that we are
looking for genes whose mean log2 ratio across
all experiments is significantly different from
zero, i.e., 4. Using 1-sample t-tests, we can
select genes which, on average, show differential
expression across all experiments (since genes
with no differential expression should have a
mean log2 ratio of zero across all expts). 5.
Calculate t-value, where Observed mean
of gene vector Hypothesized mean of gene
vector t ------------------------------------
------------------------------------------ Stand
ard error of the mean of the gene vector
149TTEST 1 class - 3
6. Calculate p-value from a theoretical
t-distribution, OR 7. By permutation 7a.
Randomly pick some elements of the gene vector,
and change their values,such that the new value
of the changed element is original value 2
x (original value - hypothesized mean)
(i.e., flip the elements deviation around the
hypothesized mean) Thus, if the original gene
values are and the hypothesized mean is
zero, then the randomized gene values could
be
150TTEST 1 class - 4
7b. Calculate t-value from the randomized
gene 7c. Repeat 7a and 7b as many times as
desired. If all permutations are chosen, then
every possible combination of elements in the
gene vector is chosen for flipping. 7d. The
p-value 1 (the proportion of times that the
original absolute t-value exceeds the randomized
absolute t-value over all the permutations
conducted). 8. If a genes p-value is less than
or equal to the user-specified critical p-value,
the genes mean expression over all experiments
is significantly different from the hypothesized
mean. 9. Bonferroni and adjusted Bonferroni
corrections may be applied just as in the
two-sample t-test.
151Finding Significant Genes
- Volcano Plots
- Combines p-values and fold change measures
- Significant genes appear in upper corners
152Finding Significant Genes
- Analysis of Variation (ANOVA)
- Which genes are most significant for separating
classes of samples? - Calculate p-value (permutations or distributions)
- Reduces to a t-test for 2 samples
- May suffer from intensity-dependent effects
153One Way Analysis of Variance (ANOVA)
- Assign experiments to gt 2 groups
2. Question Is mean expression level of a gene
the same across all groups?
154ANOVA - 2
3. Calculate an F-ratio for each gene,
where Mean square (groups) F
----------------------------------, which is a
measure of Mean square (error) Between
groups variability -----------------------------
--------- Within groups variability The larger
the value of F, the greater the difference among
the group means relative to the sampling error
variability (which is the within groups
variability). i.e., the larger the value of F,
the more likely it is that the differences among
the group means reflect real differences among
the means of the populations they are drawn
from, rather than being due to random sampling
error.
155 ANOVA - 3 4. The p-value associated with an
F-value is the probability that an F-value that
large would be obtained if there were no
differences among group means (i.e., given the
null hypothesis). Therefore, the smaller the
p-value, the less likely it is that the null
hypothesis is valid, i.e., the differences among
group means are more likely to reflect real
population differences as p-values decrease in
magnitude.
156- ANOVA - 4
- 5. P-values can be obtained for the F-values from
a theoretical F-distribution, assuming that the
populations from which the data are obtained - are normally distributed, and
- have homogeneous variances.
The test is considered robust to violations of
these assumptions, provided sample sizes are
relatively large and similar across groups.
157 ANOVA 5 6. P-values can be obtained from
permutation tests (just like in t-tests), if one
does not want to rely on the assumptions needed
for using the F-distribution. P-values can also
be corrected for multiple comparisons (using
Bonferroni or other procedures).
158Two-factor ANOVA (TFA)
Can be used to find genes whose expression is
significantly different over two factors (e.g.,
sex and strain), as well as to look for genes
with a significant interaction for these two
factors.
Strain B
Strain C
Strain A
Male
Female
159TFA - 2
160TFA - 3
- Ideally, design should be balanced, i.e., equal
numbers of samples in each factor A factor B
combination. - If unbalanced, the analysis can still be
conducted, but F-tests will be somewhat biased.
May need to use smaller p-values. - Can have balanced designs with no replication
(see below). In this case, interaction cannot be
tested..
161Finding Significant Genes
- Significance Analysis of Microarrays (SAM)
- Uses a modified t-test by estimating and adding a
small positive constant to the denominator - Significant genes are those which exceed the
expected values from permutation analysis.
162SAM test Statistic
- di Score
- si Standard Deviation
- s0 Safety Factor
163SAM Variance Estimate
- Gene by gene variance estimate safety factor
- Variance equal in the two conditions
- s0 term is here to deal with cases when variance
estimates gets too close to zero
- How to choose s0?
- Test statistics are binned in 100 different group
depending on the si value - s0 is chosen so that the dispersion of the test
statistic does not vary from bin to bin - avoids aberrant values when variance estimates
close to 0
164SAM Hypothesis Testing
- Permutation technique
- Multiple testing adjustment technique
- False Discovery Rate
165Confidence Level False Discovery Rate
- Fix a threshold DELTA for differentially
expressed genes - For each permutation, count how many genes you
declare differentially expressed - NB In a permutation you should find 0 genes.
- Compute median number of falsely called genes in
permutations - False Discovery Rate is number of falsely called
genes divided by number of differential expressed
genes in original data
FDR percentage of NON-significant genes you can
expect to find in your result list
166SAM
- SAM gives estimates of the False Discovery Rate
(FDR), which is the proportion of genes likely to
have been wrongly identified by chance as being
significant. - It is a very interactive algorithm allows users
to dynamically change thresholds for significance
(through the tuning parameter delta) after
looking at the distribution of the test
statistic. - The ability to dynamically alter the input
parameters based on immediate visual feedback,
even before completing the analysis, helps make
the data-mining process sensitive.
167Significance analysis of microarrays (SAM)
- SAM can be used to identify significant genes
based on differential expression between sets of
samples. - Currently implemented for the following designs
- two-class unpaired
- two-class paired
- multi-class
- censored survival
- one-class
168SAM designs
- Two-class unpaired to pick out genes whose mean
expression level is significantly different
between two groups of samples (analogous to
between subjects t-test). - Two-class paired samples are split into two
groups, and there is a 1-to-1 correspondence
between an sample in group A and one in group B
(analogous to paired t-test).
169SAM designs - 2
- Multi-class identifies genes whose mean
expression is different across gt 2 groups of
samples (analogous to one-way ANOVA) - Censored survival finds genes whose expression
levels are correlated with duration of survival
using Cox-regression. - One-class selects genes whose mean expression
across experiments is different from a
user-specified mean.
170SAM Two-Class 1
- Assign experiments to two groups, e.g., in the
expression matrix below, assign Experiments 1, 2
and 5 to group A, and experiments 3, 4 and 6 to
group B.
2. Question Is mean expression level of a gene
in group A significantly different from mean
expression level in group B?
171SAM Two-Class 2
Permutation tests
i) For each gene, compute d-value (analogous to
t-statistic). This is the observed d-value for
that gene.
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
Compute the observed d-value for each randomized
gene
172SAM Two-Class 3
- iii) Repeat step (ii) many times, so that each
gene has many randomized d-values. Take the
average of the randomized d-values for each gene.
This is the expected d-value of that gene. - iv) Plot the observed d-values vs. the expected
d-values
173SAM Two-Class 4
Significant positive genes (i.e., mean
expression of group B gt mean expression of
group A) in red
The more a gene deviates from the observed
expected line, the more likely it is to be
significant. Any gene beyond the first gene in
the ve or ve direction on the x-axis (including
the first gene), whose observed exceeds the
expected by at least delta, is considered
significant.
174SAM Two-Class 5
- For each permutation of the data, compute the
number of positive and negative significant genes
for a given delta as explained in the previous
slide. The median number of significant genes
from these permutations is the median False
Discovery Rate. - The rationale behind this is, any genes
designated as significant from the randomized
data are being picked up purely by chance (i.e.,
falsely discovered). Therefore, the median
number picked up over many randomizations is a
good estimate of false discovery rate.
175SAM Two-Class Paired
- Samples fall into two groups
- Each member of group A is associated with a
member of group B in a 1-to-1 relationship
A-B pair
176SAM Two-Class Paired - 2
- e.g., groups A and B could respectively
represent before and after a drug treatment,
and each A-B pair of samples could come from the
same patient before and after the treatment. - or, groups A and B could represent two strains
for which samples were collected at the several
time points over a time course study. A sample
collected from each of strain A and B at the same
time point could form an AB pair.
- The rest of the analysis is similar to two-class
unpaired SAM. Positive significant genes are
those for which Mean(Group B) is significantly
larger than Mean (Group A), and reverse is true
for negative significant genes
177SAM Multi-Class
- Extension of SAM two -class unpaired to more
than 2 groups - Experiments belong to one of at least three
groups - Analogous to one-way between subjects ANOVA
178SAM Multi-Class - 2
- This analysis yields only positive significant
genes - These are genes whose means are significantly
different across some combination of the groups
of experiments.
179SAM Censored Survival
- Each experiment (sample) is associated with an
observation time, and a state at the time of
observation. - The state is either dead or censored
- Censored means that the subject survived
beyond the time point at which the sample was
taken. - A positive score means that a higher expression
level for that gene implies shorter survival
(i.e., higher risk), whereas a negative score
means that higher expression implies longer
survival.
180Finding Patterns in the Data
181Hierarchical Clustering
1. Calculate the distance between all genes. Find
the smallest distance. If several pairs share the
same similarity, use a predetermined rule to
decide between alternatives.
2. Fuse the two selected clusters to produce a
new cluster that now contains at least two
objects. Calculate the distance between the new
cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single
cluster remains.
4. Draw a tree representing the results.
182Hierarchical Clustering
(HCL-2)
183Hierarchical Clustering
(HCL-3)
184Hierarchical Tree
(HCL-4)
185Agglomerative Linkage Methods
- Linkage methods are rules or metrics that return
a value that can be used to determine which
elements (clusters) should be linked. - Three linkage methods that are commonly used are
- Single Linkage
- Average Linkage
- Complete Linkage
(HCL-6)
186Single Linkage
Cluster-to-cluster distance is defined as the
minimum distance between members of one cluster
and members of the another cluster. Single
linkage tends to create elongated clusters with
individual genes chained onto clusters. DAB
min ( d(ui, vj) ) where u Î A and v Î B for all
i 1 to NA and j 1 to NB
DAB
(HCL-7)
187Average Linkage
Cluster-to-cluster distance is defined as the
average distance between all members of one
cluster and all members of another cluster.
Average linkage has a slight tendency to produce
clusters of similar variance. DAB 1/(NANB) S
S ( d(ui, vj) ) where u Î A and v Î B for all
i 1 to NA and j 1 to NB
DAB
(HCL-8)
188Complete Linkage
Cluster-to-cluster distance is defined as the
maximum distance between members of one cluster
and members of the another cluster. Complete
linkage tends to create clusters of similar size
and variability. DAB max ( d(ui, vj) ) where
u Î A and v Î B for all i 1 to NA and j 1 to
NB
DAB
(HCL-9)
189Comparison of Linkage Methods
Average
Single
Complete
(HCL-10)
190Bootstrapping
Bootstrapping resampling with replacement
Original expression matrix
Various bootstrapped matrices (by experiments)
191Jackknifing
Jackknifing resampling without replacement
Original expression matrix
Various jackknifed matrices (by experiments)
192Analysis of bootstrapped and jackknifed support
trees
- Bootstrapped or jackknifed expression matrices
are created many times by randomly resampling the
original expression matrix, using either the
bootstrap or jackknife procedure. - Each time, hierarchical trees are created from
the resampled matrices. - The trees are compared to the tree obtained from
the original data set. - The more frequently a given cluster from the
original tree is found in the resampled trees,
the stronger the support for the cluster. - As each resampled matrix lacks some of the
original data, high support for a cluster means
that the clustering is not biased by a small
subset of the data.
193Self Organizing Tree Algorithm
SOTA - 1
- Dopazo, J. , J.M Carazo, Phylogenetic
reconstruction using and unsupervised growing
neural network that adopts the topology of a
phylogenetic tree. J. Mol. Evol. 44226-233,
1997. - Herrero, J., A. Valencia, and J. Dopazo. A
hierarchical unsupervised growing neural network
for clustering gene expression patterns.
Bioinformatics, 17(2)126-136, 2001.
194SOTA Characteristics
SOTA - 2
- Divisive clustering, allowing high level
hierarchical structure to be revealed without
having to completely partition the data set down
to single gene vectors - Data set is reduced to clusters arranged in a
binary tree topology - The number of resulting clusters is not fixed
before clustering - Neural network approach which has advantages
similar to SOMs such as handling large data sets
that have large amounts of noise
195SOTA Topology
SOTA - 3
Centroid Vector
ap
Members
196Adaptation Overview
SOTA - 4
- Each gene vector associated with the parent is
compared to the centroid vector of its offspring
cells. - The most similar cells centroid and its
neighboring cells are adapted using the
appropriate migration weights.
197- Following the presentation of all genes to the
system a measure of system diversity is used to
determine if training has found an optimal
position for the offspring. - If the system diversity improves (decreases)
then another training epoch is started otherwise
training ends and a new cycle starts with a cell
division.
SOTA - 5
198SOTA - 6
The most diverse cell is selected for division
at the start of the next training cycle.
199Growth Termination
SOTA - 7
Expansion stops when the most diverse cells
diversity falls below a threshold.
200SOTA - 8
Each training cycle ends when the overall tree
diversity stabilizes. This triggers a cell
division and possibly a new training cycle.
201K-Means/Medians Clustering 1
202K-Means/Medians Clustering 2
3. Calculate mean/median expression profile of
each cluster.
4. Shuffle genes among clusters such that each
gene is now in the cluster whose mean expression
profile (calculated in step 3) is the closest to
that genes expression profile.
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached.
k-means is most useful when the user has an a
priori hypothesis about the number of clusters
the genes should belong to.
203K-Means / K-Medians Support (KMS)
- Because of the random initialization of
K-Means/K-Means, clustering results may vary
somewhat between successive runs on the same
dataset. KMS helps us validate the clustering
results obtained from K-Means/K-Medians. - Run K-Means / K-Medians multiple times.
- The KMS module generates clusters in which the
member genes frequently group together in the
same clusters (consensus clusters) across
multiple runs of K-Means / K-Medians. - The consensus clusters consist of genes that
clustered together in at least x of the K-Means
/ Medians runs, where x is the threshold
percentage input by the user.
204Self-organizing maps (SOMs) 1
1. Specify the number of nodes (clusters)
desired, and also specify a 2-D geometry for the
nodes, e.g., rectangular or hexagonal
205SOMs 2
2. Choose a random gene, e.g., G9
3. Move the nodes in the direction of G9. The
node closest to G9 (N2) is moved the most, and
the other nodes are moved by smaller varying
amounts. The farther away the node is from N2,
the less it is moved.
206SOMs 3
4. Steps 2 and 3 (i.e., choosing a random gene
and moving the nodes towards it) are repeated
many (usually several thousand) times. However,
with each iteration, the amount that the nodes
are allowed to move is decreased.
5. Finally, each node will nestle among a
cluster of genes, and a gene will be considered
to be in the cluster if its distance to the node
in that cluster is less than its distance to any
other node.
207SOM Neighborhood Options
Gaussian Neighborhood
Bubble Neighborhood
radius
G7
G7
G8
G8
G9
G9
G10
G10
G11
G11
N1
N2
N1
N2
N3
N4
N3
N4
N5
N6
N5
N6
Some move, alpha is constant.
All move, alpha is scaled.
208Template Matching
- Template matching allows one to find expression
vectors which match a provided template - A template can be derived from
- a gene known to be central to the area of study
- a sample or set of samples of a particular type
- a cluster with a mean pattern of interest
- a pattern constructed to reveal trends based on
knowledge of the experimental design
209PTM-2
- Sometimes it is useful to identify elements that
have complementary patterns by selecting to use
the absolute value of r.
210Gene Shaving
Results in a series of nested clusters
Choose cluster of appropriate size as determined
by gap statistic calculation
Repeat until only one gene remains
Orthogonalize expression matrix with respect to
the average gene in the cluster and repeat
shaving procedure
211Gene Shaving
The final cluster contains a set of genes that
are greatly affected by the experimental
conditions in a similar way.
Create random permutations of the expression
matrix and calculate R2 for each
Compare R2 of each cluster to that of the entire
expression matrix
Choose the cluster whose R2 is furthest from the
average R2 of the permuted expression matrices.
212Relevance Networks
Set of genes whose expression profiles are
predictive of one another.
Can be used to identify negative correlations
between genes.
213Relevance Networks
The remaining relationships between genes define
the subnets
214Principal Components Analysis
- PCA simplifies the views of the data.
- Suppose we have measurements for each gene on
multiple experiments. - Suppose some of the experiments are correlated.
- PCA will ignore the redundant experiments, and
will take a weighted average of some of the
experiments, thus possibly making the trends in
the data more interpretable. - 5. The