Title: Pabio590B
1Pabio590B week 1Microarrays
- Overview
- Design hybridization
- Data analysis
2Overview
- Affix/synthesize probes of known sequence to chip
- Hybridize with labeled sample
- Quantify level of hybridization to each probe
- Normalization
- Statistics
- Clustering more
3Experiments you might do
- Measure RNA expression
- - Changes in gene expression over time / lifecycle
- - Compare differences between tissues/cell types
- - Comparisons between species/strains/conditions
- - Whole genome transcript mapping (tiling arrays)
- Measure DNA content
- - Presence or absence of region
- - Copy number via Comparative Genomic Hybridization
- - SNP Genotyping/Re-sequencing
- Other
- - ChIP on chip arrays
- - RIP on chip
4Microarray Design
- Affix/synthesize probes of known sequence to chip
- Hybridize with labeled sample
- Quantify level of hybridization to each probe
- Normalization
- Statistics
- Clustering more
5RNA Expression Chip Designs
- Expression Array
- - N number of probes per gene of interest
- - Trade-off between accuracy and number of
features
- Tiling array
- - Place probe of X nt every Y bases
- - Biased vs unbiased
6Probe considerations
- Number of probes per region of interest
- Specificity of probes
- Distance between probes (tiling)
- Mismatch probes (Affymetrix)
7Hybridization
- Affix/synthesize probes of known sequence to chip
- Hybridize with labeled sample
- Quantify level of hybridization to each probe
- Normalization
- Statistics
- Clustering more
8Two-color vs One-color
- Two-color
- - Two samples on each slide
- - Cy3 - green - 532 nm
- - Cy5 - red - 635 nm
- One-color
- - One sample per slide
- - Cy3
- No significant difference in accuracy or
reproducibility
9Designs for Two-color Array
10Data Normalization
- Affix/synthesize probes of known sequence to chip
- Hybridize with labeled sample
- Quantify level of hybridization to each probe
- Normalization
- Statistics
- Clustering more
11Within-Array Normalization
Lowess Normalization
[Figure: Cy3/Cy5 vs. signal intensity, before and after Lowess normalization]
12Between-Array Normalization
- RNA Spike-in
- Random Probes
- Median Scaling
- Quantile Scaling
Median and quantile normalization assume that the
arrays in question share the same underlying
distribution. Use these methods only if you can
safely assume that the bulk of genes have the same
expression across the arrays.
13Quantile Normalization
[Figure: arrays before and after quantile normalization]
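A minimal sketch of quantile normalization, assuming each array is a plain list of intensities for the same probes in the same order. The function name and the sample values are hypothetical. Each array is sorted, the values at each rank are averaged across arrays, and every probe receives the average for its rank, so all arrays end up with the same distribution.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists (one per array)."""
    n_probes = len(arrays[0])
    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[rank] for s in sorted_arrays) / len(arrays) for rank in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        # Ties are broken by probe position (a simplification).
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities: three arrays, four probes each.
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
```

After normalization every array contains the same set of values; only the assignment of values to probes differs.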
14Statistical Analysis
- Affix/synthesize probes of known sequence to chip
- Hybridize with labeled sample
- Quantify level of hybridization to each probe
- Normalization
- Statistics
- Clustering more
15Some Advice About Statistics
- Don't get too hung up on p-values or any other
stat.
- Ultimately what matters is biological relevance
and external knowledge and other heterogeneous
measures (related functions, pathways, other data
types) that are not easily measured by statistics
alone.
- P-values should help you evaluate the strength of
the evidence, rather than being used as an
absolute yardstick of significance.
- Statistical significance is not necessarily the
same as biological relevance and vice-versa.
John Quackenbush
16Is this gene differentially expressed between the
two conditions?
17To rephrase the question
- Is the mean probe value different between Samples
A and B?
- Null Hypothesis (H0): means are the same
- Alternate Hypothesis (Ha): means are different
18What affects our ability to test the hypothesis?
- Difference in means
- Number of sample points
- Standard deviations of sample
19The T-statistic
- Directly proportional to difference in means
- Inversely proportional to standard deviation
- Increases with sample size (proportional to its
square root)
The T-test calculates how likely a T-statistic at
least this extreme is, given the null hypothesis
that the means are actually the same.
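A small sketch of how the three factors combine, assuming the unequal-variance (Welch) form of the T-statistic. The sample values are hypothetical log2 intensities. The difference in means sits in the numerator; the standard deviations and sample sizes enter through the standard error in the denominator.

```python
import math
import statistics


def t_statistic(a, b):
    """Welch's (unequal-variance) T-statistic for two samples."""
    mean_diff = statistics.mean(a) - statistics.mean(b)
    # Standard error of the difference: shrinks as variance falls or n grows.
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return mean_diff / standard_error


sample_a = [8.1, 7.9, 8.4, 8.0]  # hypothetical probe values, condition A
sample_b = [6.9, 7.2, 7.0, 7.1]  # hypothetical probe values, condition B
t = t_statistic(sample_a, sample_b)
print(t)
```

Repeating the same measurements (a larger sample with a similar spread) gives a larger T-statistic for the same difference in means.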
20T-statistic and P-values
- P-values can be determined from theoretical
distributions or permutation testing
- Theoretical distributions rely on a set of
assumptions that array experiments do not
necessarily follow
- Permutation tests do not rely on distributional
assumptions, only on samples being exchangeable
under the null hypothesis
21Permutation Testing
1) Permute n times by random shuffling
2) Calculate T-statistic for each permutation
3) Calculate probability of original T-statistic
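The three steps can be sketched as follows, assuming a two-sided test, a Welch T-statistic, and shuffling of the pooled values between the two sample groups. The function names, the add-one correction, and the sample data are illustrative choices, not taken from the slides.

```python
import math
import random
import statistics


def t_statistic(a, b):
    """Welch's T-statistic for two samples."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # 1) Permute by random shuffling of the pooled values.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        # 2) Calculate the T-statistic for this permutation.
        if abs(t_statistic(perm_a, perm_b)) >= observed:
            hits += 1
    # 3) Fraction of permutations at least as extreme as the original.
    #    The add-one correction keeps the estimate above zero.
    return (hits + 1) / (n_permutations + 1)


sample_a = [8.1, 7.9, 8.4, 8.0]  # hypothetical probe values, condition A
sample_b = [6.9, 7.2, 7.0, 7.1]  # hypothetical probe values, condition B
p = permutation_p_value(sample_a, sample_b)
print(p)
```

With 4 values per group there are only 70 distinct ways to split the data, so a two-sided p-value cannot fall much below 2/70 (about 0.03) no matter how large the difference is.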
22Interpreting P-values
- T-test tests the null hypothesis that sample
means are equal
- Gene X has p-value of 5% from T-test
- - If Gene X is NOT differentially expressed, there
is a 5% chance of seeing a T-statistic this extreme
- - This is not the same as a 95% chance that it is
differentially expressed
- α = False Positive Rate = 5%
23T-Test Refinements
- Equal vs unequal variance of samples
- Equal vs unequal sample size
- Dependent vs independent samples
- CAVEAT
- - As sample sizes get smaller, the validity of
p-values calculated via permutation diminishes.
- - Microarrays typically have few probes per gene,
so sample size is smallish.
24Multiple Testing Problem
- If there is a 5% chance of a false positive in one
test, what happens when we are testing
10,000 genes?
- - The majority of those genes are not
differentially expressed, but
- - a 5% p-value cutoff means we should expect
about 500 false positives.
25Family-Wise Error Rate (FWER)
FWER is the probability of making one or more
false discoveries (type I errors) among all the
hypotheses when performing multiple pair-wise
tests.
- One comparison: FWER = p-value cutoff (α)
- 10,000 comparisons: FWER ≈ 1.0
That means that when making 10,000 comparisons
you are all but certain to make at least one error.
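A quick check of these numbers, assuming the tests are independent and every null hypothesis is true (both are simplifying assumptions). Under those assumptions the FWER for N tests at cutoff α is 1 - (1 - α)^N.

```python
def family_wise_error_rate(alpha, n_tests):
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 10, 100, 10_000):
    print(n_tests, family_wise_error_rate(0.05, n_tests))

# Expected number of false positives among 10,000 truly null genes.
expected_false_positives = 0.05 * 10_000
print(expected_false_positives)
```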
26Bonferroni Correction
- What if you want to keep the FWER at 5%?
- 0.05 / 10,000 = 0.000005 = 5e-6
- Only those genes with T-test p-value of < 5e-6
are called differentially expressed
- Leads to experiment-wide α of 0.05
The Standard Bonferroni correction is considered
very conservative
27Adjusted Bonferroni
- Rank all genes by ascending order of p-value
- Assign gene with smallest p-value a corrected
p-value cutoff of α / N (0.05/10,000)
- Assign gene with second smallest p-value a
corrected p-value cutoff of α / (N-1)
- Etc
The Adjusted Bonferroni correction is less
conservative
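A sketch comparing the two corrections on a handful of hypothetical p-values. The adjusted version here follows the usual Holm step-down procedure, which matches the ranking scheme above; the rule of stopping at the first gene that fails its cutoff is part of that procedure and is an assumption about what "Etc" covers.

```python
def bonferroni(p_values, alpha=0.05):
    """Standard Bonferroni: one cutoff of alpha / N for every gene."""
    cutoff = alpha / len(p_values)
    return [p < cutoff for p in p_values]


def adjusted_bonferroni(p_values, alpha=0.05):
    """Step-down: the k-th smallest p-value (k = 0, 1, ...) is compared
    to alpha / (N - k); testing stops at the first failure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for k, gene in enumerate(order):
        if p_values[gene] < alpha / (n - k):
            significant[gene] = True
        else:
            break
    return significant


p_values = [0.004, 0.011, 0.02, 0.03, 0.5]  # hypothetical, one per gene
print(bonferroni(p_values))
print(adjusted_bonferroni(p_values))
```

On this input the standard correction keeps only the first gene, while the adjusted correction also keeps the second, which illustrates why it is the less conservative of the two.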
28False Discovery Rate
- Measures the expected proportion of false
positives amongst discovered genes
- Factors affecting FDR
- - Proportion of actual differentially expressed
genes
- - Distribution of the true differences
- - Measurement variability
- - Sample size
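The slides do not name a specific FDR procedure. One widely used option is the Benjamini-Hochberg step-up procedure, sketched here as an illustration only. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a True/False call per gene, controlling FDR at the given level."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose p-value is <= (k / n) * fdr.
    last_pass = 0
    for rank, gene in enumerate(order, start=1):
        if p_values[gene] <= rank / n * fdr:
            last_pass = rank
    # Every gene ranked at or before that point is called a discovery.
    discovered = [False] * n
    for gene in order[:last_pass]:
        discovered[gene] = True
    return discovered


p_values = [0.004, 0.011, 0.02, 0.03, 0.5]  # hypothetical, one per gene
print(benjamini_hochberg(p_values))
```

Controlling the proportion of false discoveries rather than the chance of any false discovery generally lets more genes through than a Bonferroni-style cutoff.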
29Analysis of Variance (ANOVA)
- Microarray testing across 3 conditions
- Is a gene expressed equally across all
conditions?
- F-ratio for given gene X
- - (variability across conditions) / (variability
within conditions)
- Calculate p-value
- - Look up probability of F-ratio
- - Determine probability by permutation testing
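A sketch of the F-ratio for one gene, using the conventional one-way ANOVA definition: variability between conditions divided by variability within conditions, each scaled by its degrees of freedom. The p-value is estimated by shuffling values across conditions. The data and function names are hypothetical.

```python
import random
import statistics


def f_ratio(groups):
    """One-way ANOVA F-ratio: between-condition / within-condition variability."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = statistics.mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def permutation_p_value(groups, n_permutations=2000, seed=0):
    rng = random.Random(seed)
    observed = f_ratio(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Re-cut the shuffled values into groups of the original sizes.
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_ratio(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical values for gene X: three conditions, three replicates each.
groups = [[7.0, 7.2, 6.9], [8.1, 8.0, 8.3], [7.1, 7.0, 7.3]]
f = f_ratio(groups)
p = permutation_p_value(groups)
print(f, p)
```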
30Significance Analysis of Microarrays (SAM)
- Gene-specific T-tests
- Computes statistic (dj) for each gene j
- - measures the relationship between gene expression
and a response variable
- - describes and groups the data based on
experimental conditions
- uses non-parametric statistics
- repeated permutations are used to determine FDR
- Accounts for correlations in genes and avoids
parametric assumptions about the (normal vs
non-normal) distribution of individual genes
31Clustering
- Affix/synthesize probes of known sequence to chip
- Hybridize with labeled sample
- Quantify level of hybridization to each probe
- Normalization
- Statistics
- Clustering more
32Why do clustering?
- Identify groups of possibly co-regulated genes
(e.g. so you can look for common sequence motifs)
- Identify typical temporal or spatial gene
expression patterns (e.g. cell-cycle data)
- Arrange a set of genes in a linear order that is
at least not totally meaningless
33Can also cluster experiments
- Quality control
- - detect bad/outlying experiments
- Identify or categorize classes of biological
samples
- - sorting by tumor sub-type
34How do you cluster?
- Define a distance measure
- Group genes (or experiments) based on that measure
Objects are placed into groups. Objects within a
group are more similar to each other than objects
across groups.
In some cases groups are hierarchically organized
based on the intra-group similarity
35Distance Metrics
Correlation               Euclidean
Correlation (X,Y) = 1     Distance (X,Y) = 4
Correlation (X,Z) = -1    Distance (X,Z) = 2.83
Correlation (X,W) = 1     Distance (X,W) = 1.41
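A small sketch of the two metrics, assuming expression profiles are plain lists of numbers. The vectors below are hypothetical and are not the ones from the slide's figure. They show the same effect: two profiles with identical shape have a correlation of 1 even when they are far apart in Euclidean terms.

```python
import math
import statistics


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return covariance / norm


x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 4.0, 5.0, 6.0]  # same shape as x, shifted up by 2
z = [4.0, 3.0, 2.0, 1.0]  # mirror image of x

print(pearson_correlation(x, y), euclidean_distance(x, y))
print(pearson_correlation(x, z), euclidean_distance(x, z))
```

Correlation ignores the shift between x and y; Euclidean distance does not.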
36Clustering considerations
- Euclidean clustering
- - Magnitude and direction
- - At least 2 conditions
- Correlation clustering
- - Direction only
- - At least 3 conditions
Array data is noisy, so you probably need
multiple data points per condition
- Clustering methods
- Hierarchical
- Partitional
- Other
37Hierarchical clustering
Agglomerative, bottom-up method
- Initial state
- - each item is a cluster
- Iterate
- - join two most similar clusters
- Stop
- - when number of clusters reaches user-defined
value
38Linkage methods
Ways to determine cluster similarity
Single Link: Similarity of two most similar
members
Complete Link: Similarity of two least similar
members
Average Link: Average similarity of all members
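A compact sketch of agglomerative clustering with the three linkage methods, assuming Euclidean distance between points (smaller distance means more similar). Single link uses the closest pair of members, complete link the farthest pair, and average link the mean over all pairs. The function names and sample points are hypothetical, and the brute-force search is meant for small inputs only.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def cluster_distance(c1, c2, points, linkage):
    pair_distances = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(pair_distances)  # two most similar members
    if linkage == "complete":
        return max(pair_distances)  # two least similar members
    if linkage == "average":
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def hierarchical_cluster(points, n_clusters, linkage="average"):
    # Initial state: each item is its own cluster (stored as point indices).
    clusters = [[i] for i in range(len(points))]
    # Stop when the number of clusters reaches the user-defined value.
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Join the two most similar clusters.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
clusters = hierarchical_cluster(points, n_clusters=3, linkage="complete")
print(clusters)
```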
39Comparing linkage methods
40Partitional (K-means) clustering
Divisive, top-down method
1) Partition data into K random clusters
2) Assign each point to nearest cluster (by centroid)
3) Calculate centroid of each cluster
4) GOTO step 2 (until assignments stop changing)
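A minimal K-means sketch along these lines, assuming Euclidean distance and points given as tuples. Centroids are computed from the random partition before the first assignment, and the loop ends when assignments stop changing or after a fixed number of iterations. The stopping rule, the handling of empty clusters, and the sample data are assumptions added for illustration.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Partition the data into K random clusters (each starts non-empty).
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    for _ in range(max_iterations):
        # Calculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:  # re-seed an empty cluster with a random point
                members = [rng.choice(points)]
            centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        # Assign each point to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels, centroids = k_means(points, k=2)
print(labels)
```

The result depends on the random starting partition, so real analyses often repeat the run with several seeds.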
41Other methods
- Support Vector Machines (SVM)
- K-nearest Neighbor (KNN)
- Self Organizing Maps (SOM)
- Self Organizing Tree Algorithm (SOTA)
- Cluster Affinity Search Technique (CAST)
- QT Cluster (QTC)
- Discriminant Analysis Classifier (DAM)
- Principal Component Analysis (PCA)
- Etc.
42Warnings and Limitations
- Clusters are like statistics
- - Ideally they mirror reality, but they should
only be taken seriously in conjunction with
confirmatory data from other sources.
- Clustering software clusters things
- - If you tell it to find 4 clusters, it will find
4 clusters in anything!
- Garbage In, Garbage Out
- - Clustering typically relies on a set of input
parameters that can be hard to evaluate except
by empirically evaluating the outputs for a
given set of input parameters.
43Clusters Interpretation - EASE (Expression
Analysis Systematic Explorer)
Population size: 40 genes
Cluster size: 12 genes
10 genes, shown in green, have a common
biological theme and 8 occur within the cluster
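To get a feel for how surprising 8 of 10 themed genes in a 12-gene cluster is, here is a plain hypergeometric tail probability for those numbers. This is an illustrative calculation under the assumption that the cluster is a random draw from the 40 genes; it is not EASE's own scoring code, and the function name is hypothetical.

```python
from math import comb


def hypergeometric_tail(population, theme_total, cluster_size, theme_in_cluster):
    """P(at least theme_in_cluster themed genes in a random cluster of this size)."""
    upper = min(theme_total, cluster_size)
    total = comb(population, cluster_size)
    favourable = sum(
        comb(theme_total, k) * comb(population - theme_total, cluster_size - k)
        for k in range(theme_in_cluster, upper + 1)
    )
    return favourable / total


p = hypergeometric_tail(population=40, theme_total=10, cluster_size=12, theme_in_cluster=8)
print(p)
```

By chance alone a 12-gene cluster would be expected to contain about 3 of the 10 themed genes (12 x 10 / 40), so 8 is a strong over-representation.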
44Microarray Analysis Software
- TIGR MEV
- Limma
- SAM
- EDGE
- These software packages are free and open-source
- Each has different strengths/weaknesses and makes
different assumptions about your data
45 Analysis Platforms
- Gene Sifter
- Rosetta Resolver
- Bio Discovery
46Microarray Data Sources
- Gene Expression Omnibus (NCBI)
- ArrayExpress (EBI)
- Stanford Microarray Database
- Yale Microarray Database
47Microarray Data Standards
- Microarray Gene Expression Data Society (MGED)
- MIAME
- MAGE-OM
- MAGE-ML
- RNA Abundance Database (RAD)
- Integrating data from various types of expression
experiments