Title: Microarray analysis
1Microarray analysis
- Quantitation of Gene Expression
- Expression Data to Networks
BIO520 Bioinformatics Jim Lund
2Microarray data
- Image quantitation.
- Normalization
- Find genes with significant expression
differences
- Annotation
- Clustering, pattern analysis, network analysis
3Sources of Non-Biological Variation
- Dye bias: differences in heat and light
sensitivity, efficiency of dye incorporation
- Differences in the amount of labeled cDNA
hybridized to each channel in a microarray
experiment (Channel is used to refer to a
combination of a dye and a slide.)
- Variation across replicate slides
- Variation across hybridization conditions
- Variation in scanning conditions
- Variation among technicians doing the lab work.
4Factors which impact on the signal level
- Amount of mRNA
- Labeling efficiencies
- Quality of the RNA
- Laser/dye combination
- Detection efficiency of photomultiplier or CCD
5Hela HepG2
6Hela HepG2
7M vs. A Plot
M = log(Red) - log(Green)
A = (log(Green) + log(Red)) / 2
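A minimal sketch of the M and A definitions above. Base-2 logarithms are an assumption (the slide only says "Log"), and the spot intensities are made up for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities.

    M = log2(red) - log2(green)          (log ratio)
    A = (log2(green) + log2(red)) / 2    (average log intensity)
    Base 2 is an assumption; the slide does not name the base.
    """
    log_r = math.log2(red)
    log_g = math.log2(green)
    return log_r - log_g, (log_g + log_r) / 2

# Hypothetical (red, green) intensities, for illustration only.
spots = [(1024.0, 256.0), (500.0, 500.0), (64.0, 256.0)]
for red, green in spots:
    m, a = ma_values(red, green)
    print(f"red={red:7.1f} green={green:7.1f}  M={m:+.2f}  A={a:.2f}")
```

Plotting M against A for every spot gives the M vs. A plot; spots with equal signal in both channels sit on M = 0.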
8M v A plots of chip pairs before normalization
9M v A plots of chip pairs after quantile
normalization
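Quantile normalization is not spelled out on the slide, so the following is a rough sketch of the usual idea: replace each value by the mean, across chips, of the values at the same rank. The function name and the two small chips are hypothetical, and ties are broken by position, which is a simplification.

```python
def quantile_normalize(chips):
    """Quantile-normalize chips (a list of equal-length intensity lists).

    Each value is replaced by the mean, across chips, of the values that
    share its rank, so every chip ends up with the same distribution.
    Ties are broken by position (a simplification).
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Mean intensity at each rank across all chips.
    rank_means = [sum(s[i] for s in sorted_chips) / len(chips) for i in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities for four genes on two chips.
chip_a = [5.0, 2.0, 3.0, 4.0]
chip_b = [4.0, 1.0, 4.5, 2.0]
result = quantile_normalize([chip_a, chip_b])
print(result)
```

After normalization both chips contain the same set of values, only assigned to different genes according to each gene's rank on its own chip.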
10Types of normalization
- To total signal (linear normalization)
- LOESS (LOcally WEighted polynomial regreSSion).
- To housekeeping genes
- To genomic DNA spots (Research Genetics) or mixed
cDNAs
- To internal spikes
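As an illustration of the first item in the list (normalization to total signal), here is a small sketch that rescales one channel so both channels have the same summed intensity. The function name and the intensities are hypothetical; the other methods in the list are not shown.

```python
def normalize_to_total_signal(red, green):
    """Scale the green channel so its total signal matches the red channel.

    Returns (scaled_green, factor). This is a single linear scaling factor,
    so it cannot correct intensity-dependent effects.
    """
    factor = sum(red) / sum(green)
    return [g * factor for g in green], factor

# Hypothetical intensities: green is uniformly half as bright as red.
red = [100.0, 200.0, 300.0, 400.0]
green = [50.0, 100.0, 150.0, 200.0]
scaled_green, factor = normalize_to_total_signal(red, green)
print(factor, scaled_green)
```

Because a single factor is applied to every spot, this only removes an overall brightness difference between channels; intensity-dependent bias is the kind of effect LOESS is used for.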
11Fold change: the crudest method of finding
differentially expressed genes
[Figure: Hela vs. HepG2; labels mark >2-fold expression change]
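A fold-change filter can be sketched in a few lines. The gene names and expression values below are invented, and the choice to compare on a log2 scale is an assumption made so that up- and down-regulation are treated symmetrically.

```python
import math

def fold_change_calls(hela, hepg2, threshold=2.0):
    """Call each gene 'up', 'down', or 'unchanged' in HepG2 relative to Hela.

    A gene is called only if its change is strictly greater than
    `threshold`-fold in either direction.
    """
    log_cut = math.log2(threshold)
    calls = {}
    for gene in hela:
        log_ratio = math.log2(hepg2[gene] / hela[gene])
        if log_ratio > log_cut:
            calls[gene] = "up"
        elif log_ratio < -log_cut:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls

# Hypothetical normalized expression values.
hela = {"geneA": 100.0, "geneB": 200.0, "geneC": 800.0, "geneD": 100.0}
hepg2 = {"geneA": 450.0, "geneB": 210.0, "geneC": 150.0, "geneD": 200.0}
print(fold_change_calls(hela, hepg2))
```

The filter uses no replicate information at all, which is why the slide calls it the crudest method: a noisy gene and a reproducible gene with the same ratio get the same call.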
12What do we mean by differentially expressed?
- Statistically, our gene is different from the
other genes.
[Figure: Distribution of average ratios for all genes; x-axis: log ratio, y-axis: number of genes]
13Finding differentially expressed genes: What
affects our certainty that a gene is up- or
down-regulated?
- Number of sample points
- Difference in means
- Standard deviations of sample
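One common way to combine these three quantities is a t-statistic. The sketch below uses Welch's form (unequal variances), which is a choice made here for illustration and is not named on the slide; all log-ratio values are hypothetical.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t-statistic: difference in means over its standard error."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_err

# Hypothetical log expression values for one gene in two conditions.
treated = [1.0, 1.2, 0.8]
control = [0.0, 0.2, -0.2]

t_base = welch_t(treated, control)
# Same means and spread, but twice as many sample points.
t_more = welch_t(treated * 2, control * 2)
# Same means and sample count, but larger standard deviations.
t_noisy = welch_t([1.0, 1.6, 0.4], [0.0, 0.6, -0.6])

print(f"base={t_base:.2f} more_points={t_more:.2f} noisier={t_noisy:.2f}")
```

With the difference in means held fixed, more sample points raise the statistic and larger standard deviations lower it, which is the dependence the three bullets describe.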
14Practical views on statistics
- With appropriate biological replicates, it is
possible to select statistically meaningful
genes/patterns.
- Sensitivity and selectivity are inversely
related - e.g. increased selection of true
positives WILL result in more false positives and
fewer false negatives.
- False negatives are lost opportunities; false
positives cost money and waste time.
- A typical set of experiments treated with
conservative statistics still yields more
genes/pathways/patterns than one can sensibly
follow - so use conservative statistics to
protect against false positives when designing
follow-on experiments.
15Statistical Tests
- Student's t-test
- Correct for multiple testing! (Holm-Bonferroni)
- False discovery rate.
- Significance Analysis of Microarrays (SAM)
- http://www-stat.stanford.edu/~tibs/SAM/
- ANOVA
- Principal components analysis
- Special methods for periodic patterns in data.
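To make the multiple-testing items concrete, here is a sketch of the Holm-Bonferroni step-down procedure next to a false discovery rate procedure. The slide does not say which FDR method is meant; Benjamini-Hochberg is assumed here. The p-values are invented.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    The smallest p-value is compared to alpha/m, the next to alpha/(m-1),
    and so on; testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the gene is called significant.

    Finds the largest rank k with p_(k) <= (k/m) * fdr and rejects the
    k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject

# Hypothetical p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
print(holm_bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On these four p-values Holm-Bonferroni keeps two genes while the FDR procedure keeps all four, a small example of the trade-off between false positives and false negatives.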
16Volcano plot: log(expr) vs p-value
[Figure: volcano plot; axes: log(fold change) and p-value]
17Scatter plot showing genes with significant
p-values
18Pattern finding
- In many cases, the patterns of differential
expression are the target (as opposed to specific
genes)
- Clustering or other approaches for pattern
identification - find genes which behave
similarly across all experiments or experiments
which behave similarly across all genes
- Classification - identify genes which best
distinguish 2 or more classes.
- The statistical reliability of the pattern or
classifier is still an issue and similar
considerations apply - e.g. cluster analysis of
random noise will produce clusters which will be
meaningless.
19What is clustering?
- Group similar objects together.
- Genes with similar expression patterns.
- Objects in the same cluster (group) are more
similar to each other than objects in different
clusters.
20Clustering
- What is clustering?
- Similarity/distance metrics
- Hierarchical clustering algorithms
- Made popular by Stanford, i.e. Eisen et al. 1998
- K-means
- Made popular by many groups, e.g. Tavazoie et al.
1999
- Self-organizing map (SOM)
- Made popular by Whitehead, i.e. Tamayo et al.
1999
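Of the algorithms listed, K-means is the shortest to write down. The sketch below alternates an assignment step and a centroid update step; it seeds the centroids with the first k profiles, which is a simplification for reproducibility (real tools usually pick random or smarter starting points). The expression profiles are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Cluster points (lists of floats) into k groups with plain K-means.

    Returns (assignment, centroids). Centroids are seeded with the first
    k points, a simplification chosen here for determinism.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical log-ratio profiles for four genes across three experiments.
profiles = [
    [1.0, 2.0, 3.0],
    [-1.0, -2.0, -3.0],
    [1.1, 2.1, 2.9],
    [-0.9, -2.2, -3.1],
]
assignment, centroids = kmeans(profiles, k=2)
print(assignment)
```

The number of clusters k has to be chosen in advance, and the algorithm will return k clusters even from random noise, so the caution about statistical reliability still applies.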
21Typical Tools
- Expression NTI
- GeneSpring
- Affymetrix GeneChip Operating System (GCOS)
- Cluster/Treeview
- R statistics package with microarray analysis
libraries.
22How to define similarity?
[Figure: raw matrix X (genes 1..n by experiments 1..p) is converted to a similarity matrix Y (genes by genes, n x n)]
- Similarity metric
- A measure of pairwise similarity or
dissimilarity
- Examples
- Correlation coefficient
- Euclidean distance
23Similarity metrics
- Euclidean distance
- Correlation coefficient
Euclidean clustering: magnitude and direction
Correlation clustering: direction
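The difference between the two metrics is easy to see on a toy example. The three profiles below are invented: gene_b follows gene_a at ten times the magnitude, and gene_c has the same magnitude as gene_a but the opposite trend.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles across four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same direction as gene_a, 10x magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # same magnitude as gene_a, opposite direction

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    d = euclidean_distance(gene_a, other)
    r = pearson_correlation(gene_a, other)
    print(f"gene_a vs {name}: distance={d:.2f} correlation={r:+.2f}")
```

Euclidean distance puts gene_a closer to gene_c because their magnitudes match, while correlation groups gene_a with gene_b because their direction of change matches.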
24Sporulation-example
25Sporulation-example
26Self-organizing maps (SOM) Kohonen 1995
- Basic idea
- map high dimensional data onto a 2D grid of nodes
- Neighboring nodes are more similar than points
far away
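The basic idea can be sketched as a single SOM training step: find the node whose weight vector is closest to the sample (the best matching unit), then pull that node and its grid neighbors toward the sample. The Gaussian neighborhood, the learning rate, the radius, and the 2x2 grid are all assumptions made for illustration; a full SOM repeats this step over many samples while shrinking the rate and radius.

```python
import math

def som_train_step(grid, sample, learning_rate=0.5, radius=1.0):
    """Update a SOM in place for one sample and return the best matching unit.

    grid maps (row, col) node coordinates to weight vectors (lists of floats).
    Nodes closer to the best matching unit on the grid move more.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Best matching unit: node whose weights are closest to the sample.
    bmu = min(grid, key=lambda node: sq_dist(grid[node], sample))
    for node, weights in list(grid.items()):
        grid_dist_sq = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
        # Gaussian neighborhood: influence decays with distance on the grid.
        influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
        grid[node] = [
            w + learning_rate * influence * (s - w)
            for w, s in zip(weights, sample)
        ]
    return bmu

# Hypothetical 2x2 grid of nodes with 2-dimensional weights.
grid = {
    (0, 0): [0.0, 0.0],
    (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0],
    (1, 1): [1.0, 1.0],
}
bmu = som_train_step(grid, [0.9, 1.0])
print(bmu, grid[bmu])
```

Because neighbors are dragged along with the best matching unit, nodes that are adjacent on the grid end up representing similar expression profiles.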
27Self-organizing maps (SOM)
28SOM Clusters
29Inference
- NDT80 transcription factor
- Can account for control of many, not ALL, genes
with pattern
- How do we find the other factor(s)?
- Infer binding site
- DNA binding protein selection?
30Inferences from Expression
- Pathways not known to be involved
- Ontology?
- Novel genes involved in a known pathway
- like and unlike tissues
31Transcription Factors/Regulatory Networks
- ID co-regulated genes
- Search for common motifs
- Evaluate known motifs/factors
- Search for new ones.
- Programs: MEME, etc.
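A very rough sketch of the "search for common motifs" step: count which short words (k-mers) occur in the promoters of every co-regulated gene. This is far cruder than MEME and is not how MEME works; the promoter sequences and the shared word are invented for illustration.

```python
from collections import Counter

def shared_kmers(promoters, k=6):
    """Count, for each k-mer, how many promoter sequences contain it.

    Each sequence contributes at most 1 to a k-mer's count, so a count equal
    to len(promoters) means the k-mer occurs in every sequence.
    """
    counts = Counter()
    for seq in promoters:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

# Hypothetical promoter fragments from three co-regulated genes.
promoters = ["AATGACGTCC", "GGTGACGTAA", "CCCTGACGTT"]
counts = shared_kmers(promoters, k=6)
in_all = [kmer for kmer, n in counts.items() if n == len(promoters)]
print(in_all)
```

Exact word matching misses degenerate binding sites and ignores background frequencies, which is why dedicated motif-finding programs are used in practice.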
32mRNA-protein Correlation
- YPD should have relevant data
- will yeast be typical?
- Electrophoresis 18:533
- 23 proteins on 2D gels
- r = 0.48 for mRNA/protein
- Post-transcriptional and post-translational
regulation important!
33Drosophila Fusion Project
Lots of introns
- Exon GFP vector
- Good site?
- Fluorescent sort
- Image
Lynne Cooley
34Developmental Localization
35Other microarray formats
- Single nucleotide polymorphism (SNP) chips
- Oligos with each of 4 nt at each SNP.
- Chromosomal IP chips (ChIP-chip)
- Determine transcription factor binding sites
- Promoter DNA on the chip.
- Alternative splicing chips
- Long oligos, covering alternatively spliced
exons, or all exons.
- Genome tiling chips
36ChIP-chip: Identification of Transcription Factor
Binding Sites
- Cross link transcription factors to DNA with
formaldehyde
- Pull out transcription factor of interest via
immunoprecipitation with an antibody or by
tagging the factor of interest with an isolatable
epitope (e.g. GST fusion).
- Fractionate the DNA associated with the
transcription factor, reverse the cross links,
label and hybridize to an array of promoter DNA.
- Brown et al. (2001) Nature 409:533-8
37Analysis of TF Binding Sites
38On to Proteomics
DNA → RNA → Protein