Title: Microarray Data Analysis
1Microarray Data Analysis
- Stuart M. Brown
- NYU School of Medicine
2The Central Dogma of Molecular Biology
DNA is transcribed into RNA, which is then translated into protein
Measured by Microarray
3What is a Microarray?
- A simple concept: Dot Blot + Northern
- Reverse the hybridization - put the probes on the filter and label the bulk RNA
- Make probes for lots of genes - a massively parallel experiment
- Make it tiny so you don't need so much RNA from your experimental cells
- Make quantitative measurements
4Microarrays are Popular
- At NYU Med Center we are now collecting about 3 GB of microarray data per week (60 chips, 6-10 different experiments)
- PubMed search "microarray": 13,948 papers
- 2005: 4406
- 2004: 3509
- 2003: 2421
- 2002: 1557
- 2001: 834
- 2000: 294
5A Filter Array
6DNA Chip Microarrays
- Put a large number (100K) of cDNA sequences or synthetic DNA oligomers onto a glass slide (or other substrate) in known locations on a grid
- Label an RNA sample and hybridize
- Measure amounts of RNA bound to each square in the grid
- Make comparisons
- Cancerous vs. normal tissue
- Treated vs. untreated
- Time course
- Many applications in both basic and clinical research
7cDNA Microarray Technologies
- Spot cloned cDNAs onto a glass microscope slide
- usually PCR amplified segments of plasmids
- Label 2 RNA samples with 2 different colors of fluorescent dye - control vs. experimental
- Mix two labeled RNAs and hybridize to the chip
- Make two scans - one for each color
- Combine the images to calculate ratios of amounts of each RNA that bind to each spot
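The ratio calculation in the last step can be sketched as follows. This is a minimal illustration, not the method of any particular scanner software: the intensities are made up, red is assumed to be the experimental sample and green the control, and the `floor` parameter is a hypothetical guard against taking the log of zero or negative values.

```python
import math


def log2_ratios(red, green, red_bg, green_bg, floor=1.0):
    """Return log2(red / green) per spot after background subtraction.

    `floor` is a hypothetical minimum net signal; it keeps very dim
    spots from producing zero or negative values inside the log.
    """
    ratios = []
    for r, g, rb, gb in zip(red, green, red_bg, green_bg):
        r_net = max(r - rb, floor)  # background-subtracted red signal
        g_net = max(g - gb, floor)  # background-subtracted green signal
        ratios.append(math.log2(r_net / g_net))
    return ratios


# Made-up intensities for three spots (red = experimental, green = control).
red = [2100.0, 600.0, 350.0]
green = [600.0, 600.0, 1100.0]
red_bg = [100.0, 100.0, 100.0]
green_bg = [100.0, 100.0, 100.0]

ratios = log2_ratios(red, green, red_bg, green_bg)
for spot, ratio in enumerate(ratios, start=1):
    print(f"spot {spot}: log2(red/green) = {ratio:+.2f}")
```

Working on the log2 scale makes a 4-fold increase (+2) and a 4-fold decrease (-2) symmetric around zero.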
8Spot your own Chip (plans available for free from Pat Brown's website)
Robot spotter
Ordinary glass microscope slide
10Combine scans for Red and Green
False color image is made from digitized
fluorescence data, not by superimposing scanned
images
11cDNA Spotted Microarrays
13Affymetrix Gene chip system
- Uses 25 base oligos synthesized in place on a chip (20 pairs of oligos for each gene)
- RNA labeled and scanned in a single color
- one sample per chip
- Can have as many as 20,000 genes on a chip
- Arrays get smaller every year (more genes)
- Chips are expensive
- Proprietary system: "black box" software, can only use their chips
14Affymetrix Gene Chip
17Affymetrix Technology
19Affymetrix Pivot Table
20Data Acquisition
- Scan the arrays
- Quantitate each spot
- Subtract background
- Normalize
- Export a table of fluorescent intensities for
each gene in the array
21Automate!!
- All of this can be done automatically by software
- Much more consistent
- Mistakes will be made (especially in the spot quantitation) but you can't manually check hundreds of thousands of spots
23Affymetrix Software
- Affymetrix System is totally automated
- Computes a single value for each gene from 40 probes
- (using surprisingly kludgy math)
- Highly reproducible (re-scan of same chip or hybridization of duplicate chips with same labeled sample gives very similar results)
- Incorporates false results due to image artefacts
- dust, bubbles
- pixel spillover from bright spot to neighboring
dark spots
25Goals of a Microarray Experiment
- Find the genes that change expression between experimental and control samples
- Classify samples based on a gene expression profile
- Find patterns: groups of biologically related genes that change expression together across samples/treatments
26Basic Data Analysis
- Fold change (relative increase or decrease in intensity for each gene)
- Set cutoff filter for low values (background noise)
- Cluster genes by similar changes - only really meaningful across multiple treatments or time points
- Cluster samples by similar gene expression profiles
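The first two steps (fold change plus a cutoff for low values) might look like the sketch below. The gene names, intensities, and the two thresholds (`min_intensity`, `min_fold`) are illustrative assumptions, not values recommended by the slides.

```python
def fold_changes(control, treated, min_intensity=100.0, min_fold=2.0):
    """Return {gene: treated/control ratio} for genes passing both filters.

    Genes whose control AND treated values are both below `min_intensity`
    are treated as background noise and dropped. A single low value is
    raised to the cutoff so it cannot inflate the ratio.
    """
    changed = {}
    for gene, c in control.items():
        t = treated[gene]
        if c < min_intensity and t < min_intensity:
            continue  # both measurements are in the noise
        c = max(c, min_intensity)
        t = max(t, min_intensity)
        ratio = t / c
        # Keep genes that went up or down by at least `min_fold`.
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            changed[gene] = ratio
    return changed


# Made-up intensities for four hypothetical genes.
control = {"geneA": 200.0, "geneB": 500.0, "geneC": 40.0, "geneD": 1200.0}
treated = {"geneA": 800.0, "geneB": 550.0, "geneC": 60.0, "geneD": 300.0}

changed = fold_changes(control, treated)
for gene, ratio in changed.items():
    print(gene, ratio)
```

Here geneC is dropped even though its raw ratio is 1.5, because both of its values are below the noise cutoff.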
27Streamlined Affy Analysis
[Flowchart: Raw data -> Normalize (RMA) -> Filter (Present/Absent, Minimum value, Fold change) -> Significance (t-test, SAM, Rank Product), Classification (PAM, Machine learning), Clustering -> Gene lists]
28Sources of Variability
- Image analysis (identifying and quantitating each spot on the array)
- Scanning (laser and detector, chemistry of the fluorescent label)
- Hybridization (temperature, time, mixing, etc.)
- Probe labeling
- RNA extraction
- Biological variability
29Scatter plot of all genes in a simple comparison of two controls (A) and two treatments (B, high vs. low glucose), showing changes in expression greater than 2.2 and 3 fold.
30Thomas Hudson, Montreal Genome Center
31Normalization
- Can control for many of the experimental sources of variability (systematic, not random or gene specific)
- Bring each image to the same average brightness
- Can use simple math or fancy:
- divide by the mean (whole chip or by sectors)
- LOESS (locally weighted regression)
- No sure biological standards
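The "simple math" option (divide by the mean so every chip has the same average brightness) can be sketched as below. The intensities are made up, and the choice of the average chip mean as the common target is an assumption for illustration; LOESS is not shown.

```python
from statistics import mean


def scale_to_common_mean(chips, target=None):
    """Rescale each chip so that all chips share the same mean intensity.

    `chips` is a list of intensity lists (one list per chip). If no
    `target` is given, the average of the chip means is used.
    """
    chip_means = [mean(chip) for chip in chips]
    if target is None:
        target = mean(chip_means)
    return [
        [value * target / chip_mean for value in chip]
        for chip, chip_mean in zip(chips, chip_means)
    ]


# Two made-up chips; the second was scanned twice as bright overall.
chips = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
normalized = scale_to_common_mean(chips)
print(normalized)
```

After scaling, the two chips in this toy example become identical, which is what one would hope for when the only difference is overall brightness.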
32RMA
- Robust Multichip Average
- Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003). A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2):185-193
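The paper cited above compares normalization methods, one of which is quantile normalization, the normalization step used in RMA. The sketch below shows only that step, on made-up numbers, and ignores ties; the background correction and probe-set summarization that are also part of RMA are not shown.

```python
def quantile_normalize(chips):
    """Force every chip to share the same intensity distribution.

    `chips` is a list of equal-length intensity lists (one per chip).
    The k-th smallest value on each chip is replaced by the mean of the
    k-th smallest values across all chips. Ties are not treated specially.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [
        sum(s[k] for s in sorted_chips) / len(chips) for k in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # Probe indices ordered from dimmest to brightest on this chip.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
print(normalized)
```

Each probe keeps its rank within its own chip, but the set of values on every chip becomes the same.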
33Are the Treatments Different?
- Analysis of microarray data has tended to focus on making lists of genes that are up or down regulated between treatments
- Before making these lists, ask the question "Are the treatments different?"
- Use standard statistical methods to evaluate expression profiles for each treatment (t-test or F-test)
- If there are differences, find the genes most responsible
- If there are not significant overall differences, then lists of genes with large fold changes may only reflect random variability.
34Statistics
- When you have variability in measurements, you need replication and statistics to find real differences
- It's not just the genes with 2 fold increase, but those with a significant p-value across replicates
- Non-parametric (i.e. rank) or paired value statistics may be more appropriate
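As a rough illustration of testing one gene across replicates, the sketch below computes a Welch t statistic and an exact permutation p-value using only the Python standard library. The log2 intensities are made up, and the permutation approach is one possible choice here, not a method prescribed by the slides.

```python
import math
from itertools import combinations
from statistics import mean, variance


def t_statistic(a, b):
    """Welch t statistic for two groups of replicate measurements."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error


def permutation_p_value(a, b):
    """Two-sided p-value from every relabeling of the pooled replicates."""
    pooled = a + b
    observed = abs(t_statistic(a, b))
    hits = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        perm_a = [pooled[i] for i in idx]
        perm_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count relabelings at least as extreme as the real labeling.
        if abs(t_statistic(perm_a, perm_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Made-up log2 intensities for one gene, four replicates per group.
control = [7.1, 7.3, 6.9, 7.2]
treated = [8.4, 8.9, 8.6, 8.7]

t_value = t_statistic(control, treated)
p_value = permutation_p_value(control, treated)
print(f"t = {t_value:.2f}, permutation p = {p_value:.4f}")
```

With four replicates per group there are only 70 possible relabelings, so the smallest two-sided p-value this test can report is 2/70. That is one concrete reason why replication matters.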
35Multiple Comparisons
- In a microarray experiment, each gene (each probe or probe set) is really a separate experiment
- Yet if you treat each gene as an independent comparison, you will always find some with significant differences
- (the tails of a normal distribution)
36False Discovery
- Statisticians call false positives a "Type I error" or a "False Discovery"
- The expected number of false discoveries is the p-value cutoff of the t-test x the number of genes in the array; the False Discovery Rate (FDR) is that number as a fraction of the genes called significant
- For a p-value of 0.01 x 10,000 genes = 100 false "different" genes
- You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p < 0.001)
- The number of false discoveries must be smaller than the number of real differences that you find - which in turn depends on the size of the differences and variability of the measured expression values
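The arithmetic on this slide (p-value cutoff times number of genes) can be written out as a small sketch. The function names and the count of 50 called genes are hypothetical, and the second function is only a rough estimate of the false fraction, not a formal FDR procedure.

```python
def expected_false_positives(p_cutoff, n_genes):
    """Genes expected to pass the cutoff by chance alone."""
    return p_cutoff * n_genes


def estimated_false_fraction(p_cutoff, n_genes, n_called):
    """Rough share of the called genes expected to be false (capped at 1)."""
    if n_called == 0:
        return 0.0
    return min(1.0, expected_false_positives(p_cutoff, n_genes) / n_called)


n_genes = 10_000
for cutoff in (0.01, 0.001):
    print(cutoff, expected_false_positives(cutoff, n_genes))

# Hypothetical outcome: 50 genes pass the p = 0.001 cutoff.
false_fraction = estimated_false_fraction(0.001, n_genes, 50)
print(false_fraction)
```

If only 50 genes pass a cutoff that is expected to admit about 10 by chance, roughly a fifth of the list is likely to be noise.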
37SAM
- Significance Analysis of Microarrays
- Tusher, Tibshirani and Chu (2001). Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98:5116-5121 (Apr 24).
- Excel plugin
- Free
- Permutation based
- Most published method of microarray data analysis
38Higher Level Microarray Data Analysis
- Clustering and pattern detection
- Data mining and visualization
- Controls and normalization of results
- Statistical validation
- Linkage between gene expression data and gene sequence/function/metabolic pathways databases
- Discovery of common sequences in co-regulated genes
- Meta-studies using data from multiple experiments
39Types of Clustering
- Hierarchical
- Link similar genes, build up to a tree of all
- Self Organizing Maps (SOM)
- Split all genes into similar sub-groups
- Finds its own groups (machine learning)
- Principal Component
- every gene is a dimension (vector), find a single dimension that best represents the differences in the data
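A minimal sketch of the hierarchical idea (link the most similar genes first, then keep merging) is shown below. It assumes average linkage and a distance of 1 minus the Pearson correlation, which are common choices but not ones stated on the slide. It stops at a requested number of clusters instead of recording the full tree, and the expression profiles are made up.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )
    return numerator / denominator


def hierarchical_clusters(profiles, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain."""
    clusters = [[gene] for gene in profiles]

    def distance(c1, c2):
        # Average linkage: mean of all pairwise (1 - r) distances.
        return mean(
            1 - pearson(profiles[a], profiles[b]) for a in c1 for b in c2
        )

    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up profiles over four time points for four hypothetical genes.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.5, 4.0, 2.0],
}
clusters = hierarchical_clusters(profiles, n_clusters=2)
print(clusters)
```

Correlation-based distance groups genes by the shape of their profile, so geneA and geneB cluster together even though their absolute levels differ.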
40Cluster by color difference
41GeneSpring
43SOM Clusters
44Classification
- How to sort samples into two classes based on gene expression data
- Cancer vs. normal
- Cancer sub-types (benign vs. malignant)
- Responds well to drug vs. poor response (e.g. tamoxifen for breast cancer)
45Support Vector Machines
Fat planes: With an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one. Again, if a large margin separation exists, chances are good that we found something relevant.
Large Margin Classifiers
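To make the "fat plane" idea concrete, the sketch below measures how fat a given separating plane can be: the margin is the smallest distance from any sample to the plane. It does not train a support vector machine. The two candidate planes and the two-gene sample points are made up for illustration.

```python
import math


def margin(weights, bias, points, labels):
    """Smallest signed distance from any sample to the plane w.x + b = 0.

    Labels are +1 or -1. A negative result means at least one sample
    lies on the wrong side of the plane.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    return min(
        label * (sum(w * x for w, x in zip(weights, point)) + bias) / norm
        for point, label in zip(points, labels)
    )


# Made-up samples described by two gene expression values each.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 4.0)]
labels = [-1, -1, +1, +1]

# Two candidate planes that both separate the classes correctly.
margin_centered = margin((1.0, 0.0), -3.0, points, labels)  # plane x = 3
margin_offset = margin((1.0, 0.0), -2.5, points, labels)    # plane x = 2.5
print(margin_centered, margin_offset)
```

Both planes classify every sample correctly, but the centered one leaves twice as much room on either side, which is what a large margin classifier looks for.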
46PAM
- Prediction Analysis for Microarrays
- Class Prediction and Survival Analysis for Genomic Expression Data Mining
- Performs sample classification from gene expression data,
- via the "nearest shrunken centroid method" of Tibshirani, Hastie, Narasimhan and Chu (2002)
- "Diagnosis of multiple cancer types by shrunken centroids of gene expression"
- PNAS 99:6567-6572 (May 14).
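The sketch below shows plain nearest-centroid classification, the idea that the shrunken centroid method builds on. It is not PAM itself: the shrinkage step that pulls each class centroid toward the overall centroid (and so drops uninformative genes) is left out, and the expression values and class names are made up.

```python
from statistics import mean


def class_centroids(samples, labels):
    """Average expression profile (centroid) of each class."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_class.items()
    }


def classify(sample, centroids):
    """Assign a sample to the class with the nearest centroid."""

    def squared_distance(centroid):
        return sum((s - c) ** 2 for s, c in zip(sample, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Made-up training data: three genes measured on four samples.
samples = [
    [9.0, 2.0, 5.0],
    [8.0, 3.0, 5.0],
    [3.0, 7.0, 5.0],
    [2.0, 8.0, 5.0],
]
labels = ["cancer", "cancer", "normal", "normal"]

centroids = class_centroids(samples, labels)
prediction = classify([7.0, 4.0, 5.0], centroids)
print(prediction)
```

The third gene has the same value in every sample, so it contributes nothing to the decision. Genes like this are the ones that shrinkage is designed to remove.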
47BioConductor
- All of these normalization, statistical, and clustering methods are available in a free software package called BioConductor.
- www.bioconductor.org
- User hostile command line interface
- Uses scripts in the 'R' statistical language
> data(SpikeIn)
> pms <- pm(SpikeIn)
> mms <- mm(SpikeIn)
> par(mfrow = c(1, 2))
> concentrations <- matrix(as.numeric(sampleNames(SpikeIn)), 20, 12, byrow = TRUE)
> matplot(concentrations, pms, log = "xy", main = "PM", ylim = c(30, 20000))
> lines(concentrations[1, ], apply(pms, 2, mean), lwd = 3)
> matplot(concentrations, mms, log = "xy", main = "MM", ylim = c(30, 20000))
> lines(concentrations[1, ], apply(mms, 2, mean), lwd = 3)
48Functional Genomics
- Take a list of "interesting" genes and find their biological relationships
- Gene lists may come from significance/classification analysis of microarrays, proteomics, or other high-throughput methods
- Requires a reference set of "biological knowledge"
49Gene Ontology
- How to organize biological knowledge?
- Biologists work on a variety of different research organisms: yeast, fruit fly, mouse, human
- the same gene can have very different functions (antennapedia)
- and very different names (sonic hedgehog)
50GO
- Biologists got together a few years ago and developed a sensible system called Gene Ontology (GO)
- 3 hierarchical sets of terminology
- Biological Process
- Cellular Component (location within cell)
- Molecular Function
- about 1000 categories of functions
53Biological Pathways
54Microarray Databases
- Large experiments may have hundreds of individual array hybridizations
- Core lab at an institution or multiple investigators using one machine - data archive and validate across experiments
- Data-mining - look for similar patterns of gene expression across different experiments
55Public Databases
- Gene Expression data is an essential aspect of annotating the genome
- Publication and data exchange for microarray experiments
- Data mining/Meta-studies
- Common data format - XML
- MIAME (Minimum Information About a Microarray Experiment)
56GEO at the NCBI
57ArrayExpress at EMBL
59Gene Expression Technologies
- cDNA (EST) libraries
- SAGE
- Microarray
- rt-PCR
- RNA-seq
60The Cancer Genome Anatomy Project
- CGAP has collected a large amount of cDNA and related data online
- http://cgap.nci.nih.gov/
- cDNA libraries from various tissues
- search for genes
- compare expression levels
63SAGE
- Serial Analysis of Gene Expression is a technology that sequences very short fragments of mRNA (10 or 17 bp) that have been randomly ligated together
- The short tags are assigned to genes and then relative counts for each gene are computed for cDNA libraries from various tissues
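The tag-to-gene counting step can be sketched as follows. The 10 bp tag sequences, the gene names, and the two tissue libraries are all made up, and tags with no known gene are simply ignored here, which is a simplifying assumption.

```python
from collections import Counter


def relative_gene_counts(tags, tag_to_gene):
    """Return {gene: fraction of assigned tags} for one library.

    Tags that do not map to a known gene are ignored.
    """
    counts = Counter(tag_to_gene[tag] for tag in tags if tag in tag_to_gene)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


# Hypothetical lookup table from 10 bp tags to genes.
tag_to_gene = {"AAACCCGGGT": "geneA", "TTTGGGCCCA": "geneB"}

# Made-up tag lists from libraries of two tissues.
liver_tags = ["AAACCCGGGT", "AAACCCGGGT", "TTTGGGCCCA", "AAACCCGGGT", "ACGTACGTAC"]
brain_tags = ["TTTGGGCCCA", "AAACCCGGGT", "TTTGGGCCCA", "TTTGGGCCCA"]

liver = relative_gene_counts(liver_tags, tag_to_gene)
brain = relative_gene_counts(brain_tags, tag_to_gene)
print("liver:", liver)
print("brain:", brain)
```

Because the result is a fraction of each library, libraries of different sizes can be compared directly.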
64SAGE Genie
- SAGE Anatomic Viewer
- SAGE Digital Gene Expression Displayer
- Digital Northern
- SAGE Experiment Viewer
67Microarray
- GEO database at NCBI
- Microarray experiments
- Defined arrays
- Published results
- Also lots of inconclusive experiments
- Tools to search for specific genes
- Unreliable to search for tissue or disease in
experiment description text
72RNA-seq
- Next Generation DNA sequencing
- NYU currently has one Illumina Genome Analyzer
- generates more than 1 million RNA sequences per sample
- Currently seeking funding for a Roche/454
- produces 100K reads of 250-400 bp
73Count Transcripts
- Technology exists to accurately count transcripts and compare samples
- Digital Gene Expression
- Can also identify alternate isoforms, splice variants, etc.