1Statistical Methods for the Screening and
Classification of Microarray Gene Expression Data
Geoff McLachlan
Department of Mathematics
Institute for Molecular Bioscience
University of Queensland
http://www.maths.uq.edu.au/gjm
2Institute for Molecular Bioscience, University
of Queensland
3Liat Jones
Richard Bean
Justin Zhu
4Outline of Workshop
Part 1: Introduction to Microarray Technology
Part 2: Detecting Differentially Expressed Genes
in Known Classes of Tissue Samples
Part 3: Supervised Classification of Tissue
Samples
Part 4: Unsupervised Classification (Cluster
Analysis) of Tissue Samples and Gene Profiles
Part 5: Linking Microarray Data with
Survival Analysis
5A microarray is a new technology which allows the
measurement of the expression levels of thousands
of genes simultaneously.
- (1) Sequencing of the genome (human, mouse, and
others)
- (2) Improvement in technology to generate
high-density arrays on chips (glass slides or nylon
membrane)
The entire genome of an organism can be probed at
a single point in time.
6Draft of the Human Genome
Public Sequence: Nature, Feb. 2001
Celera Sequence: Science, Feb. 2001
7The Challenge for Statistical Analysis of
Microarray Data
Microarrays present new problems for statistics
because the data are very high dimensional with
very little replication.
The challenge is to extract useful information
and discover knowledge from the data, such as
gene functions, gene interactions, regulatory
pathways, metabolic pathways etc.
8Vital Statistics, by C. Tilstone, Nature 424,
610-612, 2003.
DNA microarrays have given geneticists and
molecular biologists access to more data than
ever before. But do these researchers have the
statistical know-how to cope?
Branching out: cluster analysis can group samples
that show similar patterns of gene expression.
9Representation of Data from M Microarray
Experiments
          Sample 1   Sample 2   ...   Sample M
Gene 1
Gene 2
...
Gene N
Assume we have extracted gene expression values
from intensities.
Expression Signature
Expression Profile
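The layout above can be sketched as a small array. This is only an illustration: the toy values and the helper names `expression_signature` and `expression_profile` are hypothetical. The reading of a signature as a column (all genes in one sample) and a profile as a row (one gene across all samples) is an assumption based on common usage, since the slide shows only the two labels.

```python
import numpy as np

# Hypothetical toy data: N = 4 genes (rows) by M = 3 samples (columns).
genes = ["Gene 1", "Gene 2", "Gene 3", "Gene 4"]
samples = ["Sample 1", "Sample 2", "Sample 3"]
X = np.array([
    [0.5, -1.2, 0.3],
    [2.1, 1.9, 2.4],
    [-0.7, 0.0, -0.4],
    [1.0, -0.5, 0.8],
])


def expression_signature(X, j):
    """Column j: expression levels of all N genes in sample j (assumed meaning)."""
    return X[:, j]


def expression_profile(X, i):
    """Row i: expression levels of gene i across all M samples (assumed meaning)."""
    return X[i, :]


signature_1 = expression_signature(X, 0)  # all genes in Sample 1
profile_2 = expression_profile(X, 1)      # Gene 2 across all samples
```

With N typically in the thousands and M in the tens, the matrix is far taller than it is wide, which is the high-dimension, low-replication setting described earlier.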
10- It is assumed that the (logged) expression
levels have been preprocessed with adjustment for
array effects.
11- Majority of time on a data analysis project will
be spent cleaning the data before doing any analysis.
- Paradoxically, most statistical training assumes
that the data arrive precleaned. Students, whether in
PhD programs or an undergraduate introductory course, are
not taught routinely to check data for accuracy or even
to worry about it. Exacerbating the problem further are claims
by software vendors that their techniques can produce valid
results no matter what the quality of the
incoming data.
De Veaux and Hand (How to Lie with Bad
Data, Statist. Sci., 2005)
12Large-scale gene expression studies are not a
passing fashion, but are instead one aspect of
new work of biological experimentation, one
involving large-scale, high throughput assays.
Speed et al., 2002, Statistical Analysis of Gene
Expression Microarray Data, Chapman and Hall/CRC
13Growth of microarray and microarray methodology
literature listed in PubMed from 1995 to 2003.
The category "all microarray papers" includes
those found by searching PubMed for "microarray"
OR "gene expression profiling". The category
"statistical microarray papers" includes those
found by searching PubMed for "statistical
method" OR "statistical techniq" OR
"statistical approach" AND "microarray" OR "gene
expression profiling".
14Mehta et al. (Nature Genetics, Sept. 2004)
The field of expression data analysis is
particularly active with novel analysis
strategies and tools being published weekly, and
the value of many of these methods is
questionable. Some results produced by using
these methods are so anomalous that a breed of
forensic statisticians (Ambroise and McLachlan,
2002; Baggerly et al., 2003), who doggedly detect
and correct other HDB (high-dimensional biology)
investigators' prominent mistakes, has been
created.
19Analyzing Microarray Gene Expression Data (UQ, Wiley)
Analysis of Microarray Gene Expression Data (Harvard, Kluwer)
The Analysis of Gene Expression Data (Johns Hopkins, Springer)
The Statistical Analysis of Gene Expression Data (Berkeley, CH)
23Analyzing Microarray Gene Expression Data
Analysis of Microarray Gene Expression Data
The Analysis of Gene Expression Data
The Statistical Analysis of Gene Expression Data
Statistics for Microarrays
Design and Analysis of DNA Microarrays
Exploration and Analysis of Microarrays
Data Analysis Tools for DNA Microarrays
24In the sequel, references to most of the material
presented can be found in my joint
book, McLachlan, Do, and Ambroise (2004),
Analyzing Microarray Gene Expression Data,
Hoboken, NJ: Wiley.
26Contents
- Microarrays in Gene Expression Studies
- Cleaning and Normalization
- Some Cluster Analysis Methods
- Clustering of Tissue Samples
- Screening and Clustering of Genes
- Discriminant Analysis
- Supervised Classification of Tissue Samples
- Linking Microarray Data with Survival Analysis
27Distribution of References by Year
Year    Number of references
2004    34
2003    73
2002    80
2001    93
2000    47 (67.8)
Total   481
28mRNA Levels Indirectly Measure Gene Activity
- Essentially every cell contains the same genes.
- The type and amount of mRNA produced by a cell tell
which genes are being expressed.
- Cells differ in the genes which are active at
any one time.
- Gene expression is the transcription of
DNA to mRNA; the mRNA is then translated to proteins.
29Technical Background
Two recent advances
- Human Genome Project (also other sequenced
genomes: mouse, dog, etc.)
- DNA microarray technology -- works by exploiting
the ability of a given mRNA molecule to bind
specifically to (hybridize) the DNA template from
which it originated
30What is a DNA microarray?
- Small, solid supports onto which the sequences
from thousands (tens of thousands) of genes are
attached at fixed locations.
- They may be glass slides, or silicon chips or
nylon membranes.
- The DNA is printed, spotted or synthesized
directly onto the support
- The spots can be DNA, cDNA or oligonucleotides.
31The microarray experiment
Spot DNA (known)
Sample (unknown)
32Microarrays Indirectly Measure Levels of mRNA
- mRNA is extracted from the cell
- mRNA is reverse transcribed to cDNA (mRNA itself
is unstable)
- cDNA is labeled with fluorescent dye (TARGET)
- The sample is hybridized to known DNA sequences
on the array (tens of thousands of genes) (PROBE)
- If present, complementary target binds to probe
DNA (complementary base pairing)
- Target bound to probe DNA fluoresces
33The microarray experiment
- mRNA from the cell (sample) is washed over the
surface (HYBRIDIZATION)
- The amount of bound mRNA is measured at each spot
Allows the measurement of expression for
thousands of genes from the amount of bound mRNA.
34A Spotted cDNA Microarray Experiment
- Compare the gene expression levels for
two cell populations on a single microarray,
e.g. tumour and normal cells
36Microarray Image
Red: High expression in target labelled with cyanine 5 dye
Green: High expression in target labelled with cyanine 3 dye
Yellow: Similar expression in both target samples
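The colour coding of a two-dye image can be illustrated with a small sketch that labels a spot from its Cy5 and Cy3 intensities. The function name `spot_colour` and the two-fold cut-off (`threshold=1.0` on the log2 scale) are hypothetical choices made for this example and do not come from the slides. A real image also has dark spots where neither target is expressed, which this sketch ignores.

```python
import math


def spot_colour(cy5, cy3, threshold=1.0):
    """Label a spot by its log2(Cy5/Cy3) ratio.

    threshold is a hypothetical cut-off on the log2 scale
    (1.0 corresponds to a two-fold difference).
    Both intensities must be positive.
    """
    log_ratio = math.log2(cy5 / cy3)
    if log_ratio > threshold:
        return "red"     # higher expression in the Cy5-labelled target
    if log_ratio < -threshold:
        return "green"   # higher expression in the Cy3-labelled target
    return "yellow"      # similar expression in both targets


colours = [spot_colour(4000, 500), spot_colour(500, 4000), spot_colour(1000, 1100)]
```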
37Assumptions
Gene Expression -> (1) -> mRNA -> (2) -> Fluorescence Intensity
(1) cellular mRNA levels directly reflect gene
expression
(2) intensity of bound target is a measure of the
abundance of the mRNA in the sample.
38Experimental Error
Sample contamination
Poor quality/insufficient mRNA
Reverse transcription bias
Fluorescent labeling bias
Hybridization bias
Cross-linking of DNA (double strands)
Poor probe design (cross-hybridization)
Defective chips (scratches, degradation)
Background from non-specific hybridization
39Why are microarrays important?
- They contain a very large number of genes and
are very small.
- Compare gene expression within a single sample
or in two different cell types or tissue samples
- Examine expressions in a single sample on a
genome-wide scale (GENOMICS)
- Infer new gene functions and develop diagnostic tools,
e.g. in cancer, where they provide a molecular view.
40The Microarray Technologies
Spotted Microarray
- cDNAs, clones, or short and long
oligonucleotides deposited onto glass slides
- Each gene (or EST) represented by its
purified PCR product
- Simultaneous analysis of two samples (treated vs
untreated cells) provides internal control
- Gives relative gene expressions
Affymetrix GeneChip
- Short oligonucleotides synthesized in situ onto
glass wafers
- Each gene represented multiply,
using 16-20 (preferably non-overlapping) 25-mers
- Each oligonucleotide has a single-base mismatch
partner for internal control of hybridization
specificity
- Gives absolute gene expressions
Each with its own advantages and disadvantages
41Pros and Cons of the Technologies
Spotted Microarray
- Flexible and cheaper
- Allows study of genes not yet sequenced (spotted
ESTs can be used to discover new genes and their
functions)
- Variability in spot quality from slide to slide
- Provides information only on relative gene
expressions between cells or tissue samples
Affymetrix GeneChip
- More expensive yet less flexible
- Good for whole genome expression analysis where
the genome of that organism has been sequenced
- High quality with little variability between slides
- Gives a measure of absolute expression of genes
42Aims of a Microarray Experiment
- observe changes in a gene in response to
external stimuli (cell samples exposed to hormones, drugs,
toxins)
- compare gene expressions between different
tissue types (tumour vs normal cell samples)
- To gain understanding of
- function of unknown genes
- disease process at the molecular level
- Ultimately to use as tools in Clinical Medicine
for diagnosis, prognosis and therapeutic management.
43Importance of Experimental Design
- Good DNA microarray experiments should have
clear objectives.
- Not performed as aimless data mining in search
of unanticipated patterns that will provide
answers to unasked questions
(Richard Simon, BioTechniques 34:S16-S21, 2003)
44Replicates
Technical replicates: arrays that have been
hybridized to the same biological source (using
the same treatment, protocols, etc.)
Biological replicates: arrays that have been
hybridized to different biological sources, but
with the same preparation, treatments, etc.
45Extracting Data from the Microarray
- Cleaning
- Image processing
- Filtering
- Missing value estimation
- Normalization
- Remove sources of systematic variation.
Sample 1
Sample 2
Sample 3
Sample 4 etc
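The filtering, missing value estimation and normalization steps listed above might be sketched as follows, starting from an already extracted genes-by-samples matrix (image processing is not covered). This is a minimal illustration under stated assumptions: the 50% missing-value cut-off, row-mean imputation and per-array median centring are simple stand-ins chosen for the example, not the methods used in the workshop. All function names are hypothetical.

```python
import numpy as np


def filter_genes(X, max_missing_fraction=0.5):
    """Drop genes (rows) whose fraction of missing values exceeds the cut-off.

    Returns the filtered matrix and the boolean mask of kept rows.
    The cut-off value is an assumption made for this example.
    """
    missing_fraction = np.isnan(X).mean(axis=1)
    keep = missing_fraction <= max_missing_fraction
    return X[keep], keep


def impute_row_mean(X):
    """Replace each missing value by the mean of the observed values of that gene."""
    X = X.copy()
    row_means = np.nanmean(X, axis=1)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = row_means[rows]
    return X


def median_centre_arrays(X):
    """Subtract each array's (column's) median so that all arrays share median 0."""
    return X - np.median(X, axis=0)


# Hypothetical logged expression values: 4 genes by 3 samples, NaN = missing.
raw = np.array([
    [1.0, 2.0, 3.0],
    [np.nan, np.nan, 5.0],
    [4.0, np.nan, 6.0],
    [0.0, 1.0, 2.0],
])

filtered, kept = filter_genes(raw)   # second gene is dropped (2 of 3 values missing)
imputed = impute_row_mean(filtered)  # missing value of third gene becomes 5.0
normalized = median_centre_arrays(imputed)
```

Filtering before imputation ensures that every remaining gene has at least one observed value, so the row mean is always defined.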
48Examples of spot imperfections:
A. donut shape
B. oval or pear shape
C. holey heterogeneous interior
D. high-intensity artifact
E. sickle shape
F. scratches
49Gene Expressions from Measured Intensities
Spotted Microarray
log2(Intensity Cy5 / Intensity Cy3)
Affymetrix
(Perfect Match Intensity - Mismatch Intensity)
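The two summaries above can be written out as a short sketch. The function names and toy intensities are hypothetical. Averaging the perfect match minus mismatch differences over a gene's probe pairs is an assumption made here to reduce them to a single value, since the slide gives only the difference itself.

```python
import numpy as np


def log_ratio(cy5, cy3):
    """Spotted array: log2 of the Cy5 to Cy3 intensity ratio, spot by spot."""
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    return np.log2(cy5 / cy3)


def average_difference(pm, mm):
    """Affymetrix-style summary for one gene.

    pm, mm: perfect match and mismatch intensities for the gene's probe pairs.
    Taking the mean of (PM - MM) over probe pairs is an assumption of this sketch.
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    return float(np.mean(pm - mm))


# Hypothetical intensities.
ratios = log_ratio([800, 100], [200, 400])                   # two spots
avg_diff = average_difference([120, 150, 130], [100, 110, 90])  # three probe pairs
```

On the log2 scale a four-fold increase and a four-fold decrease are symmetric about zero (+2 and -2), which is one reason the ratio is logged.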
50Data Transformation
Rocke and Durbin (2001), Munson (2001), Durbin et
al. (2002), and Huber et al. (2002)
51Representation of Data from M Microarray
Experiments
          Sample 1   Sample 2   ...   Sample M
Gene 1
Gene 2
...
Gene N
Assume we have extracted gene expression values
from intensities.
Expression Signature
Gene expressions can be shown as Heat Maps
Expression Profile
52- It is assumed that the (logged) expression
levels have been preprocessed with adjustment for
array effects.