Title: Microarrays
1Microarrays
Carsten Daub, www.mpimp-golm.mpg.de, daub_at_mpimp-golm.mpg.de
2Microarrays
3Microarrays
- Short Introduction to Microarray Technology
- The data matrix
- Distance metrics
- Clustering techniques
- Relevance of networks
4Microarrays
DNA
Cell
RNA Polymerase
mRNA
labeled Nucleotides
Reverse Transcriptase
labeled cDNA
5Microarrays
- The concentration of labels in the cDNA is measured!
- Assumptions for interpretation of Microarray data
- mRNA is stable over time until cDNA is transcribed
- The concentration of cDNA equals the concentration of mRNA
- The labeled nucleotides are inserted statistically
- The labels are often fluorophores with a cyanine structure
cy3
6Microarrays
labeled cDNA as probe
target
- cDNA Microarrays
- Immobilisation of whole cDNA
- cDNA is obtained via bacteria
- Spotted by inkjet technology
- cDNA libraries are needed
- oligonucleotide Microarrays
- Immobilisation of characteristic sequences
- Sequences are designed
- Sequences are synthesized on the chip
- Several commercial products are available
7Comparing two situations
cDNA Microarrays
Situation A: treated probe
Situation B: wild type
(A reference is needed to compare data from different arrays.)
Mixture of A and B is spotted
Extraction of intensities with image analysis software
8Comparing two situations
- 0 < ratio < Inf.
- -Inf. < log2(ratio) < Inf.
- where
- log2(ratio) > 0: increase
- log2(ratio) < 0: decrease
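The log2 transform of the ratio can be sketched as follows. The function name and the example intensities are made up for illustration, and taking the ratio as intensity(A) / intensity(B) is an assumption, since the slide's formula is not preserved here.

```python
import math

def log2_ratio(intensity_a, intensity_b):
    """Return log2(A / B) for two positive spot intensities."""
    if intensity_a <= 0 or intensity_b <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(intensity_a / intensity_b)

print(log2_ratio(800.0, 200.0))  # 2.0, a four-fold increase
print(log2_ratio(200.0, 800.0))  # -2.0, a four-fold decrease
print(log2_ratio(500.0, 500.0))  # 0.0, no change
```

The log scale makes increases and decreases symmetric around zero, whereas the raw ratio squeezes all decreases into the interval between 0 and 1.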
9Absolute Intensities
10Data Matrix
11Data Normalisation
- Background noise
- The background is determined around each spot
- Some spots are left empty to determine background
- There are several different procedures to subtract background
- The background around a point is subtracted for that point
- The average background is subtracted from each point
- A background profile is determined for the whole array; it can also be used to determine systematic errors of the array
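Two of the background procedures listed above, local and average subtraction, might look like this in a minimal sketch. The function names and the intensity values are hypothetical, and the background profile variant is not shown.

```python
def subtract_local(signal, local_background):
    """Subtract each spot's own surrounding background."""
    return [s - b for s, b in zip(signal, local_background)]

def subtract_average(signal, local_background):
    """Subtract the mean background of the array from every spot."""
    mean_bg = sum(local_background) / len(local_background)
    return [s - mean_bg for s in signal]

signal = [520.0, 310.0, 1250.0]      # made-up spot intensities
background = [100.0, 110.0, 90.0]    # made-up background around each spot
print(subtract_local(signal, background))    # [420.0, 200.0, 1160.0]
print(subtract_average(signal, background))  # [420.0, 210.0, 1150.0]
```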
12Data Normalisation
- Correct for
- Differences in labelling and detection efficiencies
- Differences in quantity of initial mRNA
- Some widely used techniques
- Total intensity normalisation
- Assumption: Quantity of initial mRNA is similar for both labelled samples
- Total integrated intensities are equal for both samples
- Re-scale intensities for each gene
- Normalisation using regression techniques
- Normalisation using ratio statistics
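Total intensity normalisation can be sketched as a single scaling factor applied to one channel. Which channel is rescaled (here Cy5 onto Cy3) is an arbitrary choice for this illustration, and the values are invented.

```python
def total_intensity_normalise(cy5, cy3):
    """Rescale cy5 so that both channels have the same total intensity."""
    factor = sum(cy3) / sum(cy5)
    return [x * factor for x in cy5], factor

cy3 = [100.0, 200.0, 300.0, 400.0]   # made-up intensities, total 1000
cy5 = [220.0, 380.0, 640.0, 760.0]   # made-up intensities, total 2000
cy5_scaled, factor = total_intensity_normalise(cy5, cy3)
print(factor)      # 0.5
print(cy5_scaled)  # [110.0, 190.0, 320.0, 380.0]
```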
13Data Normalisation
- Some widely used techniques (continued)
- Normalisation using regression techniques
- Assumption: mRNA derived from closely related samples is used
- Scatterplot of int(Cy5) vs. int(Cy3) has a diagonal part with slope 1
- Re-scale intensities for each gene
- Local regression techniques also available for non-linear scatterplots
- Normalisation using ratio statistics
- Confidence limits are calculated via statistics
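One simple reading of the regression approach is to fit a straight line to the Cy5 vs. Cy3 scatterplot and rescale Cy5 so that the fitted slope becomes 1. The sketch below assumes a least-squares line through the origin, which is a simplification chosen for illustration; local regression for non-linear scatterplots is not shown.

```python
def slope_through_origin(x, y):
    """Least-squares slope of y = slope * x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def regression_normalise(cy5, cy3):
    """Rescale cy5 so that the fitted cy5-vs-cy3 slope becomes 1."""
    slope = slope_through_origin(cy3, cy5)
    return [v / slope for v in cy5], slope

cy3 = [100.0, 200.0, 400.0]   # made-up intensities
cy5 = [150.0, 300.0, 600.0]   # made-up intensities, 1.5 times brighter
cy5_scaled, slope = regression_normalise(cy5, cy3)
print(slope)       # 1.5
print(cy5_scaled)  # [100.0, 200.0, 400.0]
```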
14Data Normalisation
15Data Normalisation
16Undefined values
- Gene expression datasets often contain undefined values.
- The background and the signal give similar intensity
- Surface of the chip is not planar (cDNA chips)
- The probe is not properly fixed on the chip
- Hybridisation step didn't work properly
- Probe was not washed away properly
- Undefined values can be
- Discarded -> leads to problems in data analysis
- Replaced with averages of rows/columns
- Replaced by zero
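Replacing undefined values by row averages could be sketched like this. Representing an undefined value as `None` and falling back to zero for a row without any defined value are assumptions of this example.

```python
def fill_with_row_means(matrix):
    """Replace None entries by the mean of the defined values in the same row."""
    filled = []
    for row in matrix:
        defined = [v for v in row if v is not None]
        # A row without any defined value falls back to zero.
        mean = sum(defined) / len(defined) if defined else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled

data = [[1.0, None, 3.0],
        [None, 4.0, 6.0]]
print(fill_with_row_means(data))  # [[1.0, 2.0, 3.0], [5.0, 4.0, 6.0]]
```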
17Similarity of Genes
- Co-regulated genes show a similar behaviour
- There exists no a priori definition of similarity
- Different similarity measures are used
- Different patterns are seen with different
similarity measures
18Similarity of Genes
- Example
- An expression vector is defined for each gene
- The expression values are the dimensions (x1, x2, ..., xn) in expression space
- Each gene is represented as one n-dimensional point in expression space
- Two genes with similar expression behaviour are spatially nearby
- Similarity is inversely proportional to spatial distance
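The idea of genes as points in expression space can be illustrated by looking up the nearest neighbour of a gene. The gene names and expression values are invented, and Euclidean distance is only one possible choice of spatial distance.

```python
import math

# Hypothetical expression vectors, one n-dimensional point per gene.
expression = {
    "gene_a": [0.0, 1.0, 2.0],
    "gene_b": [0.0, 1.0, 2.5],
    "gene_c": [3.0, -3.0, 2.0],
}

def nearest_gene(name, vectors):
    """Return the other gene closest to `name` in expression space."""
    others = (g for g in vectors if g != name)
    return min(others, key=lambda g: math.dist(vectors[name], vectors[g]))

print(nearest_gene("gene_a", expression))  # gene_b
```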
19Similarity of Genes
- Example (continued)
- Different clustering algorithms can be applied to
find clusters of genes
20Similarity of Experiments
- Instead of clustering genes it is also possible to cluster experiments
- Each experiment is regarded as experiment vector in the experiment space
21Visualisation of Data Matrix
- The data matrix can be visualised to get an impression of gene expression behaviour.
- The order of genes in data matrix can be reorganised for better recognition of expression patterns.
22Distance Metrics
- For clustering algorithms the calculation of a distance between gene vectors or experiment vectors is a necessary step
- Distance metrics can be classified as
- Metric distances
- Semi-metric distances
- Metric distances obey
- 1) $d_{ab} \geq 0$
- 2) $d_{ab} = d_{ba}$
- 3) $d_{aa} = 0$
- 4) $d_{ab} \leq d_{ac} + d_{cb}$
- Semi-metric distances obey 1) to 3), fail in 4)
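Condition 4) is the triangle inequality. A small check on a handful of points shows how a distance can satisfy the first three conditions and still fail the fourth. The two example distances (absolute and squared difference of single values) are chosen only for illustration.

```python
from itertools import permutations

def obeys_triangle_inequality(points, dist):
    """Check d(a, b) <= d(a, c) + d(c, b) for all triples of the given points."""
    return all(dist(a, b) <= dist(a, c) + dist(c, b)
               for a, b, c in permutations(points, 3))

def absolute_difference(a, b):
    return abs(a - b)

def squared_difference(a, b):
    return (a - b) ** 2

points = [0.0, 1.0, 2.0]
print(obeys_triangle_inequality(points, absolute_difference))  # True
print(obeys_triangle_inequality(points, squared_difference))   # False
```

The squared difference fails because d(0, 2) = 4 exceeds d(0, 1) + d(1, 2) = 2. Passing the check on a few points does not prove that a distance is a metric; it can only expose violations.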
23Distance Metrics
Minkowski distance
If q = 1, d is Manhattan distance (metric distance)
If q = 2, d is Euclidean distance (metric distance)
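The slide's formula is not preserved in this text. The sketch below uses the standard definition d(i,j) = (sum over k of |x_ik - x_jk|^q)^(1/q); the function name and the example vectors are chosen for illustration.

```python
def minkowski(u, v, q):
    """Minkowski distance of order q between two equally long vectors."""
    return sum(abs(a - b) ** q for a, b in zip(u, v)) ** (1.0 / q)

u = [0.0, 0.0]
v = [3.0, 4.0]
print(minkowski(u, v, 1))  # 7.0, Manhattan distance
print(minkowski(u, v, 2))  # 5.0, Euclidean distance
```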
24Distance Metrics
Pearson correlation coefficient (semi-metric distance)
$-1 \leq d(i,j) \leq 1$
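A minimal sketch of the Pearson correlation coefficient between two expression vectors, using the standard textbook definition (the slide's own formula is not preserved here). The example vectors are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0]
print(pearson(x, [2.0, 4.0, 6.0, 8.0]))  # 1.0, same direction
print(pearson(x, [8.0, 6.0, 4.0, 2.0]))  # -1.0, opposite direction
```

The function is undefined (division by zero) for a vector whose values are all equal.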
25Distance Metrics
Rank-ordered Pearson correlation coefficient -> Spearman (semi-metric distance)
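Spearman correlation can be sketched as the Pearson coefficient computed on ranks instead of values. The ranking helper below assumes there are no tied values, which keeps the example short; the data are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 to n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman(x, y):
    """Pearson correlation of the rank-transformed vectors."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 10.0, 100.0, 1000.0]   # monotone but not linear in x
print(spearman(x, y))  # 1.0
print(pearson(x, y))   # below 1, because the relation is not linear
```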
26Distance Metrics
Other variations of Pearson's correlation coefficient
Uncentered Pearson correlation (semi-metric distance)
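The uncentered variant is commonly defined like the Pearson coefficient but without subtracting the means, which makes it the cosine of the angle between the two vectors. The sketch assumes that definition, since the slide's formula is not preserved; the example data are invented.

```python
import math

def uncentered_pearson(x, y):
    """Correlation without mean subtraction (cosine of the angle between x and y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

x = [1.0, 2.0, 3.0]
print(uncentered_pearson(x, [2.0, 4.0, 6.0]))     # 1.0, same shape and same offset
print(uncentered_pearson(x, [11.0, 12.0, 13.0]))  # about 0.949, same shape but shifted by 10
```

Unlike the centered coefficient, this variant penalises a constant offset between two otherwise identical profiles.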
27Distance Metrics
Entropy based distances
Mutual Information (semi-metric distance)
- Mutual Information (MI) is a statistical representation of the correlation of two signals A and B.
- MI is a measure of the additional information known about one expression pattern when given another.
- MI is not based on linear models and can therefore also see non-linear dependencies (see picture).
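A rough sketch of how mutual information might be estimated from two expression patterns. It assumes equal-width binning with an arbitrarily chosen number of bins; both the binning scheme and the example data are illustrative, and the estimate depends strongly on them.

```python
import math
from collections import Counter

def discretise(values, bins):
    """Map each value to an equal-width bin index between 0 and bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # constant input: use a single bin
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=3):
    """Mutual information (in bits) of the binned signals x and y."""
    a, b = discretise(x, bins), discretise(y, bins)
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (count_a[i] * count_b[j]))
               for (i, j), c in count_ab.items())

x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v * v for v in x]                     # non-linear dependency on x
print(mutual_information(x, y))            # clearly above zero
print(mutual_information(x, [1.0] * 7))    # 0.0, a constant carries no information
```

For this symmetric example the Pearson coefficient of x and y is zero, while the mutual information is not.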
28Clustering Techniques
- Several classification criteria of clustering algorithms exist with regard to
- Clustering result: hierarchical - not hierarchical
- Clustering process: divisive - agglomerative
- Clustering criterion: sequential - global
29Clustering Techniques
Hierarchical clustering
30Clustering Techniques
Hierarchical clustering: phylogenetic tree
31Clustering Techniques
- Hierarchical clustering
- Various hierarchical clustering algorithms exist
- Single-linkage clustering, nearest-neighbour
- Complete-linkage, furthest-neighbour
- Average-linkage, unweighted pair-group method average (UPGMA)
- Weighted-pair group average, UPGMA weighted by cluster sizes
- Within-groups clustering
- Ward's method
- ...
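The linkage rules differ only in how the distance between two clusters is derived from the distances between their members. The naive agglomerative sketch below covers single, complete and average linkage; it is written for readability rather than speed, uses Euclidean distance, and does not implement the other listed variants. All names and data points are illustrative.

```python
import math

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters given as tuples of point indices."""
    d = [math.dist(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # nearest neighbour
        return min(d)
    if linkage == "complete":    # furthest neighbour
        return max(d)
    if linkage == "average":     # mean over all member pairs
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, linkage="average"):
    """Merge the two closest clusters until one is left; return the merges."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        pairs = [(cluster_distance(a, b, points, linkage), a, b)
                 for k, a in enumerate(clusters) for b in clusters[k + 1:]]
        dist, a, b = min(pairs)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
for a, b, dist in agglomerate(points, "single"):
    print(a, b, dist)
```

The sequence of merges, read from first to last, is what a dendrogram displays from the leaves to the root.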
32Clustering Techniques
- k-means clustering
- 1) The number of clusters k has to be chosen in advance
- 2) Initial position of cluster centers is random
- 3) For each data point the (Euclidean) distance to each cluster center is calculated
- 4) Each data point is assigned to its nearest cluster
- 5) Cluster centers are shifted to the center of data points assigned to this cluster center
- 6) Steps 3) to 5) are iterated until cluster centers are not shifted anymore
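The k-means procedure described above can be sketched in a few lines. Picking the initial centers at random from the data points, the fixed random seed and the iteration cap are choices made for this example only.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of coordinate tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)             # random initial centers
    for _ in range(max_iter):
        # Assign every point to its nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Shift every center to the mean of the points assigned to it.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster keeps its center
        if new_centers == centers:              # nothing moved: converged
            break
        centers = new_centers
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(labels)
```

Which cluster receives label 0 depends on the random start, so only the grouping itself is meaningful.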
33Clustering Techniques
k-means clustering (continued)
A reasonable number of cluster centers k can be estimated
34Clustering Techniques
k-means clustering (continued)
The initial position of cluster centers can be estimated by the distribution of the data vectors
35Clustering Techniques
- Principal Component Analysis (PCA)
- Singular value decomposition (SVD)
- Reduction of effective dimensionality of gene-expression space
- by (linear) combination of initial dimensions
- Mathematically complex
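To give a feeling for the linear combination of initial dimensions, the sketch below finds only the first principal component, using power iteration on the covariance matrix. This is a simplified stand-in chosen for illustration, not a full SVD; it assumes the data have non-zero variance, and the example matrix is invented.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit vector of largest variance, mean-centred rows)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centred = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vec = [1.0 / (d + 1) for d in range(dims)]   # arbitrary start vector
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec, centred

def project(centred, component):
    """Coordinate of every row along the given component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centred]

# Four made-up measurements in two dimensions, lying on one straight line.
rows = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
component, centred = first_principal_component(rows)
print(component)                    # direction proportional to (2, 1)
print(project(centred, component))  # one coordinate per row instead of two
```

Here the two original dimensions collapse into a single coordinate without loss, because the points lie exactly on one line.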
36Clustering Techniques
- No clustering technique is better than another
- Different clustering techniques have been shown to lead to reasonable results depending on the measured data
- Organism
- Tissue
- Experimental condition, e.g. mutants or time course experiments
- Distance metric used
37Visualisation
Visualisation in clusters
38Relevance of Networks
- There is no absolute measure for quality of clusters
- Biological relevance of clusters can be checked with biochemical knowledge
- Functional classifications are provided by some databases, e.g. MIPS (http://gsf.mips.de)
- Genes in clusters can be checked if a majority is annotated to one or a few functional classes
39Conclusion
- Microarrays measure the concentration of mRNA
- Several assumptions are made in the process of analysis
- Biological
- Numerical
- Experimental conditions are important!
- Different distance metrics combined with different clustering algorithms can lead to completely different results
40Software Sources
- AFM (Array File Maker) from Samuel Lunenfeld Research Institute, Mount Sinai Hospital (Toronto) http://www.mshri.on.ca/tyers/software.html
- AMADA (Analyzing Microarray Data) from Univ of Hong Kong http://web.hku.hk/xxia/software/AMADA.htm
- ANOVA programs for microarray data from Churchill's group at Jackson Lab http://www.jax.org/research/churchill/software/anova/index.html
- CLUSFAVOR (CLUSter and Factor Analysis using Varimax Orthogonal Rotation), from Baylor http://mbcr.bcm.tmc.edu/genepi/
- CLUSTER, TREEVIEW from Eisen's lab at Lawrence Berkeley National Lab http://rana.lbl.gov/EisenSoftware.htm also, http://www.microarrays.org/software.html
- D-CHIP specifically for Affymetrix chip data, from Wong's lab at Biostatistic Department, Harvard University http://www.dchip.org
- GENE-CLUSTER Whitehead Institute (registration is required). A commercial version of this program is also developed, at Affymetrix http://www.genome.wi.mit.edu/MPR/software.html
- J-EXPRESS Univ of Bergen, Norway http://www.ii.uib.no/bjarted/jexpress/index.html
- ONTO-EXPRESS Wayne Univ http://vortex.cs.wayne.edu/Projects.html
- PAGE (PAtterns from Gene Expression) from U Penn. http://www.cbil.upenn.edu/PaGE/
- PLAID from Owen's lab at Stanford Univ. http://www-stat.stanford.edu/owen/plaid/
- SAM (significance analysis of microarrays) from Tibshirani's group at Stanford Univ. http://www-stat.stanford.edu/tibs/SAM/index.html
- SAM (significance of array measurement) from Institute for Systems Biology http://www.systemsbiology.org/VERAandSAM/
- SMA (Statistic for Microarray Analysis, in R or S-PLUS subroutines) from Speed's lab at Statistics Department of UC Berkeley http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
- SVDMAN from Los Alamos National Laboratory http://public.lanl.gov/mewall/svdman/
- TREE-ARRANGE and TREEPS from Univ Waterloo http://monod.uwaterloo.ca/downloads/treearrange/
- VERA (variability and error assessment) from Institute for Systems Biology http://www.systemsbiology.org/VERAandSAM/
- XCLUSTER from Sherlock's lab of Genetics Department of Stanford Univ http://genome-www.stanford.edu/sherlock/cluster.html