Title: Microarray Data Analysis
1Microarray Data Analysis
2Microarray Data Analysis
- Objectives
- Identify genes that are differentially expressed between two experimental conditions
- Identify patterns of gene expression in a sample at a particular state
- Understand biological systems
3Normalization
- Adjust raw intensity values to correct for intensity differences due to the procedure
- Methods
- Global adjustment
- Intensity dependent
- Housekeeping normalization
- Print run normalizations
- Between slides
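As a rough illustration of the global adjustment method listed above, the sketch below shifts the log2 ratios of one slide so that their median is zero. The function name, the choice of the median as the centring statistic, and the intensity values are assumptions made for this example, not details taken from the slides.

```python
import math
from statistics import median

def global_median_normalize(red, green):
    """Return log2(R/G) ratios shifted so that their median is zero."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)
    return [m - shift for m in log_ratios]

# Hypothetical intensities: the red channel reads about twice as bright overall,
# and only the last spot differs from that global trend.
red = [200.0, 400.0, 800.0, 1600.0, 6400.0]
green = [100.0, 200.0, 400.0, 800.0, 800.0]
normalized = global_median_normalize(red, green)
print(normalized)
```

After the shift, spots that only follow the overall dye bias sit at zero, and the last spot keeps a log2 ratio of 2.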
4Preliminary Analysis
- General distribution information
- Scatterplots (log R vs log G)
- MA plots (M = log(R/G), plotted against the average log intensity A)
- Histograms
- Boxplots
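The quantities behind an MA plot can be computed directly from the two channel intensities. The sketch below assumes base-2 logarithms and the usual definition A = 0.5 * log2(R * G); the function name and the sample values are illustrative only.

```python
import math

def ma_values(red, green):
    """Return one (M, A) pair per spot: M = log2(R/G), A = 0.5 * log2(R*G)."""
    pairs = []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        pairs.append((log_r - log_g, 0.5 * (log_r + log_g)))
    return pairs

# Hypothetical spot with R = 8 and G = 2.
example = ma_values([8.0], [2.0])
print(example)
```

Plotting M against A makes intensity-dependent dye bias visible as a curved trend, which is harder to see in a plain log R vs log G scatterplot.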
7Histogram of distribution
8Differential Expression
- Goal: identify significant differences in gene expression from the background
- Problems
- Microarray experiments with many variables and few replicates are not well suited to traditional statistical tests
- Possible solution
- Use statistical models
- Concerns
- High variability among gene expression levels makes it difficult to identify up-regulation or down-regulation of a single gene
- False negatives and false positives
9Parametric Methods
- T-test approaches
- Bayesian approach: pool genes with similar expression levels to better estimate the variance
- Compute probability of gene expression from a uniform distribution vs a non-uniform distribution
- ANOVA
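To make the t-test approach concrete, here is a minimal sketch of a per-gene t statistic computed from replicate measurements in two conditions. It uses Welch's unequal-variance form, which is an assumption of this example; the slides do not say which variant was used, and the replicate values are invented.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical log expression values for one gene, three replicates per condition.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_stat = welch_t(control, treated)
print(round(t_stat, 4))
```

With only three replicates the per-gene variance estimate is unstable, which is the motivation for the pooled-variance Bayesian approach listed above.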
10t-tests Comparison
11Non-Parametric Methods
- Nonparametric t-test
- Permutations
- Not very sensitive
- Wilcoxon rank sum test
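The permutation idea can be sketched as follows: reassign the condition labels in every possible way and count how often the relabelled difference in means is at least as large as the observed one. The function name and the data are assumptions for illustration; real analyses usually sample permutations rather than enumerate them.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    count = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        # Small tolerance so that ties are not lost to rounding.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            count += 1
        total += 1
    return count / total

# Hypothetical gene with perfectly separated replicates.
p_value = permutation_p_value([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(p_value)
```

Even with perfectly separated groups, three replicates per condition allow only 20 relabellings, so the p-value cannot drop below 0.1. This is one way to read the "not very sensitive" remark above.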
12Xu and Li, Bioinformatics Vol 19(10) 2003, p1284-1289
13CLUSTER ANALYSIS
- Genes that belong to a particular pathway are generally co-regulated and have similar expression patterns
- Correlating gene expression changes to identify sets of genes with similar profiles
- Method of organizing gene expression data and developing phylogenetic trees
14Terminology
- Vector: position of the gene in expression space, a geometric coordinate
- Expression space: the log2 ratios from an experiment with n arrays, plotted in n dimensions
- Clustering algorithms group genes based on their separation in expression space
15 - Mean centring: subtract the basal expression level of a gene from each experimental measurement
- Distance metric: measure of distance between gene expression vectors
- Metric distances
- Semi-metric distances
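A short sketch of mean centring follows. It assumes that the basal expression level of a gene is estimated by the mean of its measurements across arrays, which the slides do not state explicitly; the profile values are invented.

```python
from statistics import mean

def mean_centre(profile):
    """Subtract the gene's mean expression from each of its measurements."""
    base = mean(profile)
    return [value - base for value in profile]

# Hypothetical log2 ratios for one gene across three arrays.
centred = mean_centre([1.0, 2.0, 3.0])
print(centred)
```

After centring, two genes with the same shape of response but different baseline levels have similar vectors, so a distance metric compares the pattern rather than the offset.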
16Distance Metrics
- Metric distances ($d_{ij}$ between vectors $i$ and $j$)
- $d_{ij}$ is positive definite: $d_{ij} \ge 0$
- $d_{ij}$ is symmetric: $d_{ij} = d_{ji}$
- Zero distance from itself: $d_{ii} = 0$
- Triangle rule: $d_{ik} \le d_{ij} + d_{jk}$
- Euclidean distance: $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Semi-metric distances
- Do not obey the triangle rule
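The difference between a metric and a semi-metric distance can be checked numerically. The sketch below implements the Euclidean distance and, as an assumed example of a semi-metric, the squared Euclidean distance, which is symmetric and zero on identical vectors but can break the triangle rule. The three vectors are made up for the demonstration.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def squared_euclidean(x, y):
    """Squared Euclidean distance: a semi-metric, not a metric."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def obeys_triangle_rule(dist, i, j, k):
    """Check d(i, k) <= d(i, j) + d(j, k) for one triple of vectors."""
    return dist(i, k) <= dist(i, j) + dist(j, k)

# Three hypothetical one-dimensional expression vectors on a line.
a, b, c = (0.0,), (1.0,), (2.0,)
print(obeys_triangle_rule(euclidean, a, b, c))
print(obeys_triangle_rule(squared_euclidean, a, b, c))
```

For these points the squared distance from a to c is 4, which exceeds the sum 1 + 1 through b, so the triangle rule fails for the semi-metric while it holds for the Euclidean distance.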
17Classification of Clustering Algorithms
- Hierarchical
- Produces a hierarchy of clusters
- Non-Hierarchical
- Partitions objects into different clusters
- Divisive
- Breaks down parent cluster into smaller clusters
- Agglomerative
- Fuses single-member clusters
18Classification continued
- Supervised
- Use known biological information to guide the algorithm
- Example: SVM
- Unsupervised
- Make use of gene expression information to reveal patterns in the data
- Examples: hierarchical, k-means, SOM, PCA
19Hierarchical Clustering
- Most widely used
- Agglomerative approach, produces a single hierarchical tree
- Typically average linkage clustering is used
- Disadvantages
- If a bad assignment is made, it cannot be corrected
- Expression patterns of individual genes become less relevant as clustering progresses
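The agglomerative procedure described above can be sketched in a few lines: start with one cluster per gene, then repeatedly fuse the two closest clusters until one remains. The version below uses average linkage with Euclidean distance; the function names and the toy data are assumptions, and the brute-force search is for clarity rather than speed.

```python
import math

def average_linkage(a, b, points):
    """Mean pairwise distance between the members of clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerate(points):
    """Return the merges as (cluster_a, cluster_b, distance), in merge order."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        fused = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(fused)
    return merges

# Four hypothetical genes measured on a single array.
merges = agglomerate([(0.0,), (1.0,), (5.0,), (6.0,)])
for step in merges:
    print(step)
```

Each merge is final, which is the first disadvantage listed above: a gene fused into the wrong cluster early on stays there for the rest of the tree.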
20Hierarchical Clustering Algorithms
- Single Linkage Clustering
- Minimum distance between members of two clusters
- Complete Linkage Clustering
- Greatest distance between members of relevant clusters
- Average Linkage Clustering
- UPGMA
- Centroid or Median
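The linkage rules differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance and invented cluster members, with a hypothetical `linkage_distance` helper:

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, method="average"):
    """Distance between two clusters under single, complete or average linkage."""
    dists = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(dists)                # closest pair of members
    if method == "complete":
        return max(dists)                # farthest pair of members
    if method == "average":
        return sum(dists) / len(dists)   # mean over all pairs (UPGMA)
    raise ValueError(f"unknown linkage method: {method}")

# Two hypothetical clusters of one-dimensional expression vectors.
left = [(0.0,), (1.0,)]
right = [(3.0,), (5.0,)]
for name in ("single", "complete", "average"):
    print(name, linkage_distance(left, right, name))
```

Centroid and median linkage are omitted here because they compare cluster summaries rather than pairwise member distances.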
21Visualization of Hierarchical clustering
- average linkage
- complete linkage
- single linkage
22k-means Clustering
- Used when the number of clusters is known from prior biological knowledge
- Partitions objects into a fixed number (k) of clusters
- No dendrograms produced
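A bare-bones k-means loop is sketched below: assign each gene to its nearest centroid, recompute the centroids, and stop when assignments no longer change. Taking the first k points as starting centroids is a simplification assumed here to keep the example deterministic; practical implementations use random or smarter initialization.

```python
import math

def kmeans(points, k, iterations=100):
    """Return (assignment, centroids) after Lloyd-style k-means iterations."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic start
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no gene changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Four hypothetical genes measured on two arrays, forming two obvious groups.
genes = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centroids = kmeans(genes, k=2)
print(assignment, centroids)
```

The output is a flat partition into k groups, with no tree relating the clusters to one another.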
23Self-organizing Maps
- Neural network based divisive clustering
- Assigns genes to partitions based on the similarity of their expression vectors to pre-defined reference vectors
- Requires prior knowledge
24Principal Component Analysis
- Provides a projection of complex data sets onto a reduced, easily visualized space
- Difficult to accurately define the precise boundaries of distinct clusters in the data
- Powerful when used with another classification technique such as k-means clustering or SOM
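To show what the projection involves, the sketch below finds the first principal component of a small data set by power iteration on the sample covariance matrix. This is one simple way to obtain the leading component and is an assumption of the example, not the method used in the slides; it also assumes the starting vector is not orthogonal to that component.

```python
import math

def first_principal_component(rows, iterations=200):
    """Unit vector along the direction of greatest variance in rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centred = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(dims)]
           for a in range(dims)]
    vec = [1.0] + [0.0] * (dims - 1)  # assumed starting direction
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break  # start vector carried no variance; give up
        vec = [v / norm for v in nxt]
    return vec

# Hypothetical genes whose two measurements rise together.
component = first_principal_component([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
print(component)
```

Projecting each gene onto the first two or three such components gives the reduced space in which clusters found by k-means or SOM can be inspected visually.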
25PCA and Hierarchical analysis
- Hierarchical (average linkage) clustering
- PCA
- Quackenbush J
26Supervised Methods - SVM
- Require prior information about which genes are expected to cluster together
- Support Vector Machines (SVM)
- Binary classifier
- Kernel function
- Classification of samples
- Molecular expression fingerprinting
27Comparison of clustering methods
- Hierarchical clustering with correlation
- Clustering by k-means
- Diana
- Fanny
- Model-based clustering
- Hierarchical clustering with partial least squares
28Comparison of clustering methods
- Data used
- Transcriptional program of sporulation in budding yeast (Chu et al)
- 7 time points
- 4975 genes
- Simulated data (Quackenbush)
- 10 time points
- 450 genes
29Hierarchical and k-means clustering
- Hierarchical clustering
- Produces a hierarchy of clusters
- Average linkage clustering (UPGMA)
- d(x, y) = 1 - corr(x, y)
- corr(x, y): statistical correlation between expression profiles of x and y
- K-means clustering
- Number of clusters determined using hierarchical clustering
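The correlation-based distance d(x, y) = 1 - corr(x, y) can be written out as below, taking corr to be the Pearson correlation. The helper names and the three profiles are illustrative assumptions.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """d(x, y) = 1 - corr(x, y): 0 for identical shapes, 2 for opposite ones."""
    return 1.0 - pearson(x, y)

# Hypothetical time courses: one rising, one rising twice as fast, one falling.
rising = [1.0, 2.0, 3.0]
rising_fast = [2.0, 4.0, 6.0]
falling = [3.0, 2.0, 1.0]
print(correlation_distance(rising, rising_fast))
print(correlation_distance(rising, falling))
```

Because correlation ignores scale and offset, two genes with the same temporal shape but different amplitudes are treated as close, which suits time-course data such as the sporulation set.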
30Diana
- Divisive clustering method
- Cluster $C$ with cardinality $n(C)$
- $x_1 \in C$
- Splits into
- $\{x_1, \dots, x_k\}$ and
- $C \setminus \{x_1, \dots, x_k\}$
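One splitting step of a Diana-style divisive method can be sketched as follows, based on a common description of the algorithm: seed a splinter group with the object that is on average farthest from the others, then keep moving objects that are closer to the splinter group than to the remainder. The function name, the Euclidean distance, and the data are assumptions of this example.

```python
import math

def diana_split(points):
    """Split one cluster into (splinter, remainder) lists of point indices."""
    def avg_dist(i, group):
        others = [j for j in group if j != i]
        if not others:
            return 0.0
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    remainder = list(range(len(points)))
    # Seed the splinter group with the point farthest, on average, from the rest.
    seed = max(remainder, key=lambda i: avg_dist(i, remainder))
    splinter = [seed]
    remainder.remove(seed)
    while len(remainder) > 1:
        # A positive gain means the point sits closer to the splinter group.
        gains = {i: avg_dist(i, remainder) - avg_dist(i, splinter)
                 for i in remainder}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        splinter.append(best)
        remainder.remove(best)
    return splinter, remainder

# Five hypothetical genes measured on a single array.
splinter, remainder = diana_split([(0.0,), (1.0,), (10.0,), (11.0,), (12.0,)])
print(splinter, remainder)
```

Applying the same step recursively to the larger resulting cluster produces the top-down hierarchy.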
31Fanny
- Fuzzy logic
- Uses L1 (Manhattan) distance
- Computes probability vectors for all genes that minimize the objective function
- Assigns genes to the group with highest probability
32Model-based clustering
- Modeling expression profiles by mixtures of multivariate normal distributions
- Density function
- Likelihood of genes with expression profiles $x_1, \dots, x_n$
- Group levels obtained by maximum likelihood, maximizing L jointly over the model parameters and the group levels
33Hierarchical clustering with partial least squares
- $x_i$ is the vector of expression ratios for the $i$th gene
- Partial least squares model of $x_1$ on $x_2, \dots, x_M$
- Symmetrized coefficient for any gene pair $(i, j)$, $i \ne j$
34Validation
- Average proportion of non-overlap measure
- Average distance between means measure
- Average distance measure
- K: number of clusters
- Expect small values
35Results for sporulation data
36Results of sporulation data
- Average distance between means measure
- Average distance measure
37Results for simulated data
- Hierarchical clustering performed poorly
- Diana and Model based performed best
- K-means and Fanny performed well
38Model profile
- Training set of seven temporal classes
- Model temporal profile
- For each class, the average log expression ratio of all genes in that class is plotted over the seven time points
- Same for 5 algorithms
- Fanny not used
39Comparison with model profiles
40Comparison with model profiles
41Comparison with model profiles
42Objective comparison
- Distance measure
- Total distance from model profile
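One possible reading of the total distance from the model profile is sketched below: sum, over classes, the Euclidean distance between the average profile of a cluster and the model profile of the class it is matched to. The matching of clusters to classes, the choice of Euclidean distance, and all values are assumptions; the slides do not give the exact definition.

```python
import math

def total_distance(cluster_profiles, model_profiles):
    """Sum of distances between matched cluster and model temporal profiles."""
    return sum(
        math.dist(cluster, model)
        for cluster, model in zip(cluster_profiles, model_profiles)
    )

# Hypothetical average profiles over two time points for two classes.
clusters = [(0.0, 0.0), (1.0, 1.0)]
models = [(3.0, 4.0), (1.0, 1.0)]
score = total_distance(clusters, models)
print(score)
```

Under this reading a smaller total indicates that an algorithm's clusters track the model temporal profiles more closely.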
43Discussion
- End result is dependent on clustering method
- Choose based on
- Visual plots
- Validation techniques
- Compare with model profiles
- Application of more than one technique
44Microarray Software
- Assists in management and organization of the data and helps in the data analysis process
- Commercial and open-source packages available
- Popular packages
- GeneSpring
- Genesis
- MIDAS
45A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules (Stuart et al, Science Vol. 302 2003)
- Evolutionarily divergent organisms
- Homo sapiens, Drosophila, C. elegans, S. cerevisiae
- Importance
- Only genes with orthologs were used to build networks between species
- Using multiple species was found to give tighter and more exclusive clusters
46Gene-Coexpression Networks
- Methods
- Group genes by BLAST
- Identify gene groups that are functionally related using prior microarray data
- Identify genes with significant Pearson correlations, indicating conserved coexpression
- Create a coexpression network based on the correlations
- Validate by permutation and by adding noise
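The network-building step can be illustrated with a much simplified sketch: compute the Pearson correlation for every pair of genes and link the pairs whose correlation passes a cutoff. The fixed threshold, the gene names, and the profiles are assumptions for illustration; the study tested significance across several organisms rather than applying a single cutoff.

```python
import math
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(profiles, threshold=0.9):
    """Return (gene_a, gene_b, r) for every pair with correlation >= threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges

# Hypothetical profiles: geneA and geneB rise together, geneC falls.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = coexpression_edges(profiles)
print([(a, b) for a, b, _ in edges])
```

Genes that end up linked are candidates for involvement in the same process, which is how the network is used to suggest functions for uncharacterized genes.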
47Stuart et al, Science Vol 302, Oct 10, 2003, p249-255
48Stuart et al, Science Vol 302, Oct 10, 2003, p249-255
49Gene-Coexpression Networks
- Results
- Identification of genes involved in particular biological processes (previously unknown), because genes linked in the network are likely to be involved in similar processes
- Selective forces impacting interconnected classes of genes over evolution
50References
- Baggerly et al, J. of Comp Bio Vol 8(6) 2001, p639-659
- Chen, G. et al. (2002) Evaluation and comparison of clustering algorithms in analyzing ES cell gene expression data. Statistica Sinica, 12, 241-262.
- Dozmorov and Centola, Bioinformatics Vol 19(2) 2003, p204-211
- Hatfield et al, Molecular Microbiology Vol 47(4) 2003, p871-877
- Kerr, M.K. and Churchill, G.A. (2001) Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl Acad. Sci. USA, 98, 8961-8965.
- Pan, Bioinformatics Vol 19(11) 2003, p1333-1340
- Sambrook J (ed), DNA Microarrays: A Molecular Cloning Manual
- Slonim D, Nature Genetics Vol 32 2002, p502-508
- Stuart et al, Science Vol 302 2003, p249-255
- Troyanskaya et al, Bioinformatics Vol 18(11) 2002, p1454-1461
- Tsodikov et al, Bioinformatics Vol 18(2) 2002, p251-260
- Wilson et al, Bioinformatics Vol 19(11) 2003, p1325-1332
51References
- Datta, S. and Datta, S. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics, 19, 459-466.
- Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863-14868.
- McLachlan, G.J., Bean, R.W. and Peel, D. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 1-10.
- Quackenbush, J. (2001) Computational analysis of microarray data. Nat. Rev. Genet., 2, 418-427.
- Waddell, P. and Kishino, H. (2000) Cluster inference methods and graphical models evaluated on NCI60 microarray gene expression data. Genome Informatics, 11, 129-140.
- Xu and Li, Bioinformatics Vol 19(10) 2003, p1284-1289
- Yeung, K., Haynor, D.R. and Ruzzo, W.L. (2001) Validating clustering for gene expression data. Bioinformatics, 17, 309-318.